Rate Selection
By the time rate selection runs, each ROID has dozens of candidate rates across raw, transformed, and imputed columns — all scored 0–7. Rate selection picks the one winner per ROID and writes it to canonical_rate.
Overview
How All Sources Come Together
Rate selection reads from tmp_int_accuracy_brit — the final wide table produced by BRIT accuracy. Every rate column and its _validation_score counterpart are present on a single row per ROID. Rate selection does not join anything new; it just scans what's already there.
Selection sequence:
- BRIT Accuracy Table — One row per ROID. Every rate column (raw, transformed, imputed) with its
_validation_score. Input to rate selection. →tmp_int_accuracy_brit - Build Rate Arrays — Collect all non-NULL rate columns into parallel arrays: rate values, scores, sources, types, methodologies. Array position encodes priority for tiebreaking. → (CTE inside
combined_main.sql) - Select Canonical Rate —
array_max(rate_score_array)finds the highest score.ARRAY_POSITIONextracts the index. Allcanonical_*fields are populated from that index. →tmp_int_combined_no_whisp - Add Whispers — Optional enrichment layer. Summary statistics at payer/network/provider/code aggregation levels are joined in. →
tmp_int_combined
Canonical Rate Score (1–5)
After selection, the internal 0–7 accuracy score is translated into a user-facing 1–5 scale:
| Canonical Score | Internal Score | Meaning |
|---|---|---|
| 5 | 7.x | Payer + hospital MRF independently agreed within ±20% |
| 4 | 6.x | Single MRF source, within outlier bounds (not cross-validated) |
| 3 | 4.x or 5.x | Within bounds, no counterparty validation |
| 2 | 2.x or 3.x | Imputed estimate (enhanced or benchmark-backed) |
| 1 | 1.x | Outlier — outside all acceptable bounds, last resort |
| 0 | 0 | No rate — ROID has no usable data from any source |
QA tip: When checking CLD output, always filter on canonical_rate_score > 1 to exclude outliers, or canonical_rate_score > 0 to exclude empty ROIDs. A NULL canonical_rate means no data was found at any tier — the ROID still exists in the output, just with nulls.
Output Columns
| Column | Meaning | Example |
|---|---|---|
canonical_rate | The selected dollar amount | 18500.00 |
canonical_rate_score | 1–5 confidence (user-facing) | 5 |
canonical_rate_source | Which source won ("payer", "hospital", "imputation", "payer_hospital") | "payer_hospital" |
canonical_rate_type | Exact rate column that won, prefixed by category | "raw: payer_negotiated_rate" |
canonical_contract_methodology | Contract methodology of the winning rate | "Fee Schedule" |
canonical_rate_class | "Raw", "Transform", or "Impute" | "Raw" |
canonical_gross_charge_type | Gross charge source used (for enhanced rates) | "mrf_gross_charge_cbsa_median" |
Rate Type Categories
Every rate in the CLD output belongs to one of four categories. These categories reflect how defensible the rate is: where the data came from and how much estimation was involved.
| Category | Source | Description | Typical canonical_rate_type prefix |
|---|---|---|---|
| Posted Rates | Payer MRF, Hospital MRF | Fixed dollar amounts from MRF files. No estimation involved. The most defensible category. | raw: payer_*, raw: hospital_*_dollar, impute: msdrg_mrf_base_rate_* |
| Real-World Rates | Payer MRF, Hospital MRF, Komodo claims | Variable-rate provisions where all components come from reported data (e.g. 90% of charges × hospital's own reported charge). Rate may fluctuate claim-to-claim but is fully grounded in source data. | raw: hospital_*_allowed_amount, transform: hosp_per_diem_mult_glos, transform: *_gc_hosp_perc_to_dol |
| Enhanced Rates | MRF provision + estimated benchmark component | MRF reports one component (e.g. 90% of charges) but not the charge itself. Turquoise fills in the missing component from Komodo or geographic averages. | transform: *_gc_komodo_perc_to_dol, transform: *_gc_hosp_cbsa_perc_to_dol, most impute:* |
| Benchmark Rates | Medicare reference data | Entirely derived from reference or benchmark data — no reliable information from MRFs was available. Last resort. | benchmark_* columns |
Tie-Breaking Priority
When two rates have the same accuracy score, the category hierarchy determines the winner:
| Priority | Category | Rationale |
|---|---|---|
| 1 | Posted Rates | All components from reported data, no estimation |
| 2 | Real-World Rates | All components from reported data, but rate varies by claim |
| 3 | Enhanced Rates | Provision reported, but one component estimated from benchmarks |
| 4 | Benchmark Rates | Entirely benchmark-derived — most speculative |
This tie-breaking matters primarily during orchestrator merging, when multiple sub-version runs are combined. Within a single sub-version run, the CDF-based decimal tiebreaker typically produces unique scores — ties are rare.
canonical_rate_type records the exact rate column that won, with a category prefix: raw: payer_negotiated_rate, transform: hospital_perc_of_total_billed_charges_gc_hosp_perc_to_dol, impute: msdrg_base_rate_mult_cms_weight. Users can filter or pivot by rate type prefix to understand what fraction of their coverage comes from each tier of data quality.
Sub-version merging: The orchestrator runs N sub-DAGs (one per month) then merges them. When merging, a validated rate (score 7) from an older sub-version beats an unvalidated rate (score 6) from a newer one. Same-score ties across sub-versions use the rate type hierarchy above.
Rate Arrays
Parallel arrays per ROID, with rate values, accuracy scores, and metadata in corresponding positions. Array order defines priority for same-score tiebreaking.
Array Structure
Each ROID gets parallel arrays of the same length. Position [i] in each array corresponds to the same rate column:
| Array | Content at [i] | Example |
|---|---|---|
rate_array | Dollar value | 125.00 |
rate_score_array | Accuracy score (0–7.x) | 7.00000125 |
source_array | Data source label | "payer" |
rate_type_array | Column name with prefix | "raw: payer_negotiated_rate" |
methodology_array | Contract methodology | "Fee Schedule" |
rate_class_array | "Raw", "Transform", or "Impute" | "Raw" |
Priority Ordering (Inpatient vs Non-Inpatient)
Array position matters when multiple rates have identical scores. The ordering differs by bill type:
Inpatient
- Hospital raw rates
- Hospital untransformed
- Payer raw rates
- Payer untransformed
- Hospital transforms (provider GC)
- Payer transforms (provider GC)
- Hospital transforms (CBSA GC)
- Payer transforms (CBSA GC)
- State-level transforms
- Imputations
Non-Inpatient
Similar but:
- Untransformed rates get slight priority
- Provider-level GC before CBSA
- CBSA before state
- Hospital source before payer source within each tier
The arrays are also built without imputations (*_no_impute variants) for intermediate tracking — e.g., best_idx_no_impute records what the canonical rate would be if no imputations existed.
Selection Algorithm
The core algorithm: find max score, use its array index to select rate and all metadata.
Special Rules
- Score = 7 → "payer_hospital": When the best score is 7.x (counterparty validated),
canonical_rate_sourceis always set to"payer_hospital"regardless of which column won — because both sources agreed and the label should reflect that - Multiple best indices:
best_payer_idx,best_hospital_idx, andbest_idx_no_imputeare tracked separately alongsidebest_idxfor downstream analysis and QA - Gross charge type derivation:
canonical_gross_charge_typeis inferred from the rate_type name — e.g., a column ending ingc_hosp_cbsa_perc_to_dol→ "mrf_gross_charge_cbsa_median" - NULL rates: ROIDs where every column scores 0 get
canonical_rate = NULLandcanonical_rate_score = 0. The ROID row is preserved — a NULL canonical rate means a genuine coverage gap, not a missing row
Determinism: The algorithm is fully deterministic. Given the same input arrays, it always produces the same canonical rate. The CDF and rate/1e8 tiebreakers ensure no two rates have exactly the same score in practice.
Walk-Through Example
Step-by-step canonical rate selection for two contrasting ROIDs.
Example A — Cross-validated: Code 99213, Provider X, Payer Y
Available rates:
- Payer negotiated rate: $125
- Hospital case rate: $120
- Payer pct-to-dol transform: 160 gross charge)
- Imputed rate: $115
Accuracy scoring:
- Payer 120]: |120|=125=$25 → MATCH → score = 7 + 125/1e8 = 7.00000125
- Hospital 125]: |125|=120=$24 → MATCH → score = 7.00000120
- Transform $130: no counterparty match, within bounds, CDF=0.54 → score = 6.54
- Imputed $115: within bounds, CDF=0.23 → score = 4.23
Rate arrays (in order):
rate_array: [$125, $120, $130, $115 ]
rate_score_array: [7.00000125, 7.00000120, 6.54, 4.23 ]
source_array: ["payer", "hospital", "payer", "imputation"]
rate_type_array: ["raw: payer_negotiated_rate", "raw: hospital_case_rate_dollar", "transform: ...", "impute: ..."]
Selection:
best_score = array_max([7.00000125, 7.00000120, 6.54, 4.23]) = 7.00000125
best_idx = ARRAY_POSITION(rate_score_array, 7.00000125) = 1
canonical_rate = $125.00
canonical_rate_score = 5 (7.x → confidence 5)
canonical_rate_source = "payer_hospital" (score=7 → both sources agreed)
canonical_rate_type = "raw: payer_negotiated_rate"
canonical_rate_class = "Raw"
The payer rate (120) because 7.00000125 > 7.00000120. Both are counterparty-validated — the tiebreaker (rate/1e8) slightly prefers the higher dollar amount. Source is labeled "payer_hospital" because score=7 means both agreed.
Example B — Outlier payer rate: Code 99213, Urgent Care, BCBS PPO
Available rates:
- Payer MRF rate: 92. This is 45× Medicare.
- Imputed rate (RC HCPCS): $155 — derived from RC family average across similar codes
Accuracy scoring:
- Payer $4,200: 45× Medicare, upper bound = 30× → OUTLIER → score = 1.xx
- Imputed $155: within bounds (1.7× Medicare), CDF=0.61 → score = 2.61
Selection: score 2.61 > score 1.xx → imputed rate wins
canonical_rate = $155.00
canonical_rate_score = 2 (imputed, within bounds)
canonical_rate_source = "imputation"
canonical_rate_type = "impute: rc_family_gc_hosp_perc_to_dol"
The outlier payer rate is preserved in the wide table (its column is not deleted) but loses to the imputation because score 1 < score 2. A user filtering on canonical_rate_score > 1 would see the imputed 4,200 outlier.