Version: 3.0

Rate Selection

By the time rate selection runs, each ROID has dozens of candidate rates across raw, transformed, and imputed columns — all scored 0–7. Rate selection picks the one winner per ROID and writes it to canonical_rate.

Overview

How All Sources Come Together

Rate selection reads from tmp_int_accuracy_brit — the final wide table produced by BRIT accuracy. Every rate column and its _validation_score counterpart are present on a single row per ROID. Rate selection does not join anything new; it just scans what's already there.

Selection sequence:

BRIT Accuracy Table — One row per ROID. Every rate column (raw, transformed, imputed) with its _validation_score. Input to rate selection. → tmp_int_accuracy_brit
Build Rate Arrays — Collect all non-NULL rate columns into parallel arrays: rate values, scores, sources, types, methodologies. Array position encodes priority for tiebreaking. → (CTE inside combined_main.sql)
Select Canonical Rate — array_max(rate_score_array) finds the highest score. ARRAY_POSITION extracts the index. All canonical_* fields are populated from that index. → tmp_int_combined_no_whisp
Add Whispers — Optional enrichment layer. Summary statistics at payer/network/provider/code aggregation levels are joined in. → tmp_int_combined

Canonical Rate Score (1–5)

After selection, the internal 0–7 accuracy score is translated into a user-facing 1–5 scale:

Canonical Score	Internal Score	Meaning
5	7.x	Payer + hospital MRF independently agreed within ±20%
4	6.x	Single MRF source, within outlier bounds (not cross-validated)
3	4.x or 5.x	Within bounds, no counterparty validation
2	2.x or 3.x	Imputed estimate (enhanced or benchmark-backed)
1	1.x	Outlier — outside all acceptable bounds, last resort
0	0	No rate — ROID has no usable data from any source

info

QA tip: When checking CLD output, always filter on canonical_rate_score > 1 to exclude outliers, or canonical_rate_score > 0 to exclude empty ROIDs. A NULL canonical_rate means no data was found at any tier — the ROID still exists in the output, just with nulls.

Output Columns

Column	Meaning	Example
`canonical_rate`	The selected dollar amount	18500.00
`canonical_rate_score`	1–5 confidence (user-facing)	5
`canonical_rate_source`	Which source won ("payer", "hospital", "imputation", "payer_hospital")	"payer_hospital"
`canonical_rate_type`	Exact rate column that won, prefixed by category	"raw: payer_negotiated_rate"
`canonical_contract_methodology`	Contract methodology of the winning rate	"Fee Schedule"
`canonical_rate_class`	"Raw", "Transform", or "Impute"	"Raw"
`canonical_gross_charge_type`	Gross charge source used (for enhanced rates)	"mrf_gross_charge_cbsa_median"

Rate Type Categories

Every rate in the CLD output belongs to one of four categories. These categories reflect how defensible the rate is: where the data came from and how much estimation was involved.

Category	Source	Description	Typical `canonical_rate_type` prefix
Posted Rates	Payer MRF, Hospital MRF	Fixed dollar amounts from MRF files. No estimation involved. The most defensible category.	`raw: payer_`, `raw: hospital__dollar`, `impute: msdrg_mrf_base_rate_*`
Real-World Rates	Payer MRF, Hospital MRF, Komodo claims	Variable-rate provisions where all components come from reported data (e.g. 90% of charges × hospital's own reported charge). Rate may fluctuate claim-to-claim but is fully grounded in source data.	`raw: hospital__allowed_amount`, `transform: hosp_per_diem_mult_glos`, `transform: _gc_hosp_perc_to_dol`
Enhanced Rates	MRF provision + estimated benchmark component	MRF reports one component (e.g. 90% of charges) but not the charge itself. Turquoise fills in the missing component from Komodo or geographic averages.	`transform: _gc_komodo_perc_to_dol`, `transform: _gc_hosp_cbsa_perc_to_dol`, most `impute:*`
Benchmark Rates	Medicare reference data	Entirely derived from reference or benchmark data — no reliable information from MRFs was available. Last resort.	`benchmark_*` columns

Tie-Breaking Priority

When two rates have the same accuracy score, the category hierarchy determines the winner:

Priority	Category	Rationale
1	Posted Rates	All components from reported data, no estimation
2	Real-World Rates	All components from reported data, but rate varies by claim
3	Enhanced Rates	Provision reported, but one component estimated from benchmarks
4	Benchmark Rates	Entirely benchmark-derived — most speculative

This tie-breaking matters primarily during orchestrator merging, when multiple sub-version runs are combined. Within a single sub-version run, the CDF-based decimal tiebreaker typically produces unique scores — ties are rare.

canonical_rate_type records the exact rate column that won, with a category prefix: raw: payer_negotiated_rate, transform: hospital_perc_of_total_billed_charges_gc_hosp_perc_to_dol, impute: msdrg_base_rate_mult_cms_weight. Users can filter or pivot by rate type prefix to understand what fraction of their coverage comes from each tier of data quality.

info

Sub-version merging: The orchestrator runs N sub-DAGs (one per month) then merges them. When merging, a validated rate (score 7) from an older sub-version beats an unvalidated rate (score 6) from a newer one. Same-score ties across sub-versions use the rate type hierarchy above.

Rate Arrays

Parallel arrays per ROID, with rate values, accuracy scores, and metadata in corresponding positions. Array order defines priority for same-score tiebreaking.

Array Structure

Each ROID gets parallel arrays of the same length. Position [i] in each array corresponds to the same rate column:

Array	Content at [i]	Example
`rate_array`	Dollar value	125.00
`rate_score_array`	Accuracy score (0–7.x)	7.00000125
`source_array`	Data source label	"payer"
`rate_type_array`	Column name with prefix	"raw: payer_negotiated_rate"
`methodology_array`	Contract methodology	"Fee Schedule"
`rate_class_array`	"Raw", "Transform", or "Impute"	"Raw"

Priority Ordering (Inpatient vs Non-Inpatient)

Array position matters when multiple rates have identical scores. The ordering differs by bill type:

Inpatient

Hospital raw rates
Hospital untransformed
Payer raw rates
Payer untransformed
Hospital transforms (provider GC)
Payer transforms (provider GC)
Hospital transforms (CBSA GC)
Payer transforms (CBSA GC)
State-level transforms
Imputations

Non-Inpatient

Similar but:

Untransformed rates get slight priority
Provider-level GC before CBSA
CBSA before state
Hospital source before payer source within each tier

The arrays are also built without imputations (*_no_impute variants) for intermediate tracking — e.g., best_idx_no_impute records what the canonical rate would be if no imputations existed.

Selection Algorithm

The core algorithm: find max score, use its array index to select rate and all metadata.

Special Rules

Score = 7 → "payer_hospital": When the best score is 7.x (counterparty validated), canonical_rate_source is always set to "payer_hospital" regardless of which column won — because both sources agreed and the label should reflect that
Multiple best indices: best_payer_idx, best_hospital_idx, and best_idx_no_impute are tracked separately alongside best_idx for downstream analysis and QA
Gross charge type derivation: canonical_gross_charge_type is inferred from the rate_type name — e.g., a column ending in gc_hosp_cbsa_perc_to_dol → "mrf_gross_charge_cbsa_median"
NULL rates: ROIDs where every column scores 0 get canonical_rate = NULL and canonical_rate_score = 0. The ROID row is preserved — a NULL canonical rate means a genuine coverage gap, not a missing row

info

Determinism: The algorithm is fully deterministic. Given the same input arrays, it always produces the same canonical rate. The CDF and rate/1e8 tiebreakers ensure no two rates have exactly the same score in practice.

Walk-Through Example

Step-by-step canonical rate selection for two contrasting ROIDs.

Example A — Cross-validated: Code 99213, Provider X, Payer Y

Available rates:

Payer negotiated rate: $125
Hospital case rate: $120
Payer pct-to-dol transform: $130 (81% ×$ 160 gross charge)
Imputed rate: $115

Accuracy scoring:

Payer $125 vs hospital [$ 120]: | $125-$ 120|= $5 ≤ 20%×$ 125=$25 → MATCH → score = 7 + 125/1e8 = 7.00000125
Hospital $120 vs payer [$ 125]: | $120-$ 125|= $5 ≤ 20%×$ 120=$24 → MATCH → score = 7.00000120
Transform $130: no counterparty match, within bounds, CDF=0.54 → score = 6.54
Imputed $115: within bounds, CDF=0.23 → score = 4.23

Rate arrays (in order):

rate_array:       [$125,        $120,         $130,        $115       ]
rate_score_array: [7.00000125,  7.00000120,   6.54,        4.23       ]
source_array:     ["payer",     "hospital",   "payer",     "imputation"]
rate_type_array:  ["raw: payer_negotiated_rate", "raw: hospital_case_rate_dollar", "transform: ...", "impute: ..."]

Selection:

best_score = array_max([7.00000125, 7.00000120, 6.54, 4.23]) = 7.00000125
best_idx   = ARRAY_POSITION(rate_score_array, 7.00000125) = 1

canonical_rate             = $125.00
canonical_rate_score       = 5    (7.x → confidence 5)
canonical_rate_source      = "payer_hospital"   (score=7 → both sources agreed)
canonical_rate_type        = "raw: payer_negotiated_rate"
canonical_rate_class       = "Raw"

info

The payer rate ( $125) wins over the hospital rate ($ 120) because 7.00000125 > 7.00000120. Both are counterparty-validated — the tiebreaker (rate/1e8) slightly prefers the higher dollar amount. Source is labeled "payer_hospital" because score=7 means both agreed.

Example B — Outlier payer rate: Code 99213, Urgent Care, BCBS PPO

Available rates:

Payer MRF rate: $4,200 — Medicare MPFS for 99213 is$ 92. This is 45× Medicare.
Imputed rate (RC HCPCS): $155 — derived from RC family average across similar codes

Accuracy scoring:

Payer $4,200: 45× Medicare, upper bound = 30× → OUTLIER → score = 1.xx
Imputed $155: within bounds (1.7× Medicare), CDF=0.61 → score = 2.61

Selection: score 2.61 > score 1.xx → imputed rate wins

canonical_rate        = $155.00
canonical_rate_score  = 2    (imputed, within bounds)
canonical_rate_source = "imputation"
canonical_rate_type   = "impute: rc_family_gc_hosp_perc_to_dol"

info

The outlier payer rate is preserved in the wide table (its column is not deleted) but loses to the imputation because score 1 < score 2. A user filtering on canonical_rate_score > 1 would see the imputed $155 and not the$ 4,200 outlier.

Overview​

How All Sources Come Together​

Canonical Rate Score (1–5)​

Output Columns​

Rate Type Categories​

Tie-Breaking Priority​

Rate Arrays​

Array Structure​

Priority Ordering (Inpatient vs Non-Inpatient)​

Inpatient​

Non-Inpatient​

Selection Algorithm​

Special Rules​

Walk-Through Example​

On this page:

Overview

How All Sources Come Together

Canonical Rate Score (1–5)

Output Columns

Rate Type Categories

Tie-Breaking Priority

Rate Arrays

Array Structure

Priority Ordering (Inpatient vs Non-Inpatient)

Inpatient

Non-Inpatient

Selection Algorithm

Special Rules

Walk-Through Example