Version: 3.0

Pipeline Architecture

CLD builds a wide table, one row per ROID, by progressively adding columns across phases. Rate selection then scans those columns — guided by accuracy scores — to pick a single canonical rate per ROID.

The core idea: rows from ROS, columns from everything else

The Rate Object Space (ROS) defines the row set — every valid (payer × network × provider × code) combination CLD will try to price. Every downstream phase joins against the ROS by ROID and contributes new columns. No phase changes the row count.
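A minimal sketch of this join pattern, using plain Python dictionaries in place of tables. The helper name `add_columns` and the sample ROID keys are hypothetical, and the real pipeline presumably expresses this as table joins. The sketch only illustrates the invariant: a phase can fill columns on existing ROS rows, but it cannot add or drop rows.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def add_columns(ros: Dict[str, Row],
                phase_output: Dict[str, Row],
                columns: List[str]) -> Dict[str, Row]:
    """Left-join a phase's output onto the ROS by ROID (hypothetical helper)."""
    result: Dict[str, Row] = {}
    for roid, row in ros.items():
        new_row = dict(row)
        source = phase_output.get(roid, {})
        for col in columns:
            # None stands in for NULL when the phase has no data for this ROID.
            new_row[col] = source.get(col)
        result[roid] = new_row
    return result


# Two ROIDs in the ROS; the phase has data for one of them plus an unknown ROID.
ros = {"roid_1": {}, "roid_2": {}}
payer_mrf = {
    "roid_1": {"payer_negotiated_rate": 18000.0},
    "roid_999": {"payer_negotiated_rate": 500.0},  # not in the ROS, so ignored
}

wide = add_columns(ros, payer_mrf, ["payer_negotiated_rate"])
assert len(wide) == len(ros)  # row count is unchanged
```

Iterating over the ROS rather than over the phase output is what keeps the row set fixed: data for a ROID that is not in the ROS never enters the table.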

How a single ROID accumulates data across phases:

Stage 1 — Rate Object Space: A ROID is minted for (UHC, Choice Plus PPO, Mass General, MS-DRG 470, Inpatient). This row now exists in the pipeline. All downstream columns start as NULL.
Stage 2 — Raw Data: UHC's payer MRF reports a negotiated rate of $18,000 for this combination. The hospital MRF reports 130% of billed charges. Both land as separate columns on this row: payer_negotiated_rate = 18000, hospital_pct_of_total_billed_charges_pct = 130. Komodo has no data for this ROID — those columns stay NULL.
Stage 3 — Transformations: The 130% figure is resolved against Mass General's gross charge for MS-DRG 470 ($28,000): `hospital_perc_of_total_billed_charges_gc_hosp_perc_to_dol = 0.01 × 130 × 28000 = 36400` ($36,400). The negotiated $18,000 is already in dollars, so no transformation column is added.
Stage 4 — Imputations: Two rate candidates exist, so no imputation is needed. For a different ROID where both payer and hospital columns are NULL, the imputation phase would compute a fallback estimate: imputed_rate = $21,000.
Stage 5 — Accuracy: Each non-NULL rate column gets a score. The payer negotiated rate ($18,000) is validated against the hospital MRF, so its accuracy score is 7. The pct-to-dollar transform ($36,400) is not outlier-validated, so its accuracy score is 4. Columns with no data get a score of 0.
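The Stage 3 arithmetic can be sketched as a small function. The name `pct_to_dollar` is hypothetical, and returning NULL when either input is missing is an assumption based on the rule that transformation columns stay NULL when there is nothing to transform. The expression `pct * gross_charge / 100` is algebraically the same as the 0.01 × pct × gross charge form used in the walkthrough.

```python
from typing import Optional


def pct_to_dollar(pct: Optional[float],
                  gross_charge: Optional[float]) -> Optional[float]:
    """Convert a percent-of-billed-charges rate into dollars (hypothetical helper)."""
    if pct is None or gross_charge is None:
        # Assumption: no percentage or no gross charge means no derived column.
        return None
    return pct * gross_charge / 100


# Values from the walkthrough: 130% of a $28,000 gross charge.
rate = pct_to_dollar(130, 28000)
print(rate)  # 36400.0
```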

Result: Stage 6 — Rate Selection

The highest-scored non-NULL column wins: payer_negotiated_rate ($18,000, score 7) becomes canonical_rate = 18000. The published canonical_rate_score is 5 because it is reported on the external scoring scale rather than the internal 0–7 accuracy scale.

The wide table structure

By the time rate selection runs, each ROID row has dozens of populated or NULL columns — one per source, method, and gross charge variant. Here's a simplified view of the column groups:

Phase	Column group	Example columns	Accuracy score column
Raw — Payer MRF	One per negotiated_type	`payer_negotiated_rate`, `payer_fee_schedule_rate`, `payer_percentage_rate`, `payer_derived_rate`	`payer_negotiated_rate_validation_score`, `payer_fee_schedule_rate_validation_score`, …
Raw — Hospital MRF	One per contract_methodology × amount type	`hospital_fee_schedule_dollar`, `hospital_pct_of_total_billed_charges_pct`, `hospital_per_diem_rate`, `hospital_case_rate_dollar`	`hospital_fee_schedule_dollar_validation_score`, `hospital_pct_of_total_billed_charges_pct_validation_score`, …
Transformations — Pct-to-Dollar	6 rate types × 6 gross charge sources	`payer_gc_hosp_perc_to_dol`, `hospital_perc_of_total_billed_charges_gc_hosp_perc_to_dol`, `hospital_perc_of_total_billed_charges_gc_komodo_cbsa_perc_to_dol`	`payer_gc_hosp_perc_to_dol_validation_score`, …
Transformations — Drug / Anesthesia	Drug dosage methods; anesthesia per negotiated_type	`drug_dosage_std_dollar`, `drug_dosage_std_ndc_dollar`, `payer_negotiated_rate_anesthesia_cf`	`drug_dosage_std_dollar_validation_score`, …
Imputations	One per imputation tier	`imputed_rate`, `imputed_rate_rc`, `imputed_rate_cstm`	`imputed_rate_validation_score`, `imputed_rate_rc_validation_score`, …

Rate selection: one winner per ROID

Rate selection scans all scored columns for a ROID and picks the one with the highest accuracy score. Ties are broken by source preference (payer MRF over hospital MRF over imputation). The winner is written to canonical_rate, and the selection is recorded in canonical_rate_source and canonical_rate_subversion for full traceability.

ROIDs where every column scored 0 (no data at all) get canonical_rate = NULL. They still appear in the output — the ROS row is preserved — with NULL rate columns indicating a genuine coverage gap.
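The selection rule can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the function `select_rate`, the source-group labels, and the choice to store the winning column name in `canonical_rate_source` are all hypothetical. The assignment of the pct-to-dollar column to the hospital MRF group is also an assumption. The sketch omits `canonical_rate_subversion` and the external score mapping because the text does not define them.

```python
from typing import Dict, Optional, Tuple

# Lower number = preferred source when accuracy scores tie (order from the text).
SOURCE_PREFERENCE = {"payer_mrf": 0, "hospital_mrf": 1, "imputation": 2}

# (rate, accuracy score, source group); group labels are hypothetical.
Candidate = Tuple[Optional[float], int, str]


def select_rate(columns: Dict[str, Candidate]) -> Dict[str, object]:
    """Pick the highest-scored non-NULL column, breaking ties by source."""
    scored = [
        (name, rate, score, source)
        for name, (rate, score, source) in columns.items()
        if rate is not None and score > 0
    ]
    if not scored:
        # Every column scored 0: keep the row, leave the rate NULL.
        return {"canonical_rate": None, "canonical_rate_source": None}
    name, rate, _, _ = min(
        scored, key=lambda c: (-c[2], SOURCE_PREFERENCE[c[3]])
    )
    return {"canonical_rate": rate, "canonical_rate_source": name}


# The ROID from the walkthrough.
row = {
    "payer_negotiated_rate": (18000.0, 7, "payer_mrf"),
    "hospital_perc_of_total_billed_charges_gc_hosp_perc_to_dol":
        (36400.0, 4, "hospital_mrf"),
    "imputed_rate": (None, 0, "imputation"),
}
result = select_rate(row)
print(result["canonical_rate"], result["canonical_rate_source"])
```

Sorting on the pair (negated score, source preference) applies both rules in one pass: score decides first, and source preference only matters among columns with equal scores.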

Pipeline flow

The pipeline processes stages sequentially:

Rate Object Space — Defines all ROIDs. Every row starts with all rate columns = NULL. → tmp_rate_object_space
Raw Data — Payer MRF, Hospital MRF, Komodo, Gross Charges — each adds columns. Most ROIDs have at most a few non-NULL columns. → tmp_int_combined_raw
Transformations — Pct-to-dollar (36 columns), per diem GLOS, drug dosage (3 methods), anesthesia (payer-specific). Additive — raw columns preserved. → tmp_int_transformations
Imputations — Estimates for ROIDs with no raw or transformed rate. Falls back through a tier hierarchy. → tmp_int_imputations
Accuracy — A score (0–7) is assigned to each rate column; columns with no data score 0. Scores encode data quality, outlier status, and counterparty validation. → tmp_int_accuracy_brit
Rate Selection — Highest-scored column per ROID wins. Written to canonical_rate + canonical_rate_source. → prod_combined_abridged / prod_combined_all
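The sequential flow can be sketched as a loop over stages, each taking the previous stage's table and producing the next. The intermediate table names come from the list above; the runner `run_pipeline` and the two toy stage functions are hypothetical stand-ins for the real phases. The check inside the loop enforces the rule that no stage changes the ROID set.

```python
from typing import Callable, Dict, List, Tuple

Table = Dict[str, dict]  # ROID -> row of columns
Stage = Callable[[Table], Table]


def run_pipeline(ros: Table,
                 stages: List[Tuple[str, Stage]]) -> Dict[str, Table]:
    """Run stages in order, keeping each intermediate table by name."""
    outputs: Dict[str, Table] = {}
    table = ros
    for output_name, stage in stages:
        table = stage(table)
        if table.keys() != ros.keys():
            raise ValueError(f"{output_name} changed the ROID set")
        outputs[output_name] = table
    return outputs


# Toy stages (hypothetical): each one only adds a column to every row.
def add_raw(table: Table) -> Table:
    return {roid: {**row, "payer_negotiated_rate": 18000.0}
            for roid, row in table.items()}


def add_accuracy(table: Table) -> Table:
    return {roid: {**row, "payer_negotiated_rate_validation_score": 7}
            for roid, row in table.items()}


ros = {"roid_1": {}}
outputs = run_pipeline(ros, [
    ("tmp_int_combined_raw", add_raw),
    ("tmp_int_accuracy_brit", add_accuracy),
])
```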
