Version: 3.0

Pipeline Architecture

CLD builds a wide table, one row per ROID, by progressively adding columns across phases. Rate selection then scans those columns — guided by accuracy scores — to pick a single canonical rate per ROID.

The core idea: rows from ROS, columns from everything else

The Rate Object Space (ROS) defines the row set — every valid (payer × network × provider × code) combination CLD will try to price. Every downstream phase joins against the ROS by ROID and contributes new columns. No phase changes the row count.
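The join pattern can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CLD's implementation: the `add_columns` helper and the ROID strings are hypothetical, and NULL is modelled as `None`.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def add_columns(ros: Dict[str, Row],
                phase_output: Dict[str, Row],
                columns: List[str]) -> Dict[str, Row]:
    """Left-join one phase's output onto the ROS by ROID.

    Every ROID in `ros` is kept. ROIDs missing from `phase_output`
    get None (NULL) for the new columns. Rows are never added or dropped.
    """
    joined: Dict[str, Row] = {}
    for roid, row in ros.items():
        new_values = phase_output.get(roid, {})
        joined[roid] = {**row, **{col: new_values.get(col) for col in columns}}
    return joined


# Hypothetical ROIDs; the real identifier format is not shown on this page.
ros = {"roid_a": {}, "roid_b": {}}

# Only roid_a has payer MRF data; roid_zzz is not in the ROS at all.
payer_mrf = {
    "roid_a": {"payer_negotiated_rate": 18000.0},
    "roid_zzz": {"payer_negotiated_rate": 1.0},
}

wide = add_columns(ros, payer_mrf, ["payer_negotiated_rate"])
assert len(wide) == len(ros)  # the phase changed columns, not rows
```

Data for a ROID outside the ROS (`roid_zzz` here) is simply ignored, which is what keeps the row set fixed.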

How a single ROID accumulates data across phases:

  1. Stage 1 — Rate Object Space: A ROID is minted for (UHC, Choice Plus PPO, Mass General, MS-DRG 470, Inpatient). This row now exists in the pipeline. All downstream columns start as NULL.
  2. Stage 2 — Raw Data: UHC's payer MRF reports a negotiated rate of $18,000 for this combination. The hospital MRF reports 130% of billed charges. Both land as separate columns on this row: payer_negotiated_rate = 18000, hospital_pct_of_total_billed_charges_pct = 130. Komodo has no data for this ROID — those columns stay NULL.
  3. Stage 3 — Transformations: The 130% figure is resolved against Mass General's gross charge for MS-DRG 470 ($28,000): `hospital_perc_of_total_billed_charges_gc_hosp_perc_to_dol = 0.01 × 130 × 28000 = 36,400`. The negotiated $18,000 is already in dollars, so no transformation column is added.
  4. Stage 4 — Imputations: Two rate candidates exist, so no imputation is needed. For a different ROID where both payer and hospital columns are NULL, the imputation phase would compute a fallback estimate: imputed_rate = $21,000.
  5. Stage 5 — Accuracy: Each non-NULL rate column gets a score. The payer negotiated rate ($18,000) is validated against the hospital MRF, so its accuracy score is 7. The pct-to-dollar transform ($36,400) is not outlier-validated, so its accuracy score is 4. Columns with no data get a score of 0.
Result: Stage 6 — Rate Selection

The highest-scored non-NULL column wins: payer_negotiated_rate ($18,000, score 7) becomes canonical_rate = 18000, canonical_rate_score = 5 (external scoring scale).
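The Stage 3 arithmetic in this walkthrough can be reproduced with a small helper. The function name `pct_to_dollar` and its NULL handling are assumptions made for illustration; only the formula (0.01 × percentage × gross charge) comes from the example above.

```python
from typing import Optional


def pct_to_dollar(pct: Optional[float],
                  gross_charge: Optional[float]) -> Optional[float]:
    """Convert a percent-of-billed-charges rate to a dollar amount.

    Returns None (NULL) when either input is missing, so a ROID with no
    percentage rate or no gross charge gets no transformed column value.
    """
    if pct is None or gross_charge is None:
        return None
    # Same as 0.01 * pct * gross_charge, ordered to keep the arithmetic exact
    # for whole-number inputs.
    return pct * gross_charge / 100


# Mass General, MS-DRG 470: 130% of a $28,000 gross charge.
transformed = pct_to_dollar(130, 28000)

# Komodo-style gap: no percentage rate available for this ROID.
missing = pct_to_dollar(None, 28000)
```

With the walkthrough's inputs, `transformed` is 36,400, matching the value stored in `hospital_perc_of_total_billed_charges_gc_hosp_perc_to_dol`.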

The wide table structure

By the time rate selection runs, each ROID row has dozens of populated or NULL columns — one per source, method, and gross charge variant. Here's a simplified view of the column groups:

| Phase | Column group | Example columns | Accuracy score column |
| --- | --- | --- | --- |
| Raw — Payer MRF | One per negotiated_type | payer_negotiated_rate, payer_fee_schedule_rate, payer_percentage_rate, payer_derived_rate | payer_negotiated_rate_validation_score, payer_fee_schedule_rate_validation_score, … |
| Raw — Hospital MRF | One per contract_methodology × amount type | hospital_fee_schedule_dollar, hospital_pct_of_total_billed_charges_pct, hospital_per_diem_rate, hospital_case_rate_dollar | hospital_fee_schedule_dollar_validation_score, hospital_pct_of_total_billed_charges_pct_validation_score, … |
| Transformations — Pct-to-Dollar | 6 rate types × 6 gross charge sources | payer_gc_hosp_perc_to_dol, hospital_perc_of_total_billed_charges_gc_hosp_perc_to_dol, hospital_perc_of_total_billed_charges_gc_komodo_cbsa_perc_to_dol | payer_gc_hosp_perc_to_dol_validation_score, … |
| Transformations — Drug / Anesthesia | Drug dosage methods; anesthesia per negotiated_type | drug_dosage_std_dollar, drug_dosage_std_ndc_dollar, payer_negotiated_rate_anesthesia_cf | drug_dosage_std_dollar_validation_score, … |
| Imputations | One per imputation tier | imputed_rate, imputed_rate_rc, imputed_rate_cstm | imputed_rate_validation_score, imputed_rate_rc_validation_score, … |
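Every example in the table pairs a rate column with a score column named by appending `_validation_score`. Assuming that convention holds for all rate columns (the table only shows a sample), the pairing can be expressed as a one-line helper. The function name is hypothetical.

```python
def score_column(rate_column: str) -> str:
    """Name of the accuracy score column paired with a rate column.

    Assumes the `<rate_column>_validation_score` convention seen in the
    examples; columns not listed in the table may differ.
    """
    return f"{rate_column}_validation_score"


# Pair a few rate columns from the table with their score columns.
pairs = {
    col: score_column(col)
    for col in ("payer_negotiated_rate", "imputed_rate_rc", "drug_dosage_std_dollar")
}
```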

Rate selection: one winner per ROID

Rate selection scans all scored columns for a ROID and picks the one with the highest accuracy score. Ties are broken by source preference (payer MRF over hospital MRF over imputation). The winner is written to canonical_rate, and the selection is recorded in canonical_rate_source and canonical_rate_subversion for full traceability.

ROIDs where every column scored 0 (no data at all) get canonical_rate = NULL. They still appear in the output — the ROS row is preserved — with NULL rate columns indicating a genuine coverage gap.
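The selection rule, including the tie-break and the all-zero case, can be sketched as follows. This is an illustration under assumptions, not the production logic: the `Candidate` type, the `source` labels, and the choice to store the winning column name in `canonical_rate_source` are hypothetical. The page does not say where transformed columns fall in the tie-break order, so this sketch attributes each candidate to the source its data came from. The mapping to the external scoring scale is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Tie-break order from the text: payer MRF, then hospital MRF, then imputation.
SOURCE_PREFERENCE = ["payer_mrf", "hospital_mrf", "imputation"]


@dataclass
class Candidate:
    column: str             # rate column name
    value: Optional[float]  # None models NULL
    score: int              # accuracy score, 0 means no data
    source: str             # one of SOURCE_PREFERENCE (assumed labels)


def select_rate(candidates: List[Candidate]) -> Tuple[Optional[float], Optional[str]]:
    """Return (canonical_rate, canonical_rate_source) for one ROID."""
    usable = [c for c in candidates if c.value is not None and c.score > 0]
    if not usable:
        # Coverage gap: the row is kept, but the canonical rate stays NULL.
        return None, None
    # Highest score first; among equal scores, the preferred source wins.
    winner = min(usable, key=lambda c: (-c.score, SOURCE_PREFERENCE.index(c.source)))
    return winner.value, winner.column


# The ROID from the walkthrough earlier on this page.
row = [
    Candidate("payer_negotiated_rate", 18000.0, 7, "payer_mrf"),
    Candidate("hospital_perc_of_total_billed_charges_gc_hosp_perc_to_dol", 36400.0, 4, "hospital_mrf"),
    Candidate("imputed_rate", None, 0, "imputation"),
]
canonical_rate, canonical_rate_source = select_rate(row)

# A ROID with no data in any column.
empty_rate, empty_source = select_rate([Candidate("imputed_rate", None, 0, "imputation")])
```

For the walkthrough ROID this picks `payer_negotiated_rate` at 18000.0, and the empty ROID comes back as `(None, None)`.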

Pipeline flow

The pipeline processes stages sequentially:

  1. Rate Object Space — Defines all ROIDs. Every row starts with all rate columns = NULL. → tmp_rate_object_space
  2. Raw Data — Payer MRF, Hospital MRF, Komodo, Gross Charges — each adds columns. Most ROIDs have at most a few non-NULL columns. → tmp_int_combined_raw
  3. Transformations — Pct-to-dollar (36 columns), per diem GLOS, drug dosage (3 methods), anesthesia (payer-specific). Additive — raw columns preserved. → tmp_int_transformations
  4. Imputations — Estimates for ROIDs with no raw or transformed rate. Falls back through a tier hierarchy. → tmp_int_imputations
  5. Accuracy — A score (0–7) is assigned to each non-NULL rate column. Scores encode data quality, outlier status, and counterparty validation. → tmp_int_accuracy_brit
  6. Rate Selection — Highest-scored column per ROID wins. Written to canonical_rate + canonical_rate_source. → prod_combined_abridged / prod_combined_all