Pipeline Architecture
CLD builds a wide table, one row per ROID, by progressively adding columns across phases. Rate selection then scans those columns — guided by accuracy scores — to pick a single canonical rate per ROID.
The core idea: rows from ROS, columns from everything else
The Rate Object Space (ROS) defines the row set — every valid (payer × network × provider × code) combination CLD will try to price. Every downstream phase joins against the ROS by ROID and contributes new columns. No phase changes the row count.
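A minimal sketch of the "rows from ROS, columns from everything else" rule, using plain Python dicts rather than the pipeline's actual tables. The `add_columns` helper, the tuple-shaped ROID key, and the sample rows (including the second DRG and the Aetna entry) are assumptions made for illustration; they are not taken from the CLD codebase.

```python
from typing import Dict, Optional, Tuple

# Hypothetical ROID key: (payer, network, provider, code). The real ROID format is not shown here.
Roid = Tuple[str, str, str, str]
Row = Dict[str, Optional[float]]


def add_columns(wide: Dict[Roid, Row], column: str, values: Dict[Roid, float]) -> None:
    """Attach one new column to every ROS row (left-join semantics)."""
    for roid, row in wide.items():
        # ROIDs missing from the source stay NULL (None).
        # Source keys that are not in the ROS are never visited, so no rows are added.
        row[column] = values.get(roid)


ros = [
    ("UHC", "Choice Plus PPO", "Mass General", "MS-DRG 470"),
    ("UHC", "Choice Plus PPO", "Mass General", "MS-DRG 871"),
]
wide: Dict[Roid, Row] = {roid: {} for roid in ros}

payer_mrf = {
    ros[0]: 18000.0,
    ("Aetna", "Open Access", "Mass General", "MS-DRG 470"): 17000.0,  # not in the ROS
}
add_columns(wide, "payer_negotiated_rate", payer_mrf)

print(len(wide))  # row count is unchanged by the join
```

The source row that has no matching ROID is dropped and the ROID with no source data keeps a NULL, which is the behavior the paragraph above describes for every downstream phase.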
How a single ROID accumulates data across phases:
- Stage 1 (Rate Object Space): A ROID is minted for (UHC, Choice Plus PPO, Mass General, MS-DRG 470, Inpatient). This row now exists in the pipeline. All downstream columns start as NULL.
- Stage 2 (Raw Data): UHC's payer MRF reports a negotiated rate of $18,000 for this combination. The hospital MRF reports 130% of billed charges. Both land as separate columns on this row: payer_negotiated_rate = 18000, hospital_pct_of_total_billed_charges_pct = 130. Komodo has no data for this ROID, so those columns stay NULL.
- Stage 3 (Transformations): The 130% figure is resolved against Mass General's gross charge for MS-DRG 470, yielding a pct-to-dollar value of 36,400 (which implies a gross charge of $28,000, since 1.30 × 28,000 = 36,400). The negotiated $18,000 is already in dollars, so no transformation column is added for it.
- Stage 4 (Imputations): Two rate candidates exist, so no imputation is needed. For a different ROID where both payer and hospital columns are NULL, the imputation phase would compute a fallback estimate, for example imputed_rate = 21000.
- Stage 5 (Accuracy): Each non-NULL rate column gets a score. The payer negotiated rate ($18,000) scores 7. The hospital pct-to-dollar rate ($36,400) is not outlier-validated, so its accuracy score = 4. Columns with no data get score = 0.
- Stage 6 (Rate Selection): The highest-scored non-NULL column wins: payer_negotiated_rate ($18,000, score 7) becomes canonical_rate = 18000, with canonical_rate_score = 5 on the external scoring scale.
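The percentage-to-dollar step in the walkthrough is simple arithmetic, sketched below. The `pct_to_dollar` helper is hypothetical, the $28,000 gross charge is inferred from 36,400 / 1.30 rather than stated in the source, and the output column name is borrowed from the column table as a plausible target, not a confirmed one.

```python
from typing import Dict, Optional


def pct_to_dollar(pct: Optional[float], gross_charge: Optional[float]) -> Optional[float]:
    """Convert a percent-of-billed-charges rate to dollars; NULL in, NULL out."""
    if pct is None or gross_charge is None:
        return None
    return round(pct / 100.0 * gross_charge, 2)


row: Dict[str, Optional[float]] = {
    "payer_negotiated_rate": 18000.0,
    "hospital_pct_of_total_billed_charges_pct": 130.0,
}

gross_charge = 28000.0  # assumption: inferred from 36,400 / 1.30

# Additive: the raw percentage column is kept and a new dollar column is written next to it.
row["hospital_perc_of_total_billed_charges_gc_hosp_perc_to_dol"] = pct_to_dollar(
    row["hospital_pct_of_total_billed_charges_pct"], gross_charge
)

print(row["hospital_perc_of_total_billed_charges_gc_hosp_perc_to_dol"])  # 36400.0
```

Returning None when either input is missing keeps the transformed column NULL for ROIDs that lack a percentage rate or a gross charge, so later stages can still tell "no data" apart from a computed value.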
The wide table structure
By the time rate selection runs, each ROID row has dozens of populated or NULL columns — one per source, method, and gross charge variant. Here's a simplified view of the column groups:
| Phase | Column group | Example columns | Accuracy score column |
|---|---|---|---|
| Raw — Payer MRF | One per negotiated_type | payer_negotiated_rate, payer_fee_schedule_rate, payer_percentage_rate, payer_derived_rate | payer_negotiated_rate_validation_score, payer_fee_schedule_rate_validation_score, … |
| Raw — Hospital MRF | One per contract_methodology × amount type | hospital_fee_schedule_dollar, hospital_pct_of_total_billed_charges_pct, hospital_per_diem_rate, hospital_case_rate_dollar | hospital_fee_schedule_dollar_validation_score, hospital_pct_of_total_billed_charges_pct_validation_score, … |
| Transformations — Pct-to-Dollar | 6 rate types × 6 gross charge sources | payer_gc_hosp_perc_to_dol, hospital_perc_of_total_billed_charges_gc_hosp_perc_to_dol, hospital_perc_of_total_billed_charges_gc_komodo_cbsa_perc_to_dol | payer_gc_hosp_perc_to_dol_validation_score, … |
| Transformations — Drug / Anesthesia | Drug dosage methods; anesthesia per negotiated_type | drug_dosage_std_dollar, drug_dosage_std_ndc_dollar, payer_negotiated_rate_anesthesia_cf | drug_dosage_std_dollar_validation_score, … |
| Imputations | One per imputation tier | imputed_rate, imputed_rate_rc, imputed_rate_cstm | imputed_rate_validation_score, imputed_rate_rc_validation_score, … |
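
In the table above, every score column is the rate column name plus a `_validation_score` suffix. Assuming that convention holds for all columns, a rate column can be paired with its score generically. The `scored_candidates` helper below is a hypothetical illustration, not pipeline code, and treating a missing score as 0 is an assumption.

```python
from typing import Dict, List, Optional, Tuple

SCORE_SUFFIX = "_validation_score"


def scored_candidates(row: Dict[str, Optional[float]]) -> List[Tuple[str, float, float]]:
    """Return (rate_column, rate, score) for every non-NULL rate column in a row."""
    out: List[Tuple[str, float, float]] = []
    for col, value in row.items():
        # Skip the score columns themselves and any rate column with no data.
        if col.endswith(SCORE_SUFFIX) or value is None:
            continue
        score = row.get(col + SCORE_SUFFIX)
        out.append((col, value, score if score is not None else 0))  # assumption: missing score counts as 0
    return out


row = {
    "payer_negotiated_rate": 18000.0,
    "payer_negotiated_rate_validation_score": 7,
    "hospital_per_diem_rate": None,
    "hospital_per_diem_rate_validation_score": 0,
    "imputed_rate": None,
    "imputed_rate_validation_score": 0,
}

print(scored_candidates(row))
```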
Rate selection: one winner per ROID
Rate selection scans all scored columns for a ROID and picks the one with the highest accuracy score. Ties are broken by source preference (payer MRF over hospital MRF over imputation). The winner is written to canonical_rate, and the selection is recorded in canonical_rate_source and canonical_rate_subversion for full traceability.
ROIDs where every column scored 0 (no data at all) get canonical_rate = NULL. They still appear in the output — the ROS row is preserved — with NULL rate columns indicating a genuine coverage gap.
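The selection rule can be sketched as a sort on (score, source preference). The `Candidate` type, the `SOURCE_RANK` labels, and the choice to store the winning column name in canonical_rate_source are assumptions for illustration; the text does not say how transformation columns rank in a tie, so this sketch files the pct-to-dollar column under its originating hospital MRF source.

```python
from typing import Dict, List, NamedTuple, Optional

# Tie-break order from the text: payer MRF, then hospital MRF, then imputation.
SOURCE_RANK = {"payer_mrf": 0, "hospital_mrf": 1, "imputation": 2}


class Candidate(NamedTuple):
    column: str
    source: str
    rate: Optional[float]
    score: int


def select_canonical(candidates: List[Candidate]) -> Dict[str, object]:
    """Pick the highest-scored non-NULL column; break ties by source preference."""
    usable = [c for c in candidates if c.rate is not None and c.score > 0]
    if not usable:
        # Every column scored 0: the row is kept, but the rate stays NULL.
        return {"canonical_rate": None, "canonical_rate_source": None}
    best = min(usable, key=lambda c: (-c.score, SOURCE_RANK[c.source]))
    return {"canonical_rate": best.rate, "canonical_rate_source": best.column}


candidates = [
    Candidate("payer_negotiated_rate", "payer_mrf", 18000.0, 7),
    Candidate("hospital_perc_of_total_billed_charges_gc_hosp_perc_to_dol", "hospital_mrf", 36400.0, 4),
    Candidate("imputed_rate", "imputation", None, 0),
]

print(select_canonical(candidates))
```

With the walkthrough's values, the payer column wins on score alone. If both the payer and hospital columns carried the same score, the payer column would still win through the source ranking.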
Pipeline flow
The pipeline processes stages sequentially:
- Rate Object Space: Defines all ROIDs. Every row starts with all rate columns = NULL. → tmp_rate_object_space
- Raw Data: Payer MRF, Hospital MRF, Komodo, and Gross Charges each add columns. Most ROIDs have at most a few non-NULL columns. → tmp_int_combined_raw
- Transformations: Pct-to-dollar (36 columns), per diem GLOS, drug dosage (3 methods), anesthesia (payer-specific). Additive: raw columns are preserved. → tmp_int_transformations
- Imputations: Estimates for ROIDs with no raw or transformed rate. Falls back through a tier hierarchy. → tmp_int_imputations
- Accuracy: A score (0–7) is assigned to each non-NULL rate column. Scores encode data quality, outlier status, and counterparty validation. → tmp_int_accuracy_brit
- Rate Selection: Highest-scored column per ROID wins. Written to canonical_rate + canonical_rate_source. → prod_combined_abridged/prod_combined_all
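
The stage order above can be sketched as a list of (stage, output table) pairs run in sequence. The `run_pipeline` runner, the `add_raw` step, and the in-memory dict tables are hypothetical; the real pipeline presumably materializes these tables in a warehouse, and the relationship between prod_combined_abridged and prod_combined_all is not described here, so the label is kept verbatim.

```python
from typing import Callable, Dict, List, Tuple

Table = Dict[str, dict]  # ROID -> row of columns

# Stage names and output tables as listed above.
STAGES: List[Tuple[str, str]] = [
    ("Rate Object Space", "tmp_rate_object_space"),
    ("Raw Data", "tmp_int_combined_raw"),
    ("Transformations", "tmp_int_transformations"),
    ("Imputations", "tmp_int_imputations"),
    ("Accuracy", "tmp_int_accuracy_brit"),
    ("Rate Selection", "prod_combined_abridged/prod_combined_all"),
]


def run_pipeline(ros: Table, steps: Dict[str, Callable[[Table], Table]]) -> Dict[str, Table]:
    """Run stages in order; each stage receives the previous stage's output."""
    outputs: Dict[str, Table] = {}
    current = ros
    for name, out_table in STAGES:
        step = steps.get(name, lambda t: t)  # stages without a step pass the table through
        result = step(current)
        if set(result) != set(ros):
            # No stage may add or drop ROIDs.
            raise ValueError(f"{name} changed the ROID set")
        outputs[out_table] = result
        current = result
    return outputs


def add_raw(table: Table) -> Table:
    # Hypothetical raw-data step: copies each row and adds one column.
    return {roid: {**row, "payer_negotiated_rate": 18000.0} for roid, row in table.items()}


outputs = run_pipeline({"roid-1": {}}, {"Raw Data": add_raw})
print(list(outputs))
```

Checking the ROID set after every stage turns the "no phase changes the row count" rule into an explicit invariant, so a step that accidentally filters or duplicates rows fails immediately instead of surfacing as a coverage gap downstream.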