Imputations
Most ROIDs have no raw or transformed rate. Imputations estimate missing values using a hierarchical fallback chain — from the most specific available data to the least specific.
Overview
Why Imputations Are Needed
The Rate Object Space defines every (payer × provider × code × network) combination CLD will try to price. But payer MRFs and hospital MRFs don't cover everything — a payer might list rates for a hospital but omit certain CPT codes, or cover a network broadly without per-provider detail. Rather than leaving those ROIDs with canonical_rate = NULL, imputations estimate what the rate probably is, based on similar payers, providers, and codes.
Long Rates — the raw material
Before any imputation can be computed, all available raw and transformed rates are pivoted into a "long" format — one row per rate value rather than one row per ROID. This enables aggregations across similar ROIDs (e.g., "what do all payers pay this provider for code 99213?").
- Long rates — raw columns — Iterates over all raw dollar/percentage rate columns in
combined_raw. For each non-NULL value, emits one long row with:roid,rate,accuracy_score,colname,rate_type,provider_type,payer_id. Excludes J/Q drug codes (priced separately). Joins totmp_int_accuracy_rawto get the accuracy score for each value. →tmp_int_imputations_long_rates_raw_columns - Long rates — transformed columns — Same structure but pulling from transformation columns (pct-to-dollar, drug dosage, anesthesia, etc.). →
tmp_int_imputations_long_rates_transformed_columns - Long rates — combined — UNION of raw + transformed long tables. Only non-outlier rates (
accuracy_score > min_score) feed into imputation aggregations — this prevents bad rates from contaminating imputed values. →tmp_int_imputations_long_rates
The Full Imputation Chain
- RC Global + RC HCPCS + RC Carveouts — Use revenue code (RC) rates from hospital MRFs to impute HCPCS codes. RC Global: one rate applies to all HCPCS in a revenue code family. RC HCPCS: crosswalk resolves specific HCPCS codes. RC Carveouts: payer-specific overrides. →
tmp_int_imputations_rc_global,tmp_int_imputations_rc_hcpcs - MS-DRG Base Rate Detection — Detect whether a provider + payer has a consistent base rate structure (rate ≈ base × CMS DRG weight). If so, impute missing DRGs by multiplying the detected base rate by the CMS weight for that DRG. →
tmp_int_msdrg_base_rates - Main Imputations (chunked by payer) — Core imputation task. For each ROID with no raw rate, tries 5 aggregation tiers from most specific to least specific. Returns the first tier with enough observations. →
tmp_int_imputations - Derived Imputations (chunked by payer) — Secondary gap-filling for Hospital and ASC ROIDs using
rate_object_space+ gross charges + long rates. Builds on the main imputations output. →tmp_int_imputations_derived - CSTM Imputations — Custom surgical grouper-based imputations for specific payers (UHC, Aetna, Cigna, etc.). Groups codes into procedure tiers; imputes using tier averages from payer contract grouper tables. →
tmp_int_imputations_cstm - APR-DRG Imputations — Final fallback for APR-DRG coded ROIDs. Uses derived and CSTM outputs joined via the APR-DRG crosswalk (MS-DRG → APR-DRG). →
tmp_int_imputations_aprdrg
Walk-Through: A ROID with No Raw Rate
- ROID: Cigna PPO, Community Hospital, CPT 99213, Professional — no raw rate in payer MRF or hospital MRF
- Long rates check: Cigna PPO has rates for 99213 at other hospitals. These rates land in the long rates table and feed into tier aggregations.
- Tier 1 (Provider + Payer + Network + Code): Community Hospital + Cigna PPO + 99213 → only 2 observed rates (n < 5) → skip
- Tier 2 (Provider + Code, all payers): Community Hospital + 99213 → 8 observed rates across payers → avg = $155 → use this
imputed_rate = $155 (tier 2) | imputation_score = 2 | always lower priority than validated raw rates in rate selection
Imputation Scores
| Score | Meaning |
|---|---|
| 6 | Raw rate available, not an outlier — highest quality (passthrough, not actually imputed) |
| 3 | Hospital MRF gross charge with proof-of-concept, or state-level Medicare benchmark |
| 2 | Not validated, not an outlier — estimated from similar ROIDs |
| 1 | Outlier — rate falls outside expected bounds |
| 0 | Default / no imputation found at any tier |
Imputation scores feed into the final canonical_rate_score (1–5 scale) after rate selection. Imputed rates always score below validated raw rates, so raw data wins whenever available.
Imputation Tiers
The main imputation task tries 5 aggregation tiers in order, from most specific to least specific. The first tier with enough observations wins.
| Tier | Partition key | Min N | Example |
|---|---|---|---|
| 1 | Provider + Payer + Network + Code | 5 | Hospital X, UHC PPO, code 99213 — rates at this exact combination |
| 2 | Provider + Code (all payers) | 5 | Hospital X, code 99213 — across all payers at this provider |
| 3 | Provider Type + Code | 3 | All Hospitals, code 99213 — provider-type-level benchmark |
| 4 | State + Code | — | California, code 99213 — state-wide benchmark |
| 5 | Code (national) | — | Code 99213 nationally — last resort |
Aggregations within each tier are computed by the aggregator() macro in imputations.sql, which computes AVG, COUNT, and STDDEV over the long rates for that partition. The macro uses accuracy_score > min_score to exclude outlier rates from the aggregate.
Walk-through: imputation for missing rate
- ROID: Hospital X, UHC PPO, code 99213 — no raw or transformed rate exists
- Tier 1: Hospital X + UHC + PPO + 99213 → only 2 observed rates (n < 5) → skip
- Tier 2: Hospital X + 99213 (all payers) → 8 observed rates, avg = $185 → use this
imputed_rate = $185 (tier 2: provider + code across payers) | imputation_score = 2
MS-DRG Base Rate Detection
Detect if a provider + payer has a consistent base rate structure (rate ≈ base_rate × CMS DRG weight). If so, impute missing DRGs using the detected base rate.
Why Base Rates?
Many hospital contracts are structured as a single base rate that is multiplied by the CMS relative weight for each DRG. If CLD observes this pattern across several DRGs for a given provider + payer, it can detect the implied base rate and use it to fill in DRGs where no rate was reported.
Walk-through: base rate detection
Observed rates at Hospital X, UHC PPO:
MS-DRG 470 (CMS weight 2.04): rate = $20,400 → implied base = $10,000
MS-DRG 392 (CMS weight 1.12): rate = $11,200 → implied base = $10,000
MS-DRG 291 (CMS weight 0.89): rate = $8,900 → implied base = $10,000
MS-DRG 194 (CMS weight 1.50): rate = $15,100 → implied base ≈ $10,067
- CV check: coefficient of variation across implied bases < 0.15 → consistent base rate pattern confirmed
- Detected base rate: $10,000 (average of implied bases)
- Impute missing DRG 871 (weight 3.50): 35,000**
The detection uses the long rates table filtered to billing_code_type = 'MS-DRG' and bill_type = 'Inpatient'. Allowed amount columns are excluded. Minimum 5 observed DRG rates are required; CV must be < 0.15.
Revenue Code (RC) Imputations
Use revenue code rates from hospital MRFs to impute HCPCS codes. RC = Revenue Code — the 4-digit codes hospitals use to report services at the departmental level (e.g., 0120 = Medical/Surgical Supplies).
Why Revenue Codes?
Hospital MRFs sometimes report rates at the revenue code family level rather than at the individual HCPCS level. For example, a hospital might report a single rate for "Operating Room" (RC 036x) without breaking it down by individual procedure codes. The RC imputations reverse-engineer HCPCS rates from these family-level prices using crosswalk tables.
Three RC Imputation Types
- RC Global: A payer posts a single rate for a revenue code family in
core_rates(billing_code_type = 'RC'). CLD applies this rate to all HCPCS codes in that family. Scoped to inpatient RCs (codes starting with 00, 01, 020, 021). →tmp_int_imputations_rc_global - RC HCPCS: More granular crosswalk from revenue codes to specific HCPCS codes using
rc_to_hcpcs_cross_validated_75. When a payer reports an RC rate, this table maps it to the specific HCPCS codes that are typically billed under that revenue code. →tmp_int_imputations_rc_hcpcs - RC Carveout: Some HCPCS codes have specific pricing that overrides the family rate — they are "carved out" of the RC imputation and treated separately.
RC imputations are preprocessing inputs, not the final imputation output. They feed into the main imputations.sql chain alongside long rates, providing additional rate observations for the tier aggregations.
CSTM & APR-DRG Imputations
Late-stage fallback imputations after the main chain. CSTM uses surgical procedure groupers; APR-DRG uses the crosswalk from MS-DRG outputs.
CSTM (Custom Surgical Grouper) Imputations
Some payers structure their contracts around surgical procedure tiers rather than individual HCPCS codes. CSTM imputations encode these payer-specific grouper relationships:
- Reference tables:
tq_intermediate.cld_utils.contract_surg_grouper_{payer}_v0— one per payer (UHC, Aetna, Cigna, etc.) - Groups HCPCS codes into procedure tiers; imputes missing codes using tier averages from the payer's observed rates
- Source: pulls from
core_ratesjoined to network/provider spines — scoped to Inpatient and Outpatientbill_types - Output:
tmp_int_imputations_cstm
APR-DRG Imputations
APR-DRG ROIDs (used by some hospitals instead of MS-DRG) can be imputed by leveraging the derived and CSTM imputation outputs for related MS-DRGs, then crosswalking back via the APR-DRG crosswalk table:
- Filter
rate_object_spaceto APR-DRG billing codes - Aggregate derived imputations and CSTM imputations using the APR-DRG → MS-DRG crosswalk
- Join back with gross charges for the APR-DRG ROID
- Output:
tmp_int_imputations_aprdrg
Imputation order matters: Main imputations → MS-DRG base rates → Derived imputations → CSTM → APR-DRG. Each stage can use outputs from previous stages. CSTM and APR-DRG are last resorts — they only activate when earlier stages found nothing.