Version: 2.2

3. Data Processing

Transformations

Rates may be stored in MRFs as per diems or percentages. A transformation is applied to convert these values into dollar amounts.

Per Diem to Dollar

To convert a per diem rate to a dollar amount, we multiply the per diem rate by the geometric mean length of stay (GLOS) from Medicare (as of 2025-07-16). Soon, we'll begin estimating provider-specific GLOS values from historical claims data.

GLOS source table: tq_production.reference_legacy.ref_cms_msdrg

Geometric mean

The geometric mean is the product of n values, raised to the power of 1/n.

$$
GLOS = \left( \prod_{i=1}^{n} LOS_i \right)^{1/n}
$$

where $LOS_i$ is the length of stay for claim $i$ and $n$ is the total number of claims.
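
As a concrete sketch, the per diem conversion is a join against the GLOS source table followed by a multiplication. This is illustrative only: the rates_long table and the column names (negotiated_rate, geometric_mean_los, ms_drg_code, rate_methodology) are assumptions, not the pipeline's actual schema.

-- Per diem to dollar conversion using Medicare GLOS
-- (illustrative schema; only the ref_cms_msdrg source table is documented, its columns are assumed)
SELECT
    r.provider_id,
    r.billing_code,
    r.negotiated_rate                            AS per_diem_rate,
    ref.geometric_mean_los                       AS glos,
    r.negotiated_rate * ref.geometric_mean_los   AS dollar_rate
FROM rates_long AS r
JOIN tq_production.reference_legacy.ref_cms_msdrg AS ref
    ON r.billing_code = ref.ms_drg_code
WHERE r.rate_methodology = 'per diem'

When provider-specific GLOS values are eventually estimated from claims, the geometric mean can be computed in SQL as EXP(AVG(LN(length_of_stay))), which is algebraically equivalent to the formula above.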

Percentage to Dollar

To convert a percentage rate to a dollar amount, we multiply the percentage by the gross charge.

There are two sources of gross charge data:

  • Hospital MRFs
  • Claims Data

For more on gross charges, see data collection.

We compute a dollar rate for each available gross charge value. As an example, all of the following are computed:

  • percentage * gross charge from hospital MRF
  • percentage * gross charge from claims (provider median)
  • percentage * CBSA average gross charge from hospital MRF
  • percentage * STATE average gross charge from hospital MRF
  • percentage * CBSA average gross charge from claims
  • percentage * STATE average gross charge from claims

For gross charge averages estimated at the CBSA or STATE level, we use "CCR adjustments" to adjust the gross charge based on the provider's cost-to-charge ratio (CCR). See data collection for more details.
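
A minimal sketch of one of the combinations above (percentage multiplied by a CBSA-average gross charge, with the CCR adjustment); every table and column name here is hypothetical:

-- Percentage to dollar using a CBSA-average gross charge with CCR adjustment
-- (hypothetical schema)
SELECT
    r.provider_id,
    r.billing_code,
    r.negotiated_rate                                              AS pct_of_gross_charge,
    gc.cbsa_avg_gross_charge * (p.provider_ccr / gc.cbsa_avg_ccr)  AS adjusted_gross_charge,
    r.negotiated_rate
        * gc.cbsa_avg_gross_charge
        * (p.provider_ccr / gc.cbsa_avg_ccr)                       AS dollar_rate
FROM rates_long AS r
JOIN provider_ccr AS p
    ON r.provider_id = p.provider_id
JOIN cbsa_gross_charges AS gc
    ON p.cbsa = gc.cbsa
   AND r.billing_code = gc.billing_code
WHERE r.rate_methodology = 'percentage'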

Imputation

Imputations are used to fill gaps in MRF data where rate information is missing or incomplete. Rather than leaving rate objects without rates, the CLD pipeline infers missing rates statistically from patterns observed in the available data.

The imputation system operates on a tier-based hierarchy, where each tier represents a different methodology for inferring missing rates, ranked from most to least reliable:

Tier 1: Exact Code and Rate Match

This tier uses rates that exactly match the rate object with minimal transformation:

  • Payer/Hospital MRF Dollar Amounts: Direct dollar rates from MRF files
  • Hospital MRF Estimated Allowed Amounts: Calculated allowed amounts from hospital data
  • Percentage with Exact Charge Match: MRF percentages multiplied by exact gross charge matches
  • Claims-Based Allowed Amounts: Validated rates from Komodo claims data with sufficient sample size (N > 11)

Tier 2: Exact Code Match with Translation

This tier involves exact code matches but requires data transformation:

  • Percentage with Estimated Charge: MRF percentages multiplied by market-average gross charges when exact matches aren't available
  • Per Diem to Dollar Conversion: Per diem rates multiplied by Medicare geometric mean length of stay (GLOS)
  • APR-DRG to MS-DRG Mapping: APR-DRG rates transformed using TQ's custom crosswalk
  • APC to HCPCS Mapping: APC rates mapped to HCPCS codes using CMS reference data

Tier 3: Inferred Provision Match

This tier identifies common patterns in rate structures to infer missing rates:

MS-DRG Base Rates

  • Case Rate Inference: When multiple MS-DRG rates divided by their CMS weights share a common base rate, this base rate is used to impute missing MS-DRG rates
  • Base Percentage Rates: When multiple MS-DRG percentages share a common value, suggesting a global inpatient percentage rate
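
As a hypothetical illustration of case rate inference: if a contract posts $13,000 for an MS-DRG with CMS weight 1.0 and $26,000 for an MS-DRG with weight 2.0, both imply the same $13,000 base rate; a missing MS-DRG with weight 1.5 would then be imputed as 1.5 × $13,000 = $19,500.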

HCPCS Outpatient Rates

  • Outpatient Procedure Grouper (OPG) Rates: Common rates identified across related outpatient procedure codes
  • Outpatient Base Percentage: Shared percentage rates across multiple HCPCS codes

Revenue Code (RC) Based Rates

  • RC Global Rates: When 30+ revenue codes share a common rate, indicating a global reimbursement methodology
  • RC Family Rates: Specialized rates for specific service areas (NICU, ICU, CCU, etc.) identified through revenue code pattern analysis

Imputation Process

Stage 1: Data Preparation and Chunking

  • Payer-Based Partitioning: Rate objects are processed in chunks partitioned by payer_id to optimize parallel processing and memory usage
  • Feature Enrichment: Each rate object is enriched with code characteristics (MRF base rate, OPG Grouper)
  • Long-Format Transformation: Raw rates are pivoted from wide format (multiple rate columns) to long format for pattern analysis
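
A simplified sketch of the wide-to-long pivot (column names are assumed for illustration); each source rate column becomes its own row tagged with a rate type so the pattern analysis in Stage 2 can treat all sources uniformly:

-- Pivot wide rate columns into long format for pattern analysis
-- (hypothetical column names)
SELECT payer_id, network_id, provider_id, billing_code, billing_code_type,
       'mrf_dollar' AS rate_type, mrf_dollar_rate AS rate_value
FROM rates_wide
UNION ALL
SELECT payer_id, network_id, provider_id, billing_code, billing_code_type,
       'mrf_percentage' AS rate_type, mrf_percentage_rate AS rate_value
FROM rates_wide
UNION ALL
SELECT payer_id, network_id, provider_id, billing_code, billing_code_type,
       'per_diem' AS rate_type, per_diem_rate AS rate_value
FROM rates_wide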

Stage 2: Statistical Pattern Recognition

Frequency Analysis by Provider-Network-Payer Groups:

-- Example pattern recognition for MS-DRG base rates
SELECT
    payer_id,
    network_id,
    provider_id,
    ROUND(rate_value / cms_weight, 0) AS candidate_base_rate,
    COUNT(*) AS frequency,
    COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY payer_id, network_id, provider_id) AS coverage_ratio
FROM rates_long
WHERE billing_code_type = 'MS-DRG'
GROUP BY payer_id, network_id, provider_id, candidate_base_rate

RC Family Matching with Jaccard Index:

-- Jaccard Index calculation for RC family matching
jaccard_index = COUNT(intersection_codes) / COUNT(union_codes)
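
If the observed and reference revenue codes are stored as arrays (Spark/Databricks SQL is an assumption here, suggested by the three-part table names elsewhere in these docs), the same index can be computed with built-in array functions; the table and column names are illustrative:

-- Jaccard Index between observed and reference revenue code sets
-- (assumes Spark/Databricks SQL array functions; schema is hypothetical)
SELECT
    provider_id,
    rc_family,
    SIZE(ARRAY_INTERSECT(observed_rc_array, reference_rc_array))
        / SIZE(ARRAY_UNION(observed_rc_array, reference_rc_array)) AS jaccard_index
FROM rc_family_candidates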

Stage 3: Base Rate Validation and Application

  1. Primary Validation: Check frequency thresholds and coverage requirements for each candidate base rate
  2. Rate Reasonableness: Apply business logic checks (e.g., base rates shouldn't be negative, shouldn't exceed 10x Medicare rates)

Rate Application with Multipliers:

-- Example: Applying MS-DRG base rate with CMS weight multiplier
CASE
    WHEN msdrg_n_freq >= 10 AND msdrg_coverage_ratio >= 0.90
        THEN msdrg_candidate_base_rate * cms_weight
    ELSE NULL
END AS msdrg_imputed_rate

Stage 4: Gross Charge Integration

Hierarchical Charge Selection for Percentage-Based Imputations:

  1. Provider-Specific Charges: Use exact provider-code gross charges when available
  2. CBSA Market Averages: Fall back to metropolitan area averages with CCR adjustments
  3. State Market Averages: Use state-level averages as final fallback
  4. National Benchmarks: Apply national averages only for rare codes with insufficient regional data
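
This fallback order maps naturally onto a COALESCE over progressively broader charge sources; the column names below are illustrative, and the CCR adjustment is assumed to have been applied to the market averages beforehand:

-- Hierarchical gross charge selection for percentage-based imputations
-- (hypothetical column names)
COALESCE(
    provider_gross_charge,              -- 1. exact provider-code match
    cbsa_avg_gross_charge_ccr_adj,      -- 2. CBSA market average (CCR-adjusted)
    state_avg_gross_charge_ccr_adj,     -- 3. state market average (CCR-adjusted)
    national_avg_gross_charge           -- 4. national benchmark for rare codes
) AS selected_gross_charge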

CCR Adjustment Application:

-- Adjust market-average gross charges based on provider cost-to-charge ratio
adjusted_gross_charge = market_gross_charge * (provider_ccr / market_avg_ccr)

Quality Controls

The imputation system implements multiple layers of quality control to ensure reliability and accuracy:

Statistical Thresholds

Minimum Sample Size Requirements:

  • MS-DRG Case Rates: ≥10 observations sharing common base rate
  • MS-DRG Percentage Rates: ≥50 observations with ≥90% coverage
  • Outpatient Procedure Grouper (OPG): ≥15 codes with ≥70% coverage, minimum 40 total codes in grouper
  • Outpatient Percentage Rates: ≥200 observations with ≥90% coverage
  • Revenue Code Global Rates: ≥30 distinct revenue codes sharing common rate

Coverage Requirements:

  • MS-DRG Imputations: Base rate must represent >90% of available MS-DRG rates for the provider-network-payer combination
  • RC Family Matching: Jaccard Index similarity score >25% between observed revenue code array and reference mapping
  • Cross-Provider Validation: Base rates flagged if they deviate >3 standard deviations from similar providers

Data Quality Checks

Rate Reasonableness Validation:

-- Example quality control checks
WHERE imputed_rate > 0
  AND imputed_rate < 1000000                -- Maximum reasonable rate cap
  AND imputed_rate > (medicare_rate * 0.1)  -- Minimum 10% of Medicare
  AND imputed_rate < (medicare_rate * 20)   -- Maximum 20x Medicare

For detailed methodology and examples, see Imputation Tiers.


Accuracy

The Accuracy component evaluates every rate in the CLD pipeline to determine its reliability and assign confidence scores. This process ensures that only the highest-quality rates are selected as canonical rates for each rate object.

Validation Process

The accuracy system operates in two main stages:

Accuracy Assessment

Each rate receives a validation score based on its source and characteristics:

  • Source Quality: Raw MRF rates score higher than transformed or imputed rates
  • Outlier Detection: Rates are compared against expected distributions to identify outliers
  • Benchmark Validation: Rates are cross-referenced with Medicare rates, claims data, and other benchmarks
  • Statistical Likelihood: For tied rates, likelihood scores based on expected distributions determine precedence

Business rules are applied to refine accuracy scores:

  • Drug Rate Preferences: For pharmaceutical codes, hospital rates are preferred over payer rates when tied
  • Provider Type Considerations: Different validation criteria for hospitals, ASCs, imaging centers, etc.
  • Code Type Logic: Specialized rules for MS-DRGs, HCPCS, APCs, and other code types

Canonical Rate Score Scale

The final accuracy assessment produces a 1-5 canonical rate score (with 0 reserved for rate objects that have no rate data at all):

| Score | Interpretation | Description |
| --- | --- | --- |
| 5 | Validated | Rate confirmed through cross-validation between Payer + Hospital MRF raw or transformed data |
| 4 | Raw Posted Rate | Direct dollar amount from payer/hospital MRF, not an outlier |
| 3 | Validated Transform/Imputation | Calculated or imputed rate confirmed by benchmark validation |
| 2 | Unvalidated Transform/Imputation | Calculated or imputed rate, reasonable but not benchmark-validated |
| 1 | Outlier | Rate identified as statistical outlier, low confidence |
| 0 | No Rate | No rate data available for this rate object |

Canonical Rate Selection

The canonical rate selection process chooses the "best" rate for each rate object through a hierarchical approach:

Within Sub-Version

  1. Highest Score Wins: Select the rate with the highest canonical rate score
  2. Tie-Breaking: When multiple rates have the same score, use likelihood percentiles from the validation score decimals

Across Sub-Versions

When combining multiple sub-versions into prod_combined_all:

  1. Score Hierarchy: Compare rates across recent sub-versions, selecting highest score
  2. Rate Type Hierarchy (for ties):
    • Posted Rates (direct MRF data)
    • Real World Rates (claims-based)
    • Enhanced Rates (sophisticated imputations)
    • Benchmark Rates (Medicare/reference rates)
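
One way to express this selection is a window function ordered by score, then the rate type hierarchy, then the likelihood percentile used for within-score ties; the table and column names below are illustrative only, not the pipeline's actual schema:

-- Canonical rate selection sketch (hypothetical schema)
SELECT *
FROM (
    SELECT
        c.*,
        ROW_NUMBER() OVER (
            PARTITION BY rate_object_id
            ORDER BY
                canonical_rate_score DESC,    -- 1. highest score wins
                CASE rate_type                -- 2. rate type hierarchy for ties
                    WHEN 'posted'     THEN 1
                    WHEN 'real_world' THEN 2
                    WHEN 'enhanced'   THEN 3
                    WHEN 'benchmark'  THEN 4
                END,
                likelihood_percentile DESC    -- 3. likelihood tie-break
        ) AS rank_within_object
    FROM rate_candidates AS c
) ranked
WHERE rank_within_object = 1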

Outlier Detection Methods

See Outlier Detection for details on how outliers are identified and handled in the accuracy process.


For implementation details and scoring methodologies, see Accuracy Scores and Canonical Selection.