Version: 2.1

3. Data Processing

Transformations

Rates may be stored in MRFs as per diems or percentages. A transformation is applied to convert these values into dollar amounts.

Per Diem to Dollar

To convert a per diem rate to a dollar amount, we multiply the per diem rate by the geometric mean length of stay (GLOS) from Medicare (as of 2025-07-16). We plan to start estimating provider-specific GLOS values using historical claims data.

GLOS source table: tq_production.reference_legacy.ref_cms_msdrg

Geometric Mean

The geometric mean is the product of n values, raised to the power of 1/n.

$GLOS = (\prod_{i=1}^{n} LOS_i)^{1/n}$

where $LOS_i$ is the length of stay for claim $i$ and $n$ is the total number of claims.
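As a small illustration of the formula and the per diem conversion above, the sketch below computes a geometric mean from a few made-up length-of-stay values and multiplies a per diem rate by it. The function names and sample numbers are hypothetical; the pipeline itself reads GLOS from the Medicare reference table rather than computing it per claim.

```python
import math

def geometric_mean_los(los_values):
    """Geometric mean of positive length-of-stay values (in days)."""
    if not los_values or any(v <= 0 for v in los_values):
        raise ValueError("LOS values must be a non-empty list of positive numbers")
    # Averaging the logs is equivalent to taking the n-th root of the product
    # and avoids overflow when there are many claims.
    return math.exp(sum(math.log(v) for v in los_values) / len(los_values))

def per_diem_to_dollar(per_diem_rate, glos):
    """Convert a per diem rate to a dollar amount for the whole stay."""
    return per_diem_rate * glos

glos = geometric_mean_los([2, 4, 8])          # cube root of 2 * 4 * 8 = 64
dollar_rate = per_diem_to_dollar(1500.0, glos)
```

Because the geometric mean is less sensitive to a few very long stays than the arithmetic mean, the converted dollar amount is not inflated by extreme lengths of stay.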

Percentage to Dollar

To convert a percentage rate to a dollar amount, we multiply the percentage by the gross charge.

There are two sources of gross charge data:

Hospital MRFs
Claims Data

For more on gross charges, see data collection.

We compute a dollar rate for each available gross charge value. As an example, all of the following are computed:

percentage * gross charge from hospital MRF
percentage * gross charge from claims (provider median)
percentage * CBSA average gross charge from hospital MRF
percentage * STATE average gross charge from hospital MRF
percentage * CBSA average gross charge from claims
percentage * STATE average gross charge from claims

For gross charge averages estimated at the CBSA or STATE level, we use "CCR adjustments" to adjust the gross charge based on the provider's cost-to-charge ratio (CCR). See data collection for more details.
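The idea of computing one dollar rate per available gross charge can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: the source labels, the sample charges, and the convention that the percentage is stored as a fraction are all assumptions, and the CCR adjustment for CBSA and STATE averages is left out.

```python
def percentage_to_dollar_rates(percentage, gross_charges):
    """Return one dollar rate per gross charge source that has a value.

    percentage: negotiated percentage as a fraction (0.5 means 50%).
    gross_charges: dict mapping a source label to a gross charge, or None
    when that source has no charge for the code.
    """
    return {
        source: round(percentage * charge, 2)
        for source, charge in gross_charges.items()
        if charge is not None
    }

# Hypothetical gross charges for a single rate object.
charges = {
    "hospital_mrf": 10000.0,
    "claims_provider_median": 9500.0,
    "cbsa_avg_hospital_mrf": None,     # not available for this code
    "state_avg_hospital_mrf": 11000.0,
}
rates = percentage_to_dollar_rates(0.5, charges)
```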

Imputation

Imputations fill gaps in MRF data where rate information is missing or incomplete. Rather than leaving rate objects without rates, the CLD pipeline infers missing rates from patterns observed in the available data.

The imputation system operates on a tier-based hierarchy, where each tier represents a different methodology for inferring missing rates, ranked from most to least reliable:

Tier 1: Exact Code and Rate Match

This tier uses rates that exactly match the rate object with minimal transformation:

Payer/Hospital MRF Dollar Amounts: Direct dollar rates from MRF files
Hospital MRF Estimated Allowed Amounts: Calculated allowed amounts from hospital data
Percentage with Exact Charge Match: MRF percentages multiplied by exact gross charge matches
Claims-Based Allowed Amounts: Validated rates from Komodo claims data with sufficient sample size (N > 11)

Tier 2: Exact Code Match with Translation

This tier involves exact code matches but requires data transformation:

Percentage with Estimated Charge: MRF percentages multiplied by market-average gross charges when exact matches aren't available
Per Diem to Dollar Conversion: Per diem rates multiplied by Medicare geometric mean length of stay (GLOS)
APR-DRG to MS-DRG Mapping: APR-DRG rates transformed using TQ's custom crosswalk
APC to HCPCS Mapping: APC rates mapped to HCPCS codes using CMS reference data

Tier 3: Inferred Provision Match

This tier identifies common patterns in rate structures to infer missing rates:

MS-DRG Base Rates

Case Rate Inference: When multiple MS-DRG rates divided by their CMS weights share a common base rate, this base rate is used to impute missing MS-DRG rates
Base Percentage Rates: When multiple MS-DRG percentages share a common value, suggesting a global inpatient percentage rate

HCPCS Outpatient Rates

Outpatient Procedure Grouper (OPG) Rates: Common rates identified across related outpatient procedure codes
Outpatient Base Percentage: Shared percentage rates across multiple HCPCS codes

Revenue Code (RC) Based Rates

RC Global Rates: When 30+ revenue codes share a common rate, indicating a global reimbursement methodology
RC Family Rates: Specialized rates for specific service areas (NICU, ICU, CCU, etc.) identified through revenue code pattern analysis

Imputation Process

Stage 1: Data Preparation and Chunking

Payer-Based Partitioning: Rate objects are processed in chunks partitioned by payer_id to optimize parallel processing and memory usage
Feature Enrichment: Each rate object is enriched with code characteristics (MRF base rate, OPG)
Long-Format Transformation: Raw rates are pivoted from wide format (multiple rate columns) to long format for pattern analysis

Stage 2: Statistical Pattern Recognition

Frequency Analysis by Provider-Network-Payer Groups:

-- Example pattern recognition for MS-DRG base rates
SELECT
    payer_id, network_id, provider_id,
    ROUND(rate_value / cms_weight, 0) AS candidate_base_rate,
    COUNT(*) AS frequency,
    -- Multiply by 1.0 so the ratio is not truncated by integer division
    COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (PARTITION BY payer_id, network_id, provider_id) AS coverage_ratio
FROM rates_long
WHERE billing_code_type = 'MS-DRG'
  AND cms_weight > 0  -- avoid division by zero
GROUP BY payer_id, network_id, provider_id, ROUND(rate_value / cms_weight, 0)

RC Family Matching with Jaccard Index:

-- Jaccard Index calculation for RC family matching
jaccard_index = COUNT(intersection_codes) / COUNT(union_codes)
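The Jaccard Index above is the size of the intersection divided by the size of the union of two code sets. A minimal sketch, in which the revenue codes and the reference family are made-up examples and the 0.25 cutoff mirrors the RC Family Matching threshold listed under Coverage Requirements:

```python
def jaccard_index(observed_codes, reference_codes):
    """Jaccard similarity between two collections of revenue codes."""
    observed, reference = set(observed_codes), set(reference_codes)
    union = observed | reference
    if not union:
        return 0.0  # two empty sets: treat as no similarity
    return len(observed & reference) / len(union)

# Hypothetical inputs: codes observed at one shared rate vs. a reference family.
observed = ["0172", "0173", "0200"]
reference_family = ["0172", "0173", "0174"]

score = jaccard_index(observed, reference_family)   # 2 shared / 4 distinct
is_family_match = score > 0.25
```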

Stage 3: Base Rate Validation and Application

Primary Validation: Check frequency thresholds and coverage requirements for each candidate base rate
Rate Reasonableness: Apply business logic checks (e.g., base rates shouldn't be negative, shouldn't exceed 10x Medicare rates)

Rate Application with Multipliers:

-- Example: Applying MS-DRG base rate with CMS weight multiplier
CASE 
    WHEN msdrg_n_freq >= 10 AND msdrg_coverage_ratio >= 0.90 
    THEN msdrg_candidate_base_rate * cms_weight
    ELSE NULL 
END as msdrg_imputed_rate
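To make the MS-DRG flow from pattern recognition to rate application concrete, here is a small end-to-end sketch. It is an illustration under stated assumptions rather than the pipeline's code: the function name, the dictionary inputs, and the sample weights are hypothetical, and the thresholds (10 observations, 0.90 coverage) are taken from the SQL example above.

```python
from collections import Counter

def impute_msdrg_rates(known_rates, cms_weights, min_freq=10, min_coverage=0.90):
    """Impute missing MS-DRG rates from a shared base rate.

    known_rates: {msdrg_code: dollar rate} for one provider-network-payer group.
    cms_weights: {msdrg_code: CMS weight} for every code of interest.
    Returns {msdrg_code: imputed rate}, or {} if no base rate qualifies.
    """
    usable = {c: r for c, r in known_rates.items() if cms_weights.get(c, 0) > 0}
    if not usable:
        return {}
    # Candidate base rate = rate / weight, rounded to the nearest dollar.
    # (Python's round() handles exact .5 ties differently from SQL ROUND.)
    candidates = Counter(round(r / cms_weights[c]) for c, r in usable.items())
    base_rate, freq = candidates.most_common(1)[0]
    coverage = freq / len(usable)
    if freq < min_freq or coverage < min_coverage:
        return {}
    return {
        code: base_rate * weight
        for code, weight in cms_weights.items()
        if code not in known_rates and weight > 0
    }

# Hypothetical group: 11 posted rates built on an 8,000 base, 1 missing code.
weights = {f"{i:03d}": 0.5 * i for i in range(1, 13)}
known = {code: 8000.0 * w for code, w in weights.items() if code != "012"}
imputed = impute_msdrg_rates(known, weights)
```

With only a handful of posted rates the function returns an empty result, which matches the intent of the frequency threshold: a base rate seen a few times is not treated as a contract-wide pattern.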

Stage 4: Gross Charge Integration

Hierarchical Charge Selection for Percentage-Based Imputations:

Provider-Specific Charges: Use exact provider-code gross charges when available
CBSA Market Averages: Fall back to metropolitan area averages with CCR adjustments
State Market Averages: Fall back to state-level averages when CBSA averages are unavailable
National Benchmarks: Apply national averages only for rare codes with insufficient regional data

CCR Adjustment Application:

-- Adjust market-average gross charges based on provider cost-to-charge ratio
adjusted_gross_charge = market_gross_charge * (provider_ccr / market_avg_ccr)
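The fallback order and the CCR adjustment in this stage can be sketched together as below. The function names and the tuple layout are hypothetical, the adjustment uses the formula exactly as written above, and leaving the national benchmark unadjusted is an assumption made for this example.

```python
def ccr_adjust(market_gross_charge, provider_ccr, market_avg_ccr):
    """Apply the CCR adjustment formula as stated in this section."""
    return market_gross_charge * (provider_ccr / market_avg_ccr)

def select_gross_charge(provider_charge, provider_ccr,
                        cbsa=None, state=None, national_charge=None):
    """Pick a gross charge using the hierarchy described above.

    cbsa / state: optional (avg_gross_charge, avg_ccr) tuples.
    Returns (source_label, gross_charge), or (None, None) if nothing is available.
    """
    if provider_charge is not None:
        return "provider", provider_charge
    for level, market in (("cbsa", cbsa), ("state", state)):
        if market is not None:
            avg_charge, avg_ccr = market
            return level, ccr_adjust(avg_charge, provider_ccr, avg_ccr)
    if national_charge is not None:
        # Assumption: national benchmarks are used without CCR adjustment.
        return "national", national_charge
    return None, None

# Hypothetical provider with no own charge and no CBSA average.
source, charge = select_gross_charge(None, 0.25, state=(8000.0, 0.5))
```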

Quality Controls

The imputation system implements multiple layers of quality control to ensure reliability and accuracy:

Statistical Thresholds

Minimum Sample Size Requirements:

MS-DRG Case Rates: ≥10 observations sharing common base rate
MS-DRG Percentage Rates: ≥50 observations with ≥90% coverage
Outpatient Procedure Grouper (OPG): ≥15 codes with ≥70% coverage, minimum 40 total codes in grouper
Outpatient Percentage Rates: ≥200 observations with ≥90% coverage
Revenue Code Global Rates: ≥30 distinct revenue codes sharing common rate

Coverage Requirements:

MS-DRG Imputations: Base rate must represent >90% of available MS-DRG rates for the provider-network-payer combination
RC Family Matching: Jaccard Index similarity score >25% between observed revenue code array and reference mapping
Cross-Provider Validation: Base rates flagged if they deviate >3 standard deviations from similar providers
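One way to picture the cross-provider check is a z-score against peer base rates. The sketch below is only an assumption about how such a flag could be computed: how "similar providers" are chosen is not specified here, so the peer list, the function name, and the use of the sample standard deviation are all hypothetical.

```python
from statistics import mean, stdev

def is_base_rate_flagged(candidate_base_rate, peer_base_rates, max_sd=3.0):
    """Flag a base rate that sits more than max_sd standard deviations from peers."""
    if len(peer_base_rates) < 2:
        return False  # not enough peers to estimate a spread
    center = mean(peer_base_rates)
    spread = stdev(peer_base_rates)
    if spread == 0:
        return candidate_base_rate != center
    return abs(candidate_base_rate - center) / spread > max_sd

# Hypothetical base rates from similar providers.
peers = [7800, 8000, 8200, 7900, 8100]
flag_high = is_base_rate_flagged(9000, peers)
flag_typical = is_base_rate_flagged(8100, peers)
```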

Data Quality Checks

Rate Reasonableness Validation:

-- Example quality control checks
WHERE imputed_rate > 0 
  AND imputed_rate < 1000000  -- Maximum reasonable rate cap
  AND imputed_rate > (medicare_rate * 0.1)  -- Minimum 10% of Medicare
  AND imputed_rate < (medicare_rate * 20)   -- Maximum 20x Medicare

For detailed methodology and examples, see Imputation Tiers.

Accuracy

The Accuracy component evaluates every rate in the CLD pipeline to determine its reliability and assign confidence scores. This process ensures that only the highest-quality rates are selected as canonical rates for each rate object.

Validation Process

The accuracy system operates in two main stages:

Accuracy Assessment

Each rate receives a validation score based on its source and characteristics:

Source Quality: Raw MRF rates score higher than transformed or imputed rates
Outlier Detection: Rates are compared against expected distributions to identify outliers
Benchmark Validation: Rates are cross-referenced with Medicare rates, claims data, and other benchmarks
Statistical Likelihood: For tied rates, likelihood scores based on expected distributions determine precedence

Business rules are applied to refine accuracy scores:

Drug Rate Preferences: For pharmaceutical codes, hospital rates are preferred over payer rates when tied
Provider Type Considerations: Different validation criteria for hospitals, ASCs, imaging centers, etc.
Code Type Logic: Specialized rules for MS-DRGs, HCPCS, APCs, and other code types

Canonical Rate Score Scale

The final accuracy assessment produces a 0-5 canonical rate score:

Score	Interpretation	Description
5	Validated	Rate confirmed through cross-validation between Payer + Hospital MRF raw or transformed data
4	Raw Posted Rate	Direct dollar amount from payer/hospital MRF, not an outlier
3	Validated Transform/Imputation	Calculated or imputed rate confirmed by benchmark validation
2	Unvalidated Transform/Imputation	Calculated or imputed rate, reasonable but not benchmark-validated
1	Outlier	Rate identified as statistical outlier, low confidence
0	No Rate	No rate data available for this rate object

Canonical Rate Selection

The canonical rate selection process chooses the "best" rate for each rate object through a hierarchical approach:

Within Sub-Version

Highest Score Wins: Select the rate with the highest canonical rate score
Tie-Breaking: When multiple rates have the same score, use likelihood percentiles from the validation score decimals
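The two rules above can be sketched as a single comparison. This assumes, for illustration only, that a validation score such as 4.80 carries the canonical rate score in its integer part and the likelihood percentile in its decimals; the field names and sample values are hypothetical.

```python
def select_canonical_rate(candidates):
    """Pick the best candidate within one sub-version.

    candidates: list of dicts with 'rate' and 'validation_score' keys.
    Returns the winning dict, or None when there are no candidates.
    """
    if not candidates:
        return None

    def sort_key(candidate):
        score = int(candidate["validation_score"])          # canonical rate score
        likelihood = candidate["validation_score"] - score  # tie-breaker
        return (score, likelihood)

    return max(candidates, key=sort_key)

# Hypothetical candidates for one rate object.
candidates = [
    {"rate": 1000.0, "validation_score": 4.35},
    {"rate": 1100.0, "validation_score": 4.80},
    {"rate": 900.0, "validation_score": 2.99},
]
winner = select_canonical_rate(candidates)
```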

Across Sub-Versions

When combining multiple sub-versions into prod_combined_all:

Score Hierarchy: Compare rates across recent sub-versions, selecting highest score
Rate Type Hierarchy (for ties):
- Posted Rates (direct MRF data)
- Real World Rates (claims-based)
- Enhanced Rates (sophisticated imputations)
- Benchmark Rates (Medicare/reference rates)
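A minimal sketch of the cross-sub-version comparison, assuming each candidate carries its canonical rate score and a rate type label. The label strings, field names, and sample rows are hypothetical; only the ordering follows the hierarchy listed above.

```python
# Lower number = higher priority when scores tie.
RATE_TYPE_PRIORITY = {
    "posted": 0,       # direct MRF data
    "real_world": 1,   # claims-based
    "enhanced": 2,     # imputations
    "benchmark": 3,    # Medicare/reference rates
}

def select_across_sub_versions(candidates):
    """Pick the highest score; break ties with the rate type hierarchy."""
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda c: (-c["canonical_rate_score"], RATE_TYPE_PRIORITY[c["rate_type"]]),
    )

# Hypothetical candidates from three sub-versions.
candidates_by_sub_version = [
    {"sub_version": "a", "canonical_rate_score": 3, "rate_type": "enhanced", "rate": 500.0},
    {"sub_version": "b", "canonical_rate_score": 3, "rate_type": "real_world", "rate": 520.0},
    {"sub_version": "c", "canonical_rate_score": 2, "rate_type": "posted", "rate": 480.0},
]
best = select_across_sub_versions(candidates_by_sub_version)
```

In this example the posted rate loses despite its higher type priority, because the rate type hierarchy only applies among candidates that share the top score.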

Outlier Detection Methods

See Outlier Detection for details on how outliers are identified and handled in the accuracy process.

For implementation details and scoring methodologies, see Accuracy Scores and Canonical Selection.
