3. Data Processing
Transformations
Rates may be stored in MRFs as per diems or percentages. A transformation is applied to convert these values into dollar amounts.
Per Diem to Dollar
To convert a per diem rate to a dollar amount, we multiply the per diem rate by the geometric mean length of stay (GLOS) from Medicare (as of 2025-07-16). Soon, we'll start estimating provider-specific GLOS values using historical claims data.
GLOS source table: tq_production.reference_legacy.ref_cms_msdrg
The geometric mean is the product of n values, raised to the power of 1/n:
$$\mathrm{GLOS} = \left( \prod_{i=1}^{n} \mathrm{LOS}_i \right)^{1/n}$$
where $\mathrm{LOS}_i$ is the length of stay for claim $i$ and $n$ is the total number of claims.
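The per diem conversion can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's code: the function names are hypothetical, and in practice the GLOS value would be looked up from the reference table rather than recomputed from claims.

```python
import math

def geometric_mean(lengths_of_stay):
    """Geometric mean: the n-th root of the product of n positive values."""
    if not lengths_of_stay or any(v <= 0 for v in lengths_of_stay):
        raise ValueError("need at least one value, and all values must be positive")
    # Averaging logs is equivalent to taking the n-th root of the product,
    # and avoids overflow when many values are multiplied together.
    return math.exp(sum(math.log(v) for v in lengths_of_stay) / len(lengths_of_stay))

def per_diem_to_dollar(per_diem_rate, glos):
    """Estimated dollar amount for a stay: per diem rate times GLOS."""
    return per_diem_rate * glos

# Illustrative values only: two stays of 2 and 8 days have a geometric mean of 4 days.
glos = geometric_mean([2, 8])
estimated_rate = per_diem_to_dollar(1500.0, glos)
```

The geometric mean is less sensitive to a few very long stays than the arithmetic mean, which is 5 days for the same two stays.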
Percentage to Dollar
To convert a percentage rate to a dollar amount, we multiply the percentage by the gross charge.
There are two sources of gross charge data:
- Hospital MRFs
- Claims Data
For more on gross charges, see data collection.
We compute a dollar rate for each available gross charge value. As an example, all of the following are computed:
- percentage * gross charge from hospital MRF
- percentage * gross charge from claims (provider median)
- percentage * CBSA average gross charge from hospital MRF
- percentage * STATE average gross charge from hospital MRF
- percentage * CBSA average gross charge from claims
- percentage * STATE average gross charge from claims
For gross charge averages estimated at the CBSA or STATE level, we use "CCR adjustments" to adjust the gross charge based on the provider's cost-to-charge ratio (CCR). See data collection for more details.
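As a rough sketch of the "one dollar rate per gross charge source" idea, the function below multiplies a percentage by every gross charge that is available. The source labels and the function name are hypothetical, and CCR adjustment of the CBSA and STATE averages is assumed to have happened upstream.

```python
def percentage_to_dollar_rates(percentage, gross_charges):
    """Return one dollar rate per available gross charge source.

    percentage is a fraction (0.45 means 45% of charges). gross_charges maps
    a hypothetical source label to a gross charge, or None when unavailable.
    """
    return {
        source: round(percentage * charge, 2)
        for source, charge in gross_charges.items()
        if charge is not None
    }

rates = percentage_to_dollar_rates(
    0.45,
    {
        "hospital_mrf": 12000.0,
        "claims_provider_median": 11500.0,
        "cbsa_avg_hospital_mrf": None,  # not available for this code
        "state_avg_claims": 10000.0,
    },
)
```

Keeping every computed rate, rather than choosing one charge source up front, leaves the choice to the later accuracy and selection steps.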
Imputation
Imputation fills gaps in MRF data where rate information is missing or incomplete. Rather than leaving rate objects without rates, the CLD pipeline infers missing rates from patterns observed in the available data, such as shared base rates and recurring percentages.
The imputation system operates on a tier-based hierarchy, where each tier represents a different methodology for inferring missing rates, ranked from most to least reliable:
Tier 1: Exact Code and Rate Match
This tier uses rates that exactly match the rate object with minimal transformation:
- Payer/Hospital MRF Dollar Amounts: Direct dollar rates from MRF files
- Hospital MRF Estimated Allowed Amounts: Calculated allowed amounts from hospital data
- Percentage with Exact Charge Match: MRF percentages multiplied by exact gross charge matches
- Claims-Based Allowed Amounts: Validated rates from Komodo claims data with sufficient sample size (N > 11)
Tier 2: Exact Code Match with Translation
This tier involves exact code matches but requires data transformation:
- Percentage with Estimated Charge: MRF percentages multiplied by market-average gross charges when exact matches aren't available
- Per Diem to Dollar Conversion: Per diem rates multiplied by Medicare geometric mean length of stay (GLOS)
- APR-DRG to MS-DRG Mapping: APR-DRG rates transformed using TQ's custom crosswalk
- APC to HCPCS Mapping: APC rates mapped to HCPCS codes using CMS reference data
Tier 3: Inferred Provision Match
This tier identifies common patterns in rate structures to infer missing rates:
MS-DRG Base Rates
- Case Rate Inference: When multiple MS-DRG rates divided by their CMS weights share a common base rate, this base rate is used to impute missing MS-DRG rates
- Base Percentage Rates: When multiple MS-DRG percentages share a common value, that value is treated as a global inpatient percentage rate
HCPCS Outpatient Rates
- Outpatient Procedure Grouper (OPG) Rates: Common rates identified across related outpatient procedure codes
- Outpatient Base Percentage: Shared percentage rates across multiple HCPCS codes
Revenue Code (RC) Based Rates
- RC Global Rates: When 30+ revenue codes share a common rate, that rate is treated as a global reimbursement rate
- RC Family Rates: Specialized rates for specific service areas (NICU, ICU, CCU, etc.) identified through revenue code pattern analysis
Imputation Process
Stage 1: Data Preparation and Chunking
- Payer-Based Partitioning: Rate objects are processed in chunks partitioned by payer_id to optimize parallel processing and memory usage
- Feature Enrichment: Each rate object is enriched with code characteristics (MRF base rate, OPG)
- Long-Format Transformation: Raw rates are pivoted from wide format (multiple rate columns) to long format for pattern analysis
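The partitioning and wide-to-long steps can be illustrated with plain Python dictionaries. This is only a sketch: the column names (`negotiated_dollar`, `per_diem`, and so on) and both function names are hypothetical, and the real pipeline presumably does this in SQL or a dataframe engine.

```python
from collections import defaultdict

def partition_by_payer(rate_objects):
    """Group rate objects into chunks keyed by payer_id."""
    chunks = defaultdict(list)
    for obj in rate_objects:
        chunks[obj["payer_id"]].append(obj)
    return dict(chunks)

def wide_to_long(rate_object, rate_columns):
    """Pivot one wide row into one long row per populated rate column."""
    # Everything that is not a rate column is carried along as an identifier.
    keys = {k: v for k, v in rate_object.items() if k not in rate_columns}
    return [
        {**keys, "rate_source": col, "rate_value": rate_object[col]}
        for col in rate_columns
        if rate_object.get(col) is not None
    ]

row = {
    "payer_id": 7,
    "billing_code": "470",
    "negotiated_dollar": 15000.0,
    "negotiated_percentage": None,  # missing values produce no long row
    "per_diem": 2500.0,
}
long_rows = wide_to_long(row, ["negotiated_dollar", "negotiated_percentage", "per_diem"])
```

In long format every rate sits in the same `rate_value` column, so a single grouping query can count how often each value occurs.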
Stage 2: Statistical Pattern Recognition
Frequency Analysis by Provider-Network-Payer Groups:
    -- Example pattern recognition for MS-DRG base rates
    SELECT
        payer_id,
        network_id,
        provider_id,
        ROUND(rate_value / cms_weight, 0) AS candidate_base_rate,
        COUNT(*) AS frequency,
        -- Multiply by 1.0 so the ratio is not truncated by integer division
        COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (
            PARTITION BY payer_id, network_id, provider_id
        ) AS coverage_ratio
    FROM rates_long
    WHERE billing_code_type = 'MS-DRG'
    GROUP BY
        payer_id, network_id, provider_id,
        ROUND(rate_value / cms_weight, 0)
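For readers who prefer to trace the logic outside SQL, the same frequency analysis can be sketched in Python. The field names are assumed to mirror the query above, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def candidate_base_rates(rows):
    """Count candidate base rates per (payer, network, provider) group."""
    counts = defaultdict(Counter)
    for r in rows:
        group = (r["payer_id"], r["network_id"], r["provider_id"])
        # Note: Python's round() uses round-half-to-even, which may differ
        # from SQL ROUND on values that end in exactly .5.
        counts[group][round(r["rate_value"] / r["cms_weight"])] += 1

    results = []
    for group, counter in counts.items():
        total = sum(counter.values())
        for base_rate, freq in counter.most_common():
            results.append({
                "group": group,
                "candidate_base_rate": base_rate,
                "frequency": freq,
                "coverage_ratio": freq / total,
            })
    return results

# Illustrative rows: three rates imply a 10,000 base rate, one implies 9,000.
example_rows = [
    {"payer_id": 1, "network_id": 1, "provider_id": 1, "rate_value": rv, "cms_weight": w}
    for rv, w in [(12000.0, 1.2), (8000.0, 0.8), (25000.0, 2.5), (9000.0, 1.0)]
]
results = candidate_base_rates(example_rows)
```

Here the 10,000 candidate has a frequency of 3 and a coverage ratio of 0.75, so it would not meet a 90% coverage requirement.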
RC Family Matching with Jaccard Index:
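    -- Jaccard Index for RC family matching: |observed ∩ reference| / |observed ∪ reference|
    -- (observed_codes and reference_codes are illustrative array column names)
    CARDINALITY(ARRAY_INTERSECT(observed_codes, reference_codes)) * 1.0
        / CARDINALITY(ARRAY_UNION(observed_codes, reference_codes)) AS jaccard_index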
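The Jaccard Index is the size of the intersection of two sets divided by the size of their union. A minimal Python sketch follows; the revenue codes in the example are illustrative and are not taken from the actual RC family reference mapping.

```python
def jaccard_index(observed_codes, reference_codes):
    """Jaccard similarity between two collections of revenue codes."""
    observed, reference = set(observed_codes), set(reference_codes)
    union = observed | reference
    if not union:
        return 0.0  # two empty sets: treat as no similarity
    return len(observed & reference) / len(union)

# Two of the four distinct codes are shared, so the index is 0.5.
score = jaccard_index(["0173", "0174", "0250"], ["0172", "0173", "0174"])
```

A score of 1.0 means the observed codes match the reference family exactly, and 0.0 means no code is shared.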
Stage 3: Base Rate Validation and Application
- Primary Validation: Check frequency thresholds and coverage requirements for each candidate base rate
- Rate Reasonableness: Apply business logic checks (e.g., base rates shouldn't be negative, shouldn't exceed 10x Medicare rates)
Rate Application with Multipliers:
    -- Example: Applying MS-DRG base rate with CMS weight multiplier
    CASE
        WHEN msdrg_n_freq >= 10 AND msdrg_coverage_ratio >= 0.90
            THEN msdrg_candidate_base_rate * cms_weight
        ELSE NULL
    END AS msdrg_imputed_rate
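The same rule reads as follows in Python. The threshold defaults are copied from the CASE expression above; the function and parameter names are hypothetical.

```python
def impute_msdrg_rate(base_rate, cms_weight, n_freq, coverage_ratio,
                      min_freq=10, min_coverage=0.90):
    """Return base_rate * cms_weight when the base rate is well supported, else None."""
    if n_freq >= min_freq and coverage_ratio >= min_coverage:
        return base_rate * cms_weight
    return None

# A 10,000 base rate seen 12 times with 95% coverage, applied to a weight of 1.5.
imputed = impute_msdrg_rate(10000.0, 1.5, n_freq=12, coverage_ratio=0.95)
```

Returning `None` instead of a low-confidence number keeps weakly supported base rates out of downstream scoring.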
Stage 4: Gross Charge Integration
Hierarchical Charge Selection for Percentage-Based Imputations:
- Provider-Specific Charges: Use exact provider-code gross charges when available
- CBSA Market Averages: Fall back to metropolitan area averages with CCR adjustments
- State Market Averages: Use state-level averages when CBSA averages are unavailable
- National Benchmarks: Apply national averages only for rare codes with insufficient regional data
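The fallback order above can be sketched as a simple first-available lookup. The level names and the function are hypothetical, and the CCR adjustment that applies to market averages is left out of this sketch.

```python
# Fallback order, from most to least specific.
FALLBACK_ORDER = ("provider", "cbsa", "state", "national")

def select_gross_charge(charges):
    """Return (level, charge) for the first level that has a gross charge."""
    for level in FALLBACK_ORDER:
        charge = charges.get(level)
        if charge is not None:
            return level, charge
    return None, None

# No provider-specific charge is available, so the CBSA average is used.
level, charge = select_gross_charge({"provider": None, "cbsa": 9800.0, "state": 9500.0})
```

Returning the level alongside the charge lets later steps record which tier of the hierarchy produced the rate.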
CCR Adjustment Application:
-- Adjust market-average gross charges based on provider cost-to-charge ratio
adjusted_gross_charge = market_gross_charge * (provider_ccr / market_avg_ccr)
Quality Controls
The imputation system implements multiple layers of quality control to ensure reliability and accuracy:
Statistical Thresholds
Minimum Sample Size Requirements:
- MS-DRG Case Rates: ≥10 observations sharing common base rate
- MS-DRG Percentage Rates: ≥50 observations with ≥90% coverage
- Outpatient Procedure Grouper (OPG): ≥15 codes with ≥70% coverage, minimum 40 total codes in grouper
- Outpatient Percentage Rates: ≥200 observations with ≥90% coverage
- Revenue Code Global Rates: ≥30 distinct revenue codes sharing common rate
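One way to keep these thresholds in a single place is a small lookup table with a shared check. The sketch below is one reading of the list above, not the pipeline's configuration: the keys, field names, and function are hypothetical.

```python
# Minimum sample sizes from the list above. A coverage of None means the
# list states no coverage threshold for that method.
THRESHOLDS = {
    "msdrg_case_rate": {"min_obs": 10, "min_coverage": None},
    "msdrg_percentage": {"min_obs": 50, "min_coverage": 0.90},
    "opg": {"min_obs": 15, "min_coverage": 0.70, "min_group_size": 40},
    "outpatient_percentage": {"min_obs": 200, "min_coverage": 0.90},
    "rc_global": {"min_obs": 30, "min_coverage": None},
}

def passes_thresholds(method, n_obs, coverage=None, group_size=None):
    """Check observation count, coverage, and grouper size for one method."""
    rule = THRESHOLDS[method]
    if n_obs < rule["min_obs"]:
        return False
    min_cov = rule.get("min_coverage")
    if min_cov is not None and (coverage is None or coverage < min_cov):
        return False
    min_size = rule.get("min_group_size")
    if min_size is not None and (group_size is None or group_size < min_size):
        return False
    return True
```

For example, `passes_thresholds("opg", 15, 0.70, group_size=40)` passes, while the same call with `group_size=30` fails on the grouper size requirement.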
Coverage Requirements:
- MS-DRG Imputations: Base rate must represent >90% of available MS-DRG rates for the provider-network-payer combination
- RC Family Matching: Jaccard Index similarity score >25% between observed revenue code array and reference mapping
- Cross-Provider Validation: Base rates flagged if they deviate >3 standard deviations from similar providers
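The cross-provider check can be illustrated with a simple standard deviation test. How "similar providers" are chosen, and whether a provider is excluded from its own peer statistics, is not specified here, so this sketch assumes the peer group is given and includes every provider in the mean and standard deviation.

```python
import statistics

def flag_deviating_base_rates(base_rates, max_sd=3.0):
    """Return providers whose base rate is more than max_sd standard
    deviations away from the peer-group mean."""
    values = list(base_rates.values())
    if len(values) < 2:
        return []  # a standard deviation needs at least two values
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []  # all peers share the same base rate
    return [p for p, v in base_rates.items() if abs(v - mean) > max_sd * sd]

# Illustrative peer group: twenty providers at 10,000 and one at 50,000.
peers = {f"provider_{i}": 10000.0 for i in range(20)}
peers["provider_x"] = 50000.0
flagged = flag_deviating_base_rates(peers)
```

With very small peer groups no value can sit more than 3 sample standard deviations from the mean, so a check of this kind only becomes meaningful once the group has roughly a dozen or more members.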
Data Quality Checks
Rate Reasonableness Validation:
    -- Example quality control checks
    WHERE imputed_rate > 0
      AND imputed_rate < 1000000                -- Maximum reasonable rate cap
      AND imputed_rate > (medicare_rate * 0.1)  -- Minimum 10% of Medicare
      AND imputed_rate < (medicare_rate * 20)   -- Maximum 20x Medicare
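Expressed as a Python predicate, the same filter looks like this. The bounds are copied from the example query above; the function and parameter names are hypothetical.

```python
def is_reasonable(imputed_rate, medicare_rate, cap=1_000_000):
    """True when the rate is positive, under the cap, and strictly
    between 10% and 20x of the Medicare rate."""
    return (
        0 < imputed_rate < cap
        and medicare_rate * 0.1 < imputed_rate < medicare_rate * 20
    )
```

With a Medicare rate of 10,000, an imputed rate of 15,000 passes, while 500 (below 10% of Medicare) and 250,000 (above 20x Medicare) are rejected.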
For detailed methodology and examples, see Imputation Tiers.
Accuracy
The Accuracy component evaluates every rate in the CLD pipeline to determine its reliability and assign confidence scores. This process ensures that only the highest-quality rates are selected as canonical rates for each rate object.
Validation Process
The accuracy system operates in two main stages:
Accuracy Assessment
Each rate receives a validation score based on its source and characteristics:
- Source Quality: Raw MRF rates score higher than transformed or imputed rates
- Outlier Detection: Rates are compared against expected distributions to identify outliers
- Benchmark Validation: Rates are cross-referenced with Medicare rates, claims data, and other benchmarks
- Statistical Likelihood: For tied rates, likelihood scores based on expected distributions determine precedence
Business rules are applied to refine accuracy scores:
- Drug Rate Preferences: For pharmaceutical codes, hospital rates are preferred over payer rates when tied
- Provider Type Considerations: Different validation criteria for hospitals, ASCs, imaging centers, etc.
- Code Type Logic: Specialized rules for MS-DRGs, HCPCS, APCs, and other code types
Canonical Rate Score Scale
The final accuracy assessment produces a canonical rate score from 0 to 5:
| Score | Interpretation | Description |
|---|---|---|
| 5 | Validated | Rate confirmed through cross-validation between Payer + Hospital MRF raw or transformed data |
| 4 | Raw Posted Rate | Direct dollar amount from payer/hospital MRF, not an outlier |
| 3 | Validated Transform/Imputation | Calculated or imputed rate confirmed by benchmark validation |
| 2 | Unvalidated Transform/Imputation | Calculated or imputed rate, reasonable but not benchmark-validated |
| 1 | Outlier | Rate identified as statistical outlier, low confidence |
| 0 | No Rate | No rate data available for this rate object |
Canonical Rate Selection
The canonical rate selection process chooses the "best" rate for each rate object through a hierarchical approach:
Within Sub-Version
- Highest Score Wins: Select the rate with the highest canonical rate score
- Tie-Breaking: When multiple rates have the same score, compare the likelihood percentiles stored in the decimal portion of the validation score
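A minimal sketch of within-sub-version selection follows. It assumes the validation score stores the canonical rate score in its integer part and a likelihood percentile in its decimal part, and that a higher percentile wins a tie. Both points are assumptions made for illustration, as are the field and function names.

```python
import math

def select_canonical_rate(rates):
    """Pick the rate with the highest canonical score, breaking ties on
    the decimal (likelihood) part of the validation score."""
    def sort_key(rate):
        score = rate["validation_score"]
        canonical_score = math.floor(score)   # assumed 0-5 canonical rate score
        likelihood = score - canonical_score  # assumed tie-breaker, higher is better
        return (canonical_score, likelihood)
    return max(rates, key=sort_key)

# Illustrative candidates: "a" and "b" tie on score 4, and "b" has the higher likelihood.
candidates = [
    {"id": "a", "validation_score": 4.35},
    {"id": "b", "validation_score": 4.80},
    {"id": "c", "validation_score": 3.99},
]
best = select_canonical_rate(candidates)
```

Under this encoding the key is equivalent to sorting on the validation score itself; splitting it into two parts simply makes the two-step rule explicit.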
Across Sub-Versions
When combining multiple sub-versions into prod_combined_all:
- Score Hierarchy: Compare rates across recent sub-versions, selecting highest score
- Rate Type Hierarchy (for ties), from highest to lowest priority:
  - Posted Rates (direct MRF data)
  - Real World Rates (claims-based)
  - Enhanced Rates (sophisticated imputations)
  - Benchmark Rates (Medicare/reference rates)
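The cross-sub-version rule can be sketched as a two-part sort key: canonical rate score first, then rate type priority. The rate type labels, field names, and function name are hypothetical, and how "recent" sub-versions are chosen is outside this sketch.

```python
# Lower number means higher priority when scores are tied.
RATE_TYPE_PRIORITY = {"posted": 0, "real_world": 1, "enhanced": 2, "benchmark": 3}

def select_across_sub_versions(candidates):
    """Pick the highest-scoring candidate, breaking ties by rate type."""
    return min(
        candidates,
        key=lambda c: (-c["canonical_rate_score"], RATE_TYPE_PRIORITY[c["rate_type"]]),
    )

# Illustrative candidates: v1 and v2 tie on score 3, and real-world outranks enhanced.
winner = select_across_sub_versions([
    {"sub_version": "v1", "canonical_rate_score": 3, "rate_type": "enhanced"},
    {"sub_version": "v2", "canonical_rate_score": 3, "rate_type": "real_world"},
    {"sub_version": "v3", "canonical_rate_score": 2, "rate_type": "posted"},
])
```

The posted rate from v3 loses despite its rate type because the score comparison runs first and the rate type is consulted only for ties.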
Outlier Detection Methods
See Outlier Detection for details on how outliers are identified and handled in the accuracy process.
For implementation details and scoring methodologies, see Accuracy Scores and Canonical Selection.