Skip to main content
Version: 3.0

Stage 3: Raw Data Ingestion

Four data sources are ingested, mapped onto the Rate Object Space, and merged into a single combined raw table.

Four Data Sources

SourceWhat it containsCoverageKey challenge
Payer MRFNegotiated rates filed by insurers (Transparency in Coverage), keyed by provider NPI and payer_data_networkAll payer-network-provider combinations the payer reportsPG rates use group NPI or EIN → provider_id mapping; code crosswalks needed
Hospital MRFRates published by hospitals (Hospital Price Transparency), by plan name. Fee schedule, case rate, % billed, per diemHospital-specific ratesPlan names must be mapped to Clear Rates network_ids via Plan Bridge
Komodo allowablesReal-world allowed amounts from claims data at NPI × payer, ZIP3, state, national levelsWhere Clear Rates has licensed Komodo dataClaims-based (what was actually paid), not the contracted rate
Gross chargesHospital's undiscounted list price from chargemasters. MRF-reported and Komodo-derived, 6 fallbacksAll hospitals with public chargemastersUsed for pct-to-dollar transformations; Infinity values must be filtered before percentile calculations

Pipeline Flow

1
Payer MRF
core_rates → NPI→provider_id → bill type → code crosswalks (APR-DRG, APC, EAPG) → tmp_raw_mrf_payer_rates. 6 payer chunks in parallel. PG rates take a separate 3-step path via build_pg_rates_prebuild → build_payer_pg_mrf → apply_custom_pg_logic.
tmp_raw_mrf_payer_rates
2
Hospital MRF
Plan Bridge first (keyword matching, confidence scoring) → hospital_rates joined → network_id resolved → tmp_raw_mrf_hospital_rates. NDC drug codes crosswalked before hospital build. 3 provider chunks in parallel.
tmp_raw_mrf_hospital_rates
3
Supporting Sources
Gross charges (7 state chunks) + Komodo allowables + device gross charges run in parallel.
tmp_raw_gross_charges / tmp_raw_komodo
4
NDC Preprocessing
Runs once before the hospital pipeline. Normalizes NDC formats, maps to HCPCS via Medispan.
tmp_ref_ndc_crosswalk
5
Combined Raw
All four sources LEFT JOINed onto the Rate Object Space. Most ROIDs have NULL rates — they are gap candidates for imputation.
tmp_int_combined_raw
6
Validations + QA
Row count checks, uniqueness tests, non-blocking QA trigger.

What "Raw" Means

A rate in tmp_int_combined_raw has been:

  • Mapped to a ROID (payer + network + provider + code combination)
  • Placed in a source-specific column (e.g. payer_negotiated_rate, hospital_fee_schedule_dollar)

It has not been transformed, benchmarked, imputed, or scored. Most ROIDs have NULL values for most columns — that is expected.

The Wide Raw Table

tmp_int_combined_raw is partitioned by payer_id, network_type, and bill_type. It carries separate columns for each source and methodology:

Column groupExamples
Payer MRFpayer_negotiated_rate, payer_fee_schedule_rate, payer_percentage_rate
Hospital MRFhospital_fee_schedule_dollar, hospital_pct_of_total_billed_charges_pct, hospital_per_diem_rate, hospital_case_rate_dollar
Komodokomodo_allowed_npi, komodo_allowed_zip3, komodo_allowed_state, komodo_allowed_national
Gross chargesgc_hosp, gc_hosp_cbsa, gc_hosp_state, gc_komodo, gc_komodo_cbsa, gc_komodo_state (6 sources)

For any given ROID, most of these columns are NULL.

Chunking Pattern

Large source tables are split across parallel Airflow tasks to avoid single-task timeouts.

  1. Compute chunksget_payer_chunks() or get_ros_payer_chunks() groups payers or providers into N buckets
  2. Parallel execution — each chunk runs as a separate Airflow task via .expand() (dynamic task mapping)
  3. Union — a consolidation task merges all chunk outputs into the final table
Key gotcha: hospital_rates.payer_id is BIGINT

hospital_rates.payer_id is BIGINT while the rest of the pipeline uses VARCHAR. Always CAST(hr.payer_id AS VARCHAR) when joining hospital data to any Clear Rates table.