Stage 3: Raw Data Ingestion
Four data sources are ingested, mapped onto the Rate Object Space, and merged into a single combined raw table.
Four Data Sources
| Source | What it contains | Coverage | Key challenge |
|---|---|---|---|
| Payer MRF | Negotiated rates filed by insurers (Transparency in Coverage), keyed by provider NPI and payer_data_network | All payer-network-provider combinations the payer reports | PG rates use group NPI or EIN → provider_id mapping; code crosswalks needed |
| Hospital MRF | Rates published by hospitals (Hospital Price Transparency), by plan name. Fee schedule, case rate, % billed, per diem | Hospital-specific rates | Plan names must be mapped to Clear Rates network_ids via Plan Bridge |
| Komodo allowables | Real-world allowed amounts from claims data at NPI × payer, ZIP3, state, national levels | Where Clear Rates has licensed Komodo data | Claims-based (what was actually paid), not the contracted rate |
| Gross charges | Hospital's undiscounted list price from chargemasters. MRF-reported and Komodo-derived, 6 fallbacks | All hospitals with public chargemasters | Used for pct-to-dollar transformations; Infinity values must be filtered before percentile calculations |
Pipeline Flow
1
Payer MRF
core_rates → NPI→provider_id → bill type → code crosswalks (APR-DRG, APC, EAPG) → tmp_raw_mrf_payer_rates. 6 payer chunks in parallel. PG rates take a separate 3-step path via build_pg_rates_prebuild → build_payer_pg_mrf → apply_custom_pg_logic.
→ tmp_raw_mrf_payer_rates
2
Hospital MRF
Plan Bridge first (keyword matching, confidence scoring) → hospital_rates joined → network_id resolved → tmp_raw_mrf_hospital_rates. NDC drug codes crosswalked before hospital build. 3 provider chunks in parallel.
→ tmp_raw_mrf_hospital_rates
3
Supporting Sources
Gross charges (7 state chunks) + Komodo allowables + device gross charges run in parallel.
→ tmp_raw_gross_charges / tmp_raw_komodo
4
NDC Preprocessing
Runs once before the hospital pipeline. Normalizes NDC formats, maps to HCPCS via Medispan.
→ tmp_ref_ndc_crosswalk
5
Combined Raw
All four sources LEFT JOINed onto the Rate Object Space. Most ROIDs have NULL rates — they are gap candidates for imputation.
→ tmp_int_combined_raw
6
Validations + QA
Row count checks, uniqueness tests, non-blocking QA trigger.
→ —
What "Raw" Means
A rate in tmp_int_combined_raw has been:
- Mapped to a ROID (payer + network + provider + code combination)
- Placed in a source-specific column (e.g.
payer_negotiated_rate,hospital_fee_schedule_dollar)
It has not been transformed, benchmarked, imputed, or scored. Most ROIDs have NULL values for most columns — that is expected.
The Wide Raw Table
tmp_int_combined_raw is partitioned by payer_id, network_type, and bill_type. It carries separate columns for each source and methodology:
| Column group | Examples |
|---|---|
| Payer MRF | payer_negotiated_rate, payer_fee_schedule_rate, payer_percentage_rate |
| Hospital MRF | hospital_fee_schedule_dollar, hospital_pct_of_total_billed_charges_pct, hospital_per_diem_rate, hospital_case_rate_dollar |
| Komodo | komodo_allowed_npi, komodo_allowed_zip3, komodo_allowed_state, komodo_allowed_national |
| Gross charges | gc_hosp, gc_hosp_cbsa, gc_hosp_state, gc_komodo, gc_komodo_cbsa, gc_komodo_state (6 sources) |
For any given ROID, most of these columns are NULL.
Chunking Pattern
Large source tables are split across parallel Airflow tasks to avoid single-task timeouts.
- Compute chunks —
get_payer_chunks()orget_ros_payer_chunks()groups payers or providers into N buckets - Parallel execution — each chunk runs as a separate Airflow task via
.expand()(dynamic task mapping) - Union — a consolidation task merges all chunk outputs into the final table
Key gotcha: hospital_rates.payer_id is BIGINT
hospital_rates.payer_id is BIGINT while the rest of the pipeline uses VARCHAR. Always CAST(hr.payer_id AS VARCHAR) when joining hospital data to any Clear Rates table.