Raw Data Ingestion
Four data sources are ingested, mapped onto the Rate Object Space, and merged into a single combined raw table. "Raw" means: mapped to a ROID but not yet transformed, benchmarked, or imputed.
Overview
The Four Data Sources
| Source | What it contains | Coverage | Key challenge |
|---|---|---|---|
| Payer MRF | Negotiated rates filed by insurers (Transparency in Coverage). Keyed by provider NPI and payer_data_network. | All payer-network-provider combinations the payer reports | PG rates use group NPI or EIN → individual provider_id mapping; code crosswalks needed (APR-DRG, APC, EAPG) |
| Hospital MRF | Rates published by hospitals (Hospital Price Transparency), by plan name. Includes fee schedule, case rate, % billed charges, per diem. | Hospital-specific rates for all hospitals with public files | Plan names must be mapped to CLD network_ids via the Plan Bridge — the most complex step in raw ingestion |
| Komodo allowables | Real-world allowed amounts from claims data at NPI × payer, ZIP3, state, and national levels. | Where CLD has licensed Komodo data; varies by geography/payer | Claims-based (what was actually paid), not the contracted rate |
| Gross charges | Hospital's undiscounted list price from chargemasters. MRF-reported and Komodo-derived, at provider/CBSA/state levels (6 fallbacks). | All hospitals with public chargemasters | Used for pct-to-dollar transformations and as upper-bound benchmarks. Infinity values must be filtered before percentile calculations. |
Ingestion sequence:
- Payer MRF (6 payer chunks) —
core_rates→ NPI→provider_id→ bill type logic → code crosswalks (APR-DRG, APC, EAPG) →tmp_raw_mrf_payer_rates. PG rates take a separate 3-step path: prebuild → pg_mrf → custom_pg_logic. - Hospital MRF (3 provider chunks) — Plan Bridge built first (keyword matching, confidence scoring) →
hospital_ratesjoined to plan bridge →network_idresolved →tmp_raw_mrf_hospital_rates. NDC drug codes crosswalked to HCPCS before hospital build. - Supporting sources — Gross charges (7 state chunks) + Komodo allowables + device gross charges run in parallel.
- NDC preprocessing (once) —
tmp_ndc_derived_hcpcs: normalizes NDC formats, maps to HCPCS via Medispan. Runs before hospital pipeline begins. - Combined raw — All four sources LEFT JOINed onto the Rate Object Space →
tmp_int_combined_raw(partitioned bypayer_id,network_type,bill_type) - Validations + QA — Row count checks, uniqueness tests, non-blocking QA trigger fires to
core_licensable_data_qaDAG.
What "Raw" Means
A rate in tmp_int_combined_raw is raw in the following sense: it has been mapped to a ROID (payer × provider × code × network), but it has not been:
- Transformed (percentage → dollar, per diem → case rate, etc.)
- Benchmarked against Medicare or market references
- Imputed for ROIDs with no direct rate
- Scored for accuracy or outlier status
The combined raw table is wide: it has separate columns for each source and contract methodology (e.g., payer_negotiated_rate, hospital_fee_schedule_dollar, hospital_pct_of_total_billed_charges_pct, hospital_per_diem_rate, komodo_allowed_npi, plus 6 gross charge columns). Most of these are NULL for any given ROID.
Chunking Pattern
The same chunk-then-union pattern repeats across all sources:
- Compute chunks:
get_payer_chunks()orget_ros_payer_chunks()queries the ROS to group payers/providers into N buckets based on row counts. - Parallel execution: Each chunk runs as a separate Airflow task via
.expand()(dynamic task mapping). Typical counts: ~6 payer chunks for payer MRF, 3 provider chunks for hospital MRF, 7 state chunks for gross charges. - Union: A union task consolidates chunks into the final table.
Key gotcha: hospital_rates.payer_id is BIGINT while the rest of the pipeline uses VARCHAR. Always CAST(hr.payer_id AS VARCHAR) when joining hospital data to any CLD table.
Payer MRF
Payer-reported negotiated rates from Transparency in Coverage machine-readable files. Standard rates go through one path; Physician Group (PG) rates take a separate 3-step path.
Processing steps:
- Source table:
tq_production.public_{yyyy_mm}.core_rates— rates keyed by provider NPI, billing code,payer_data_network - NPI →
provider_id: UNNESTprovider_spine.npiarray, then joincore_rates.provider_npi— resolves NPI to CLDprovider_id - Bill type logic: Determine Inpatient/Outpatient/Professional from
billing_code_type+billing_class+ drug/lab flags - Normalization: Percentage rates < 1.0 multiplied by 100 (likely decimal form). Modifier counting for dedup priority.
- Code crosswalks: APR-DRG → MS-DRG (averaged across SOI levels), APC → HCPCS (OPPS Addendum B), EAPG → HCPCS. Originals preserved in arrays.
- Output: Filtered, normalized rates matched to ROIDs →
tmp_raw_mrf_payer_rates(6 payer chunks → union)
Handling Multiple Rates per ROID
The same (payer, network, provider, bill_type, billing_code, billing_code_type) can appear multiple times in core_rates — different modifiers, files, or negotiated_types. The rates_ordered CTE deduplicates to one row per negotiated_type using ROW_NUMBER() WHERE rank = 1:
| Priority | Column | Direction | Reason |
|---|---|---|---|
| 1 | drug_billing_class | ASC (lower first) | Prefer Outpatient over Professional for drug codes |
| 2 | lab_billing_class | DESC (higher first) | Prefer Professional over Outpatient for lab codes |
| 3 | modifier_count | DESC (more first) | More specific modifiers take priority; modifier_count is -1 for '26' (professional component), 1e9 for 'TC' or no modifier (global) |
| 4 | negotiated_rate | DESC (higher first) | Tiebreaker — prefer the higher rate among otherwise equal rows |
PARTITION key: (payer_id, network_id, provider_id, negotiated_type, bill_type, billing_code, billing_code_type). Dedup is within each negotiated_type — so a ROID may still have separate rows for negotiated, percentage, fee schedule, etc., which become separate columns in tmp_int_combined_raw.
NPI array join: Provider spine NPI array is UNNESTed, then joined on equality to core_rates.provider_npi. One NPI can resolve to multiple providers (rare but possible — handled by dedup ranking).
Capitation: Records with negotiation_arrangement = 'capitation' are filtered out — capitated rates are per-member-per-month, not comparable to fee-for-service.
Percentage normalization: Some payers report 1.25 meaning 125%. The pipeline multiplies by 100 when rate < 1.0 and negotiated_type = 'percentage'.
APR-DRG averaging: When crosswalking APR-DRG → MS-DRG, rates are AVERAGED across severity of illness levels. This introduces a known upward bias compared to native MS-DRG rates.
PG Rates — 3-Step Path
PG rates in payer MRFs are structured differently: keyed by group EIN or NPI rather than individual provider NPI, and stored in a separate table (core_rates_physician_groups). They flow through three tasks:
build_pg_rates_prebuild()— Resolves group EIN/NPI → CLDprovider_id. Normalizes modifiers, derivespos_designation(Facility/Non-Facility/Global), deduplicates by ranking. Also handles a supplemental path for PG rates posted under hospitalprovider_ids in regularcore_rates. →tmp_raw_pg_rates_prebuildbuild_payer_pg_mrf()— Converts prebuild output to common columns schema. Joins torate_object_space(Professionalbill_typeonly). Deduplicates by jaccard →provider_npi_count→negotiated_rate. →tmp_raw_mrf_payer_rates(PG chunk)apply_custom_pg_logic()— Applies payer-specific business rules for PGs that report rates by specialty rather than NPI, or use non-standard code sets. → Merged into payer MRF union
Provider ID resolution:
core_rates_physician_groups identifies providers by provider_group_id + provider_group_id_type (either 'ein' or 'npi'). The prebuild resolves these to CLD provider_id via two LEFT JOINs, with EIN taking priority:
| Strategy | Join | Notes |
|---|---|---|
| EIN-based | provider_group_id = provider_spine.ein (UNNESTed) WHERE provider_group_id_type = 'ein' | Jaccard joined from SPINES_PROVIDER_PHYSICIAN_GROUPS_ASSOCIATED_TINS |
| NPI-based | provider_group_id = provider_spine.npi (UNNESTed) WHERE provider_group_id_type = 'npi' | Falls back to this if EIN match is NULL; COALESCE(ein_provider_id, npi_provider_id) |
Records where both joins return NULL are dropped.
pos_designation (Facility vs. Non-Facility):
service_code contains | pos_designation | Meaning |
|---|---|---|
'11' only | Non-Facility | Office/clinic setting |
'21', '22', or '23' | Facility | Hospital/inpatient setting |
Both '11' and '21'/'22' | Global | All settings — use as fallback for both |
| None / other | Global | Default when setting is ambiguous |
'11' and payer_id = 643 (UHC) | Global | UHC special case — treats office-only codes as global |
Supplemental path — PG rates under hospital provider IDs:
Some payers post professional rates in regular core_rates keyed by a hospital's provider_id rather than in core_rates_physician_groups. For PGs that are singly affiliated with one health system, the prebuild remaps these:
- Identify PGs that belong to exactly one health system (
HAVING COUNT(DISTINCT health_system_id) = 1) - Join those PGs to their affiliated hospital's
provider_idviaSPINES_PROVIDER_HOSPITAL - Pull
core_ratesrecords wherebilling_class = 'professional'posted under the hospital'sprovider_id - Remap to the PG's
provider_id, assignjaccard = 1.0(highest confidence),pos_designation = 'Global'
Why singly-affiliated only? If a PG works with multiple health systems, remapping hospital rates to the PG would be ambiguous. Only 1:1 affiliations are safe to remap.
Hospital MRF
Hospital-reported rates from price transparency files. Requires the Plan Bridge to resolve network_id, and NDC preprocessing to normalize drug codes.
The Network ID Problem
Hospital MRF data has raw plan_name ("Blue Cross PPO", "Aetna HMO Network 1") but no standardized network_id. Without the Plan Bridge, these rates cannot join the Rate Object Space.
Processing sequence:
- NDC preprocessing (once) — Before any hospital chunk runs: normalize NDC format variants, map to HCPCS via Medispan drug reference. →
tmp_ndc_derived_hcpcs - Plan Bridge — Built before hospital chunks: maps
hospital_rates.id→network_name→network_id→tmp_ref_plan_bridge - Hospital MRF build — JOIN
hospital_ratesONplan_bridge.id→network_name, JOINnetwork_spineONnetwork_name→network_id, match to ROS. Three provider chunks in parallel. →tmp_raw_mrf_hospital_rates
Contract Methodologies
| Methodology | Description | Rate Column | Transformation Needed? |
|---|---|---|---|
| Fee Schedule | Fixed dollar amount per service | hospital_fee_schedule_dollar | No (already dollar) |
| Case Rate | Fixed dollar per episode/stay | hospital_case_rate_dollar | No (already dollar) |
| Percent of Total Billed | % of gross charges | hospital_pct_of_total_billed_charges_pct | Yes → pct-to-dollar |
| Per Diem | Dollar per day of stay | hospital_per_diem_rate | Yes → per-diem-to-case |
Handling Multiple Rates per ROID
A hospital can report the same (provider, network, billing_code) under multiple contract methodologies. The rates_ranked CTE deduplicates to one row per contract_methodology using ROW_NUMBER() WHERE rank = 1:
| Priority | Column | Direction | Reason |
|---|---|---|---|
| 1 | drug_billing_class | ASC (lower first) | Prefer Outpatient over Professional for drug codes |
| 2 | no modifiers + no revenue_code | -1 first | Prefer the cleanest record — no billing_code_modifiers and no revenue_code gets rank -1 (best) |
| 3 | negotiated_rate | DESC (higher first) | Tiebreaker among otherwise equal rows |
PARTITION key: (payer_id, network_id, provider_id, bill_type, billing_code, billing_code_type, contract_methodology). Because dedup is within each methodology, a single ROID can still have rates under multiple methodologies — each becomes its own column family in tmp_int_combined_raw.
negotiated_rate — ordering proxy, not the final rate:
The negotiated_rate column used for ORDER BY is a proxy:
CASE
WHEN negotiated_dollar IS NOT NULL THEN negotiated_dollar
WHEN negotiated_percentage IS NOT NULL THEN negotiated_percentage
WHEN (estimated_allowed_amount > 0
AND estimated_allowed_amount < 10000000) IS NOT NULL
THEN estimated_allowed_amount
ELSE NULL
END AS negotiated_rate
This is a mixed-unit field — its only job is tiebreaking within the same methodology group. The actual rate columns (negotiated_dollar, negotiated_percentage, estimated_allowed_amount) are preserved separately.
payer_id bigint: hospital_rates.payer_id is BIGINT. Every join must use CAST(hr.payer_id AS VARCHAR) = pb.payer_id.
Billing class routing: billing_class = 'Professional' records go to the PG plan bridge, NOT the HOSP bridge. Only the PG bridge handles professional services from hospital_rates.
Setting field: setting = 'Inpatient' vs 'Outpatient' determines bill type. But some hospitals incorrectly classify ALL HCPCS as Inpatient — the pipeline detects these via the hcpcs_exceptions logic.
Plan Bridge
Map raw hospital plan names to standardized network IDs using keyword matching, exclusion filters, and confidence scoring.
Build sequence:
- Prepare base — Filter
hospital_ratesby billing code type, deduplicate viaROW_NUMBER()to prevent array explosion →tmp_ref_plan_bridge_base - Keyword matching + confidence scoring — Standardize plan names (uppercase, remove spaces), match against PPO/HMO/Exchange keywords per payer, apply exclusion filters, compute confidence signals →
tmp_ref_plan_bridge_analysis - Build final mapping — Filter to
is_mapped=True(keyword match OR single plan OR low variability), extract IDs (prefer keyword-matched), UNNEST to one row perhospital_rates.id→tmp_ref_plan_bridge(id→network_name) - Hospital MRF build — JOIN
hospital_ratesONplan_bridge.id, JOINnetwork_spineONnetwork_name→network_idresolved →tmp_raw_mrf_hospital_rates
Keyword Matching
Plan names are standardized (uppercase, spaces removed), then matched against regex keyword lists per network type:
PPO keywords:
- General:
ALLCOMMERCIAL,COMMERCIAL,BROAD,ALLPLAN,ALLPRODUCT - Network type:
PPO,PREFERRED - Payer-specific: Aetna→
COMM|AETNA, BCBS→PREFERRED|ACCESS|CHOICE, Cigna→OAP|OPENACCESSPLUS, UHC→CHOICE|CORE|ALLPAYERAPPENDIX
HMO keywords:
- General:
ALLCOMMERCIAL,COMMERCIAL,BROAD - Network type:
HMO,HEALTHMAINTENANCE
Exclusion filters:
- General:
OUTOF,ADVANTAGE,DUAL,INDEMNITY,OON,MEDICARE,MEDICAID,BEHAV,INTERNATIONAL,FEDERAL,MULTIPLAN,PHCS,GOVERNMENT,INDIVIDUAL - Payer+State: (BCBS, NY)→
BLUECARD|BCBSWNY; (UHC, NATIONAL)→HARP|STARPLUS|MEDIADVAN
Confidence Scoring
For each (provider, payer, billing_code_type, network_type) group, the plan bridge computes signals to assess whether it can be reliably mapped:
| Signal | Meaning |
|---|---|
is_single_plan | Only one distinct plan_name in the group — unambiguous mapping |
is_single_rate | Only one distinct rate value — all plans have the same rate, safe to map |
low_rate_variability | Rate spread < 10% — rates are similar enough to treat as one network |
has_network_plan | At least one plan_name matched a keyword regex — confident network identification |
A group is mapped (is_mapped = true) if ANY of these are true: is_single_plan, is_single_rate, low_rate_variability, or has_network_plan.
ID selection: When has_network_plan = true, only keyword-matched IDs (ids_with_network_plan) are used. Otherwise all IDs in the group are used. This prevents unrelated plans from contaminating the mapping.
Three Plan Bridge Types
| Aspect | HOSP Bridge | RC Bridge | PG Bridge |
|---|---|---|---|
| Billing Codes | DRG, HCPCS, APC, EAPG, NDC | Revenue codes only | HCPCS (professional) |
| Revenue Code | IS NULL | IS NOT NULL | IS NULL |
| Billing Class | All (except prof unless drug) | All | Professional ONLY |
| Modifiers | IS NULL | Any | IS NULL |
| Use Case | Facility charges | Institutional pricing | Physician services |
Multi-Network Payer Resolution
Some payers operate multiple networks in the same state. Example: Kaiser (payer_id=522) has "Kaiser Northern California", "Kaiser Southern California", and "Kaiser Central California" — all PPO in CA. The plan bridge handles this by:
- Multi-network payers are explicitly listed with state→network mappings
- Creates one record per specific network (not one generic PPO)
- Provider state assigns the correct network name
- Plan name keywords disambiguate within a state
Gross Charges
Hospital-reported and claims-based gross charges at provider, CBSA, and state levels. Used for pct-to-dollar transformations.
6 gross charge sources:
| Priority | Source | Column Suffix | Specificity |
|---|---|---|---|
| 1 | MRF Provider | gc_hosp | Provider-specific (best) |
| 2 | Komodo Provider | gc_komodo | Provider-specific (claims) |
| 3 | MRF CBSA Median | gc_hosp_cbsa | Metro area |
| 4 | Komodo CBSA Median | gc_komodo_cbsa | Metro area (claims) |
| 5 | MRF State Median | gc_hosp_state | State-wide |
| 6 | Komodo State Median | gc_komodo_state | State-wide (claims) |
Why 6 sources? Not all providers have MRF gross charges. Claims data fills gaps. CBSA/state medians cover providers without any provider-level data. The accuracy scoring system evaluates transforms from EACH source independently — provider-level transforms score higher than geographic ones.
Infinity values: Before computing APPROX_PERCENTILE for medians, Infinity values must be filtered. Some hospital MRF files contain corrupted gross charges.
State chunking: Gross charges are chunked by state (7 chunks) for parallel processing. The union step reassembles all states.
Komodo Claims Data
Real-world claims data providing allowed amounts, utilization counts, and GLOS estimates at multiple geographic levels.
Data types:
- Allowed amounts — What insurers actually paid per claim. Available at NPI×Payer, ZIP3×Payer, State, and National levels. →
claims_benchmarks_allowables_* - Utilization — How often codes are billed: total claims, unique patients per provider/code. At NPI, State, National levels. →
claims_benchmarks_utilization_* - GLOS Estimates — Geometric mean length of stay from claims data. Used for per-diem transformation validation. →
claims_benchmarks_glos_*
Fallback hierarchy:
| Level | Specificity | Use Case |
|---|---|---|
| NPI × Payer | Most specific | Provider-specific allowed amounts per payer channel |
| ZIP3 × Payer | Geographic | Local area benchmarks by payer type |
| State | Broad | State-wide benchmarks |
| National | Last resort | National average when no local data |
This NPI → ZIP3 → State → National fallback pattern is used throughout the pipeline for any claims-derived signal.
Combined Raw
All four raw sources are LEFT JOINed onto the Rate Object Space into one wide table. Most ROIDs have NULL rates here — they become imputation candidates in the next phases.
LEFT JOIN Semantics
tmp_int_combined_raw starts from the Rate Object Space (all ROIDs) and LEFT JOINs each source. This means:
- Every ROID appears — even those with no rate from any source.
- Source-specific columns are NULL for ROIDs not covered by that source.
- The row count equals the ROS row count (not the number of rates found).
This LEFT JOIN design is what allows downstream phases (Transformations, Imputations, Accuracy) to join by ROID and know they are covering every pricing question, not just the ones with raw data.
Wide Table Structure
The table is wide: each source contributes separate columns by method, so no source overwrites another. Partitioned by payer_id, network_type, bill_type.
| Column group | Examples | From |
|---|---|---|
Payer MRF by negotiated_type | payer_negotiated_rate, payer_fee_schedule_rate, payer_percentage_rate, payer_per_diem_rate, payer_derived_rate | tmp_raw_mrf_payer_rates |
Hospital MRF by contract_methodology × amount type | hospital_fee_schedule_dollar, hospital_pct_of_total_billed_charges_pct, hospital_per_diem_rate, hospital_case_rate_dollar, hospital_*_allowed_amount | tmp_raw_mrf_hospital_rates |
| Gross charges (6 levels) | mrf_gross_charge_provider, mrf_gross_charge_cbsa_median, mrf_gross_charge_state_median, komodo_gross_charge_provider, komodo_gross_charge_cbsa_median, komodo_gross_charge_state_median | tmp_raw_gross_charges |
| Komodo benchmarks | komodo_allowed_npi, komodo_allowed_zip3, komodo_allowed_state, komodo_allowed_national, komodo_utilization_npi | tmp_raw_komodo_allowables |
What happens next: Transformations converts percentage/per-diem columns into dollar values. Benchmarks compute Medicare reference rates. Imputations fill ROIDs where all payer and hospital columns are NULL. Accuracy scoring evaluates each non-NULL value. Rate Selection picks one canonical rate per ROID from all these layers.