Skip to main content
Version: 3.0

Raw Data Ingestion

Four data sources are ingested, mapped onto the Rate Object Space, and merged into a single combined raw table. "Raw" means: mapped to a ROID but not yet transformed, benchmarked, or imputed.

Overview

The Four Data Sources

SourceWhat it containsCoverageKey challenge
Payer MRFNegotiated rates filed by insurers (Transparency in Coverage). Keyed by provider NPI and payer_data_network.All payer-network-provider combinations the payer reportsPG rates use group NPI or EIN → individual provider_id mapping; code crosswalks needed (APR-DRG, APC, EAPG)
Hospital MRFRates published by hospitals (Hospital Price Transparency), by plan name. Includes fee schedule, case rate, % billed charges, per diem.Hospital-specific rates for all hospitals with public filesPlan names must be mapped to CLD network_ids via the Plan Bridge — the most complex step in raw ingestion
Komodo allowablesReal-world allowed amounts from claims data at NPI × payer, ZIP3, state, and national levels.Where CLD has licensed Komodo data; varies by geography/payerClaims-based (what was actually paid), not the contracted rate
Gross chargesHospital's undiscounted list price from chargemasters. MRF-reported and Komodo-derived, at provider/CBSA/state levels (6 fallbacks).All hospitals with public chargemastersUsed for pct-to-dollar transformations and as upper-bound benchmarks. Infinity values must be filtered before percentile calculations.

Ingestion sequence:

  1. Payer MRF (6 payer chunks)core_rates → NPI→provider_id → bill type logic → code crosswalks (APR-DRG, APC, EAPG) → tmp_raw_mrf_payer_rates. PG rates take a separate 3-step path: prebuild → pg_mrf → custom_pg_logic.
  2. Hospital MRF (3 provider chunks) — Plan Bridge built first (keyword matching, confidence scoring) → hospital_rates joined to plan bridge → network_id resolved → tmp_raw_mrf_hospital_rates. NDC drug codes crosswalked to HCPCS before hospital build.
  3. Supporting sources — Gross charges (7 state chunks) + Komodo allowables + device gross charges run in parallel.
  4. NDC preprocessing (once)tmp_ndc_derived_hcpcs: normalizes NDC formats, maps to HCPCS via Medispan. Runs before hospital pipeline begins.
  5. Combined raw — All four sources LEFT JOINed onto the Rate Object Space → tmp_int_combined_raw (partitioned by payer_id, network_type, bill_type)
  6. Validations + QA — Row count checks, uniqueness tests, non-blocking QA trigger fires to core_licensable_data_qa DAG.

What "Raw" Means

A rate in tmp_int_combined_raw is raw in the following sense: it has been mapped to a ROID (payer × provider × code × network), but it has not been:

  • Transformed (percentage → dollar, per diem → case rate, etc.)
  • Benchmarked against Medicare or market references
  • Imputed for ROIDs with no direct rate
  • Scored for accuracy or outlier status

The combined raw table is wide: it has separate columns for each source and contract methodology (e.g., payer_negotiated_rate, hospital_fee_schedule_dollar, hospital_pct_of_total_billed_charges_pct, hospital_per_diem_rate, komodo_allowed_npi, plus 6 gross charge columns). Most of these are NULL for any given ROID.

Chunking Pattern

The same chunk-then-union pattern repeats across all sources:

  1. Compute chunks: get_payer_chunks() or get_ros_payer_chunks() queries the ROS to group payers/providers into N buckets based on row counts.
  2. Parallel execution: Each chunk runs as a separate Airflow task via .expand() (dynamic task mapping). Typical counts: ~6 payer chunks for payer MRF, 3 provider chunks for hospital MRF, 7 state chunks for gross charges.
  3. Union: A union task consolidates chunks into the final table.
warning

Key gotcha: hospital_rates.payer_id is BIGINT while the rest of the pipeline uses VARCHAR. Always CAST(hr.payer_id AS VARCHAR) when joining hospital data to any CLD table.


Payer MRF

Payer-reported negotiated rates from Transparency in Coverage machine-readable files. Standard rates go through one path; Physician Group (PG) rates take a separate 3-step path.

Processing steps:

  1. Source table: tq_production.public_{yyyy_mm}.core_rates — rates keyed by provider NPI, billing code, payer_data_network
  2. NPI → provider_id: UNNEST provider_spine.npi array, then join core_rates.provider_npi — resolves NPI to CLD provider_id
  3. Bill type logic: Determine Inpatient/Outpatient/Professional from billing_code_type + billing_class + drug/lab flags
  4. Normalization: Percentage rates < 1.0 multiplied by 100 (likely decimal form). Modifier counting for dedup priority.
  5. Code crosswalks: APR-DRG → MS-DRG (averaged across SOI levels), APC → HCPCS (OPPS Addendum B), EAPG → HCPCS. Originals preserved in arrays.
  6. Output: Filtered, normalized rates matched to ROIDs → tmp_raw_mrf_payer_rates (6 payer chunks → union)

Handling Multiple Rates per ROID

The same (payer, network, provider, bill_type, billing_code, billing_code_type) can appear multiple times in core_rates — different modifiers, files, or negotiated_types. The rates_ordered CTE deduplicates to one row per negotiated_type using ROW_NUMBER() WHERE rank = 1:

PriorityColumnDirectionReason
1drug_billing_classASC (lower first)Prefer Outpatient over Professional for drug codes
2lab_billing_classDESC (higher first)Prefer Professional over Outpatient for lab codes
3modifier_countDESC (more first)More specific modifiers take priority; modifier_count is -1 for '26' (professional component), 1e9 for 'TC' or no modifier (global)
4negotiated_rateDESC (higher first)Tiebreaker — prefer the higher rate among otherwise equal rows

PARTITION key: (payer_id, network_id, provider_id, negotiated_type, bill_type, billing_code, billing_code_type). Dedup is within each negotiated_type — so a ROID may still have separate rows for negotiated, percentage, fee schedule, etc., which become separate columns in tmp_int_combined_raw.

warning

NPI array join: Provider spine NPI array is UNNESTed, then joined on equality to core_rates.provider_npi. One NPI can resolve to multiple providers (rare but possible — handled by dedup ranking).

warning

Capitation: Records with negotiation_arrangement = 'capitation' are filtered out — capitated rates are per-member-per-month, not comparable to fee-for-service.

warning

Percentage normalization: Some payers report 1.25 meaning 125%. The pipeline multiplies by 100 when rate < 1.0 and negotiated_type = 'percentage'.

warning

APR-DRG averaging: When crosswalking APR-DRG → MS-DRG, rates are AVERAGED across severity of illness levels. This introduces a known upward bias compared to native MS-DRG rates.

PG Rates — 3-Step Path

PG rates in payer MRFs are structured differently: keyed by group EIN or NPI rather than individual provider NPI, and stored in a separate table (core_rates_physician_groups). They flow through three tasks:

  1. build_pg_rates_prebuild() — Resolves group EIN/NPI → CLD provider_id. Normalizes modifiers, derives pos_designation (Facility/Non-Facility/Global), deduplicates by ranking. Also handles a supplemental path for PG rates posted under hospital provider_ids in regular core_rates. → tmp_raw_pg_rates_prebuild
  2. build_payer_pg_mrf() — Converts prebuild output to common columns schema. Joins to rate_object_space (Professional bill_type only). Deduplicates by jaccard → provider_npi_countnegotiated_rate. → tmp_raw_mrf_payer_rates (PG chunk)
  3. apply_custom_pg_logic() — Applies payer-specific business rules for PGs that report rates by specialty rather than NPI, or use non-standard code sets. → Merged into payer MRF union

Provider ID resolution:

core_rates_physician_groups identifies providers by provider_group_id + provider_group_id_type (either 'ein' or 'npi'). The prebuild resolves these to CLD provider_id via two LEFT JOINs, with EIN taking priority:

StrategyJoinNotes
EIN-basedprovider_group_id = provider_spine.ein (UNNESTed) WHERE provider_group_id_type = 'ein'Jaccard joined from SPINES_PROVIDER_PHYSICIAN_GROUPS_ASSOCIATED_TINS
NPI-basedprovider_group_id = provider_spine.npi (UNNESTed) WHERE provider_group_id_type = 'npi'Falls back to this if EIN match is NULL; COALESCE(ein_provider_id, npi_provider_id)

Records where both joins return NULL are dropped.

pos_designation (Facility vs. Non-Facility):

service_code containspos_designationMeaning
'11' onlyNon-FacilityOffice/clinic setting
'21', '22', or '23'FacilityHospital/inpatient setting
Both '11' and '21'/'22'GlobalAll settings — use as fallback for both
None / otherGlobalDefault when setting is ambiguous
'11' and payer_id = 643 (UHC)GlobalUHC special case — treats office-only codes as global

Supplemental path — PG rates under hospital provider IDs:

Some payers post professional rates in regular core_rates keyed by a hospital's provider_id rather than in core_rates_physician_groups. For PGs that are singly affiliated with one health system, the prebuild remaps these:

  1. Identify PGs that belong to exactly one health system (HAVING COUNT(DISTINCT health_system_id) = 1)
  2. Join those PGs to their affiliated hospital's provider_id via SPINES_PROVIDER_HOSPITAL
  3. Pull core_rates records where billing_class = 'professional' posted under the hospital's provider_id
  4. Remap to the PG's provider_id, assign jaccard = 1.0 (highest confidence), pos_designation = 'Global'
warning

Why singly-affiliated only? If a PG works with multiple health systems, remapping hospital rates to the PG would be ambiguous. Only 1:1 affiliations are safe to remap.


Hospital MRF

Hospital-reported rates from price transparency files. Requires the Plan Bridge to resolve network_id, and NDC preprocessing to normalize drug codes.

The Network ID Problem

Hospital MRF data has raw plan_name ("Blue Cross PPO", "Aetna HMO Network 1") but no standardized network_id. Without the Plan Bridge, these rates cannot join the Rate Object Space.

Processing sequence:

  1. NDC preprocessing (once) — Before any hospital chunk runs: normalize NDC format variants, map to HCPCS via Medispan drug reference. → tmp_ndc_derived_hcpcs
  2. Plan Bridge — Built before hospital chunks: maps hospital_rates.idnetwork_namenetwork_idtmp_ref_plan_bridge
  3. Hospital MRF build — JOIN hospital_rates ON plan_bridge.idnetwork_name, JOIN network_spine ON network_namenetwork_id, match to ROS. Three provider chunks in parallel. → tmp_raw_mrf_hospital_rates

Contract Methodologies

MethodologyDescriptionRate ColumnTransformation Needed?
Fee ScheduleFixed dollar amount per servicehospital_fee_schedule_dollarNo (already dollar)
Case RateFixed dollar per episode/stayhospital_case_rate_dollarNo (already dollar)
Percent of Total Billed% of gross chargeshospital_pct_of_total_billed_charges_pctYes → pct-to-dollar
Per DiemDollar per day of stayhospital_per_diem_rateYes → per-diem-to-case

Handling Multiple Rates per ROID

A hospital can report the same (provider, network, billing_code) under multiple contract methodologies. The rates_ranked CTE deduplicates to one row per contract_methodology using ROW_NUMBER() WHERE rank = 1:

PriorityColumnDirectionReason
1drug_billing_classASC (lower first)Prefer Outpatient over Professional for drug codes
2no modifiers + no revenue_code-1 firstPrefer the cleanest record — no billing_code_modifiers and no revenue_code gets rank -1 (best)
3negotiated_rateDESC (higher first)Tiebreaker among otherwise equal rows

PARTITION key: (payer_id, network_id, provider_id, bill_type, billing_code, billing_code_type, contract_methodology). Because dedup is within each methodology, a single ROID can still have rates under multiple methodologies — each becomes its own column family in tmp_int_combined_raw.

negotiated_rate — ordering proxy, not the final rate:

The negotiated_rate column used for ORDER BY is a proxy:

CASE
WHEN negotiated_dollar IS NOT NULL THEN negotiated_dollar
WHEN negotiated_percentage IS NOT NULL THEN negotiated_percentage
WHEN (estimated_allowed_amount > 0
AND estimated_allowed_amount < 10000000) IS NOT NULL
THEN estimated_allowed_amount
ELSE NULL
END AS negotiated_rate

This is a mixed-unit field — its only job is tiebreaking within the same methodology group. The actual rate columns (negotiated_dollar, negotiated_percentage, estimated_allowed_amount) are preserved separately.

warning

payer_id bigint: hospital_rates.payer_id is BIGINT. Every join must use CAST(hr.payer_id AS VARCHAR) = pb.payer_id.

warning

Billing class routing: billing_class = 'Professional' records go to the PG plan bridge, NOT the HOSP bridge. Only the PG bridge handles professional services from hospital_rates.

warning

Setting field: setting = 'Inpatient' vs 'Outpatient' determines bill type. But some hospitals incorrectly classify ALL HCPCS as Inpatient — the pipeline detects these via the hcpcs_exceptions logic.


Plan Bridge

Map raw hospital plan names to standardized network IDs using keyword matching, exclusion filters, and confidence scoring.

Build sequence:

  1. Prepare base — Filter hospital_rates by billing code type, deduplicate via ROW_NUMBER() to prevent array explosion → tmp_ref_plan_bridge_base
  2. Keyword matching + confidence scoring — Standardize plan names (uppercase, remove spaces), match against PPO/HMO/Exchange keywords per payer, apply exclusion filters, compute confidence signals → tmp_ref_plan_bridge_analysis
  3. Build final mapping — Filter to is_mapped=True (keyword match OR single plan OR low variability), extract IDs (prefer keyword-matched), UNNEST to one row per hospital_rates.idtmp_ref_plan_bridge (idnetwork_name)
  4. Hospital MRF build — JOIN hospital_rates ON plan_bridge.id, JOIN network_spine ON network_namenetwork_id resolved → tmp_raw_mrf_hospital_rates

Keyword Matching

Plan names are standardized (uppercase, spaces removed), then matched against regex keyword lists per network type:

PPO keywords:

  • General: ALLCOMMERCIAL, COMMERCIAL, BROAD, ALLPLAN, ALLPRODUCT
  • Network type: PPO, PREFERRED
  • Payer-specific: Aetna→COMM|AETNA, BCBS→PREFERRED|ACCESS|CHOICE, Cigna→OAP|OPENACCESSPLUS, UHC→CHOICE|CORE|ALLPAYERAPPENDIX

HMO keywords:

  • General: ALLCOMMERCIAL, COMMERCIAL, BROAD
  • Network type: HMO, HEALTHMAINTENANCE

Exclusion filters:

  • General: OUTOF, ADVANTAGE, DUAL, INDEMNITY, OON, MEDICARE, MEDICAID, BEHAV, INTERNATIONAL, FEDERAL, MULTIPLAN, PHCS, GOVERNMENT, INDIVIDUAL
  • Payer+State: (BCBS, NY)→BLUECARD|BCBSWNY; (UHC, NATIONAL)→HARP|STARPLUS|MEDIADVAN

Confidence Scoring

For each (provider, payer, billing_code_type, network_type) group, the plan bridge computes signals to assess whether it can be reliably mapped:

SignalMeaning
is_single_planOnly one distinct plan_name in the group — unambiguous mapping
is_single_rateOnly one distinct rate value — all plans have the same rate, safe to map
low_rate_variabilityRate spread < 10% — rates are similar enough to treat as one network
has_network_planAt least one plan_name matched a keyword regex — confident network identification

A group is mapped (is_mapped = true) if ANY of these are true: is_single_plan, is_single_rate, low_rate_variability, or has_network_plan.

info

ID selection: When has_network_plan = true, only keyword-matched IDs (ids_with_network_plan) are used. Otherwise all IDs in the group are used. This prevents unrelated plans from contaminating the mapping.

Three Plan Bridge Types

AspectHOSP BridgeRC BridgePG Bridge
Billing CodesDRG, HCPCS, APC, EAPG, NDCRevenue codes onlyHCPCS (professional)
Revenue CodeIS NULLIS NOT NULLIS NULL
Billing ClassAll (except prof unless drug)AllProfessional ONLY
ModifiersIS NULLAnyIS NULL
Use CaseFacility chargesInstitutional pricingPhysician services

Multi-Network Payer Resolution

Some payers operate multiple networks in the same state. Example: Kaiser (payer_id=522) has "Kaiser Northern California", "Kaiser Southern California", and "Kaiser Central California" — all PPO in CA. The plan bridge handles this by:

  1. Multi-network payers are explicitly listed with state→network mappings
  2. Creates one record per specific network (not one generic PPO)
  3. Provider state assigns the correct network name
  4. Plan name keywords disambiguate within a state

Gross Charges

Hospital-reported and claims-based gross charges at provider, CBSA, and state levels. Used for pct-to-dollar transformations.

6 gross charge sources:

PrioritySourceColumn SuffixSpecificity
1MRF Providergc_hospProvider-specific (best)
2Komodo Providergc_komodoProvider-specific (claims)
3MRF CBSA Mediangc_hosp_cbsaMetro area
4Komodo CBSA Mediangc_komodo_cbsaMetro area (claims)
5MRF State Mediangc_hosp_stateState-wide
6Komodo State Mediangc_komodo_stateState-wide (claims)
info

Why 6 sources? Not all providers have MRF gross charges. Claims data fills gaps. CBSA/state medians cover providers without any provider-level data. The accuracy scoring system evaluates transforms from EACH source independently — provider-level transforms score higher than geographic ones.

warning

Infinity values: Before computing APPROX_PERCENTILE for medians, Infinity values must be filtered. Some hospital MRF files contain corrupted gross charges.

warning

State chunking: Gross charges are chunked by state (7 chunks) for parallel processing. The union step reassembles all states.


Komodo Claims Data

Real-world claims data providing allowed amounts, utilization counts, and GLOS estimates at multiple geographic levels.

Data types:

  1. Allowed amounts — What insurers actually paid per claim. Available at NPI×Payer, ZIP3×Payer, State, and National levels. → claims_benchmarks_allowables_*
  2. Utilization — How often codes are billed: total claims, unique patients per provider/code. At NPI, State, National levels. → claims_benchmarks_utilization_*
  3. GLOS Estimates — Geometric mean length of stay from claims data. Used for per-diem transformation validation. → claims_benchmarks_glos_*

Fallback hierarchy:

LevelSpecificityUse Case
NPI × PayerMost specificProvider-specific allowed amounts per payer channel
ZIP3 × PayerGeographicLocal area benchmarks by payer type
StateBroadState-wide benchmarks
NationalLast resortNational average when no local data

This NPI → ZIP3 → State → National fallback pattern is used throughout the pipeline for any claims-derived signal.


Combined Raw

All four raw sources are LEFT JOINed onto the Rate Object Space into one wide table. Most ROIDs have NULL rates here — they become imputation candidates in the next phases.

LEFT JOIN Semantics

tmp_int_combined_raw starts from the Rate Object Space (all ROIDs) and LEFT JOINs each source. This means:

  • Every ROID appears — even those with no rate from any source.
  • Source-specific columns are NULL for ROIDs not covered by that source.
  • The row count equals the ROS row count (not the number of rates found).

This LEFT JOIN design is what allows downstream phases (Transformations, Imputations, Accuracy) to join by ROID and know they are covering every pricing question, not just the ones with raw data.

Wide Table Structure

The table is wide: each source contributes separate columns by method, so no source overwrites another. Partitioned by payer_id, network_type, bill_type.

Column groupExamplesFrom
Payer MRF by negotiated_typepayer_negotiated_rate, payer_fee_schedule_rate, payer_percentage_rate, payer_per_diem_rate, payer_derived_ratetmp_raw_mrf_payer_rates
Hospital MRF by contract_methodology × amount typehospital_fee_schedule_dollar, hospital_pct_of_total_billed_charges_pct, hospital_per_diem_rate, hospital_case_rate_dollar, hospital_*_allowed_amounttmp_raw_mrf_hospital_rates
Gross charges (6 levels)mrf_gross_charge_provider, mrf_gross_charge_cbsa_median, mrf_gross_charge_state_median, komodo_gross_charge_provider, komodo_gross_charge_cbsa_median, komodo_gross_charge_state_mediantmp_raw_gross_charges
Komodo benchmarkskomodo_allowed_npi, komodo_allowed_zip3, komodo_allowed_state, komodo_allowed_national, komodo_utilization_npitmp_raw_komodo_allowables
info

What happens next: Transformations converts percentage/per-diem columns into dollar values. Benchmarks compute Medicare reference rates. Imputations fill ROIDs where all payer and hospital columns are NULL. Accuracy scoring evaluates each non-NULL value. Rate Selection picks one canonical rate per ROID from all these layers.