Version: 3.0

Raw Data Ingestion

Four data sources are ingested, mapped onto the Rate Object Space, and merged into a single combined raw table. "Raw" means: mapped to a ROID but not yet transformed, benchmarked, or imputed.

Overview

The Four Data Sources

Source	What it contains	Coverage	Key challenge
Payer MRF	Negotiated rates filed by insurers (Transparency in Coverage). Keyed by provider NPI and `payer_data_network`.	All payer-network-provider combinations the payer reports	PG rates use group NPI or EIN → individual `provider_id` mapping; code crosswalks needed (APR-DRG, APC, EAPG)
Hospital MRF	Rates published by hospitals (Hospital Price Transparency), by plan name. Includes fee schedule, case rate, % billed charges, per diem.	Hospital-specific rates for all hospitals with public files	Plan names must be mapped to CLD `network_id`s via the Plan Bridge — the most complex step in raw ingestion
Komodo allowables	Real-world allowed amounts from claims data at NPI × payer, ZIP3, state, and national levels.	Where CLD has licensed Komodo data; varies by geography/payer	Claims-based (what was actually paid), not the contracted rate
Gross charges	Hospital's undiscounted list price from chargemasters. MRF-reported and Komodo-derived, at provider/CBSA/state levels (6 fallbacks).	All hospitals with public chargemasters	Used for pct-to-dollar transformations and as upper-bound benchmarks. Infinity values must be filtered before percentile calculations.

Ingestion sequence:

Payer MRF (6 payer chunks) — core_rates → NPI→provider_id → bill type logic → code crosswalks (APR-DRG, APC, EAPG) → tmp_raw_mrf_payer_rates. PG rates take a separate 3-step path: prebuild → pg_mrf → custom_pg_logic.
Hospital MRF (3 provider chunks) — Plan Bridge built first (keyword matching, confidence scoring) → hospital_rates joined to plan bridge → network_id resolved → tmp_raw_mrf_hospital_rates. NDC drug codes crosswalked to HCPCS before hospital build.
Supporting sources — Gross charges (7 state chunks) + Komodo allowables + device gross charges run in parallel.
NDC preprocessing (once) — tmp_ndc_derived_hcpcs: normalizes NDC formats, maps to HCPCS via Medispan. Runs before hospital pipeline begins.
Combined raw — All four sources LEFT JOINed onto the Rate Object Space → tmp_int_combined_raw (partitioned by payer_id, network_type, bill_type)
Validations + QA — Row count checks, uniqueness tests, non-blocking QA trigger fires to core_licensable_data_qa DAG.

What "Raw" Means

A rate in tmp_int_combined_raw is raw in the following sense: it has been mapped to a ROID (payer × provider × code × network), but it has not been:

Transformed (percentage → dollar, per diem → case rate, etc.)
Benchmarked against Medicare or market references
Imputed for ROIDs with no direct rate
Scored for accuracy or outlier status

The combined raw table is wide: it has separate columns for each source and contract methodology (e.g., payer_negotiated_rate, hospital_fee_schedule_dollar, hospital_pct_of_total_billed_charges_pct, hospital_per_diem_rate, komodo_allowed_npi, plus 6 gross charge columns). Most of these are NULL for any given ROID.

Chunking Pattern

The same chunk-then-union pattern repeats across all sources:

Compute chunks: get_payer_chunks() or get_ros_payer_chunks() queries the ROS to group payers/providers into N buckets based on row counts.
Parallel execution: Each chunk runs as a separate Airflow task via .expand() (dynamic task mapping). Typical counts: ~6 payer chunks for payer MRF, 3 provider chunks for hospital MRF, 7 state chunks for gross charges.
Union: A union task consolidates chunks into the final table.

warning

Key gotcha: hospital_rates.payer_id is BIGINT while the rest of the pipeline uses VARCHAR. Always CAST(hr.payer_id AS VARCHAR) when joining hospital data to any CLD table.

Payer MRF

Payer-reported negotiated rates from Transparency in Coverage machine-readable files. Standard rates go through one path; Physician Group (PG) rates take a separate 3-step path.

Processing steps:

Source table: tq_production.public_{yyyy_mm}.core_rates — rates keyed by provider NPI, billing code, payer_data_network
NPI → provider_id: UNNEST provider_spine.npi array, then join core_rates.provider_npi — resolves NPI to CLD provider_id
Bill type logic: Determine Inpatient/Outpatient/Professional from billing_code_type + billing_class + drug/lab flags
Normalization: Percentage rates < 1.0 multiplied by 100 (likely decimal form). Modifier counting for dedup priority.
Code crosswalks: APR-DRG → MS-DRG (averaged across SOI levels), APC → HCPCS (OPPS Addendum B), EAPG → HCPCS. Originals preserved in arrays.
Output: Filtered, normalized rates matched to ROIDs → tmp_raw_mrf_payer_rates (6 payer chunks → union)

Handling Multiple Rates per ROID

The same (payer, network, provider, bill_type, billing_code, billing_code_type) can appear multiple times in core_rates — different modifiers, files, or negotiated_types. The rates_ordered CTE deduplicates to one row per negotiated_type using ROW_NUMBER() WHERE rank = 1:

Priority	Column	Direction	Reason
1	`drug_billing_class`	ASC (lower first)	Prefer Outpatient over Professional for drug codes
2	`lab_billing_class`	DESC (higher first)	Prefer Professional over Outpatient for lab codes
3	`modifier_count`	DESC (more first)	More specific modifiers take priority; `modifier_count` is -1 for '26' (professional component), 1e9 for 'TC' or no modifier (global)
4	`negotiated_rate`	DESC (higher first)	Tiebreaker — prefer the higher rate among otherwise equal rows

PARTITION key: (payer_id, network_id, provider_id, negotiated_type, bill_type, billing_code, billing_code_type). Dedup is within each negotiated_type — so a ROID may still have separate rows for negotiated, percentage, fee schedule, etc., which become separate columns in tmp_int_combined_raw.

warning

NPI array join: Provider spine NPI array is UNNESTed, then joined on equality to core_rates.provider_npi. One NPI can resolve to multiple providers (rare but possible — handled by dedup ranking).

warning

Capitation: Records with negotiation_arrangement = 'capitation' are filtered out — capitated rates are per-member-per-month, not comparable to fee-for-service.

warning

Percentage normalization: Some payers report 1.25 meaning 125%. The pipeline multiplies by 100 when rate < 1.0 and negotiated_type = 'percentage'.

warning

APR-DRG averaging: When crosswalking APR-DRG → MS-DRG, rates are AVERAGED across severity of illness levels. This introduces a known upward bias compared to native MS-DRG rates.

PG Rates — 3-Step Path

PG rates in payer MRFs are structured differently: keyed by group EIN or NPI rather than individual provider NPI, and stored in a separate table (core_rates_physician_groups). They flow through three tasks:

build_pg_rates_prebuild() — Resolves group EIN/NPI → CLD provider_id. Normalizes modifiers, derives pos_designation (Facility/Non-Facility/Global), deduplicates by ranking. Also handles a supplemental path for PG rates posted under hospital provider_ids in regular core_rates. → tmp_raw_pg_rates_prebuild
build_payer_pg_mrf() — Converts prebuild output to common columns schema. Joins to rate_object_space (Professional bill_type only). Deduplicates by jaccard → provider_npi_count → negotiated_rate. → tmp_raw_mrf_payer_rates (PG chunk)
apply_custom_pg_logic() — Applies payer-specific business rules for PGs that report rates by specialty rather than NPI, or use non-standard code sets. → Merged into payer MRF union

Provider ID resolution:

core_rates_physician_groups identifies providers by provider_group_id + provider_group_id_type (either 'ein' or 'npi'). The prebuild resolves these to CLD provider_id via two LEFT JOINs, with EIN taking priority:

Strategy	Join	Notes
EIN-based	`provider_group_id = provider_spine.ein` (UNNESTed) WHERE `provider_group_id_type = 'ein'`	Jaccard joined from `SPINES_PROVIDER_PHYSICIAN_GROUPS_ASSOCIATED_TINS`
NPI-based	`provider_group_id = provider_spine.npi` (UNNESTed) WHERE `provider_group_id_type = 'npi'`	Falls back to this if EIN match is NULL; `COALESCE(ein_provider_id, npi_provider_id)`

Records where both joins return NULL are dropped.

pos_designation (Facility vs. Non-Facility):

`service_code` contains	`pos_designation`	Meaning
`'11'` only	Non-Facility	Office/clinic setting
`'21'`, `'22'`, or `'23'`	Facility	Hospital/inpatient setting
Both `'11'` and `'21'/'22'`	Global	All settings — use as fallback for both
None / other	Global	Default when setting is ambiguous
`'11'` and `payer_id = 643` (UHC)	Global	UHC special case — treats office-only codes as global

Supplemental path — PG rates under hospital provider IDs:

Some payers post professional rates in regular core_rates keyed by a hospital's provider_id rather than in core_rates_physician_groups. For PGs that are singly affiliated with one health system, the prebuild remaps these:

Identify PGs that belong to exactly one health system (HAVING COUNT(DISTINCT health_system_id) = 1)
Join those PGs to their affiliated hospital's provider_id via SPINES_PROVIDER_HOSPITAL
Pull core_rates records where billing_class = 'professional' posted under the hospital's provider_id
Remap to the PG's provider_id, assign jaccard = 1.0 (highest confidence), pos_designation = 'Global'

warning

Why singly-affiliated only? If a PG works with multiple health systems, remapping hospital rates to the PG would be ambiguous. Only 1:1 affiliations are safe to remap.

Hospital MRF

Hospital-reported rates from price transparency files. Requires the Plan Bridge to resolve network_id, and NDC preprocessing to normalize drug codes.

The Network ID Problem

Hospital MRF data has raw plan_name ("Blue Cross PPO", "Aetna HMO Network 1") but no standardized network_id. Without the Plan Bridge, these rates cannot join the Rate Object Space.

Processing sequence:

NDC preprocessing (once) — Before any hospital chunk runs: normalize NDC format variants, map to HCPCS via Medispan drug reference. → tmp_ndc_derived_hcpcs
Plan Bridge — Built before hospital chunks: maps hospital_rates.id → network_name → network_id → tmp_ref_plan_bridge
Hospital MRF build — JOIN hospital_rates ON plan_bridge.id → network_name, JOIN network_spine ON network_name → network_id, match to ROS. Three provider chunks in parallel. → tmp_raw_mrf_hospital_rates

Contract Methodologies

Methodology	Description	Rate Column	Transformation Needed?
Fee Schedule	Fixed dollar amount per service	`hospital_fee_schedule_dollar`	No (already dollar)
Case Rate	Fixed dollar per episode/stay	`hospital_case_rate_dollar`	No (already dollar)
Percent of Total Billed	% of gross charges	`hospital_pct_of_total_billed_charges_pct`	Yes → pct-to-dollar
Per Diem	Dollar per day of stay	`hospital_per_diem_rate`	Yes → per-diem-to-case

Handling Multiple Rates per ROID

A hospital can report the same (provider, network, billing_code) under multiple contract methodologies. The rates_ranked CTE deduplicates to one row per contract_methodology using ROW_NUMBER() WHERE rank = 1:

Priority	Column	Direction	Reason
1	`drug_billing_class`	ASC (lower first)	Prefer Outpatient over Professional for drug codes
2	no modifiers + no `revenue_code`	-1 first	Prefer the cleanest record — no `billing_code_modifiers` and no `revenue_code` gets rank -1 (best)
3	`negotiated_rate`	DESC (higher first)	Tiebreaker among otherwise equal rows

PARTITION key: (payer_id, network_id, provider_id, bill_type, billing_code, billing_code_type, contract_methodology). Because dedup is within each methodology, a single ROID can still have rates under multiple methodologies — each becomes its own column family in tmp_int_combined_raw.

negotiated_rate — ordering proxy, not the final rate:

The negotiated_rate column used for ORDER BY is a proxy:

CASE
    WHEN negotiated_dollar      IS NOT NULL THEN negotiated_dollar
    WHEN negotiated_percentage  IS NOT NULL THEN negotiated_percentage
    WHEN (estimated_allowed_amount > 0
          AND estimated_allowed_amount < 10000000) IS NOT NULL
                                THEN estimated_allowed_amount
    ELSE NULL
END AS negotiated_rate

This is a mixed-unit field — its only job is tiebreaking within the same methodology group. The actual rate columns (negotiated_dollar, negotiated_percentage, estimated_allowed_amount) are preserved separately.

warning

payer_id bigint: hospital_rates.payer_id is BIGINT. Every join must use CAST(hr.payer_id AS VARCHAR) = pb.payer_id.

warning

Billing class routing: billing_class = 'Professional' records go to the PG plan bridge, NOT the HOSP bridge. Only the PG bridge handles professional services from hospital_rates.

warning

Setting field: setting = 'Inpatient' vs 'Outpatient' determines bill type. But some hospitals incorrectly classify ALL HCPCS as Inpatient — the pipeline detects these via the hcpcs_exceptions logic.

Plan Bridge

Map raw hospital plan names to standardized network IDs using keyword matching, exclusion filters, and confidence scoring.

Build sequence:

Prepare base — Filter hospital_rates by billing code type, deduplicate via ROW_NUMBER() to prevent array explosion → tmp_ref_plan_bridge_base
Keyword matching + confidence scoring — Standardize plan names (uppercase, remove spaces), match against PPO/HMO/Exchange keywords per payer, apply exclusion filters, compute confidence signals → tmp_ref_plan_bridge_analysis
Build final mapping — Filter to is_mapped=True (keyword match OR single plan OR low variability), extract IDs (prefer keyword-matched), UNNEST to one row per hospital_rates.id → tmp_ref_plan_bridge (id → network_name)
Hospital MRF build — JOIN hospital_rates ON plan_bridge.id, JOIN network_spine ON network_name → network_id resolved → tmp_raw_mrf_hospital_rates

Keyword Matching

Plan names are standardized (uppercase, spaces removed), then matched against regex keyword lists per network type:

PPO keywords:

General: ALLCOMMERCIAL, COMMERCIAL, BROAD, ALLPLAN, ALLPRODUCT
Network type: PPO, PREFERRED
Payer-specific: Aetna→COMM|AETNA, BCBS→PREFERRED|ACCESS|CHOICE, Cigna→OAP|OPENACCESSPLUS, UHC→CHOICE|CORE|ALLPAYERAPPENDIX

HMO keywords:

General: ALLCOMMERCIAL, COMMERCIAL, BROAD
Network type: HMO, HEALTHMAINTENANCE

Exclusion filters:

General: OUTOF, ADVANTAGE, DUAL, INDEMNITY, OON, MEDICARE, MEDICAID, BEHAV, INTERNATIONAL, FEDERAL, MULTIPLAN, PHCS, GOVERNMENT, INDIVIDUAL
Payer+State: (BCBS, NY)→BLUECARD|BCBSWNY; (UHC, NATIONAL)→HARP|STARPLUS|MEDIADVAN

Confidence Scoring

For each (provider, payer, billing_code_type, network_type) group, the plan bridge computes signals to assess whether it can be reliably mapped:

Signal	Meaning
`is_single_plan`	Only one distinct `plan_name` in the group — unambiguous mapping
`is_single_rate`	Only one distinct rate value — all plans have the same rate, safe to map
`low_rate_variability`	Rate spread < 10% — rates are similar enough to treat as one network
`has_network_plan`	At least one `plan_name` matched a keyword regex — confident network identification

A group is mapped (is_mapped = true) if ANY of these are true: is_single_plan, is_single_rate, low_rate_variability, or has_network_plan.

info

ID selection: When has_network_plan = true, only keyword-matched IDs (ids_with_network_plan) are used. Otherwise all IDs in the group are used. This prevents unrelated plans from contaminating the mapping.

Three Plan Bridge Types

Aspect	HOSP Bridge	RC Bridge	PG Bridge
Billing Codes	DRG, HCPCS, APC, EAPG, NDC	Revenue codes only	HCPCS (professional)
Revenue Code	IS NULL	IS NOT NULL	IS NULL
Billing Class	All (except prof unless drug)	All	Professional ONLY
Modifiers	IS NULL	Any	IS NULL
Use Case	Facility charges	Institutional pricing	Physician services

Multi-Network Payer Resolution

Some payers operate multiple networks in the same state. Example: Kaiser (payer_id=522) has "Kaiser Northern California", "Kaiser Southern California", and "Kaiser Central California" — all PPO in CA. The plan bridge handles this by:

Multi-network payers are explicitly listed with state→network mappings
Creates one record per specific network (not one generic PPO)
Provider state assigns the correct network name
Plan name keywords disambiguate within a state

Gross Charges

Hospital-reported and claims-based gross charges at provider, CBSA, and state levels. Used for pct-to-dollar transformations.

6 gross charge sources:

Priority	Source	Column Suffix	Specificity
1	MRF Provider	`gc_hosp`	Provider-specific (best)
2	Komodo Provider	`gc_komodo`	Provider-specific (claims)
3	MRF CBSA Median	`gc_hosp_cbsa`	Metro area
4	Komodo CBSA Median	`gc_komodo_cbsa`	Metro area (claims)
5	MRF State Median	`gc_hosp_state`	State-wide
6	Komodo State Median	`gc_komodo_state`	State-wide (claims)

info

Why 6 sources? Not all providers have MRF gross charges. Claims data fills gaps. CBSA/state medians cover providers without any provider-level data. The accuracy scoring system evaluates transforms from EACH source independently — provider-level transforms score higher than geographic ones.

warning

Infinity values: Before computing APPROX_PERCENTILE for medians, Infinity values must be filtered. Some hospital MRF files contain corrupted gross charges.

warning

State chunking: Gross charges are chunked by state (7 chunks) for parallel processing. The union step reassembles all states.

Komodo Claims Data

Real-world claims data providing allowed amounts, utilization counts, and GLOS estimates at multiple geographic levels.

Data types:

Allowed amounts — What insurers actually paid per claim. Available at NPI×Payer, ZIP3×Payer, State, and National levels. → claims_benchmarks_allowables_*
Utilization — How often codes are billed: total claims, unique patients per provider/code. At NPI, State, National levels. → claims_benchmarks_utilization_*
GLOS Estimates — Geometric mean length of stay from claims data. Used for per-diem transformation validation. → claims_benchmarks_glos_*

Fallback hierarchy:

Level	Specificity	Use Case
NPI × Payer	Most specific	Provider-specific allowed amounts per payer channel
ZIP3 × Payer	Geographic	Local area benchmarks by payer type
State	Broad	State-wide benchmarks
National	Last resort	National average when no local data

This NPI → ZIP3 → State → National fallback pattern is used throughout the pipeline for any claims-derived signal.

Combined Raw

All four raw sources are LEFT JOINed onto the Rate Object Space into one wide table. Most ROIDs have NULL rates here — they become imputation candidates in the next phases.

LEFT JOIN Semantics

tmp_int_combined_raw starts from the Rate Object Space (all ROIDs) and LEFT JOINs each source. This means:

Every ROID appears — even those with no rate from any source.
Source-specific columns are NULL for ROIDs not covered by that source.
The row count equals the ROS row count (not the number of rates found).

This LEFT JOIN design is what allows downstream phases (Transformations, Imputations, Accuracy) to join by ROID and know they are covering every pricing question, not just the ones with raw data.

Wide Table Structure

The table is wide: each source contributes separate columns by method, so no source overwrites another. Partitioned by payer_id, network_type, bill_type.

Column group	Examples	From
Payer MRF by `negotiated_type`	`payer_negotiated_rate`, `payer_fee_schedule_rate`, `payer_percentage_rate`, `payer_per_diem_rate`, `payer_derived_rate`	`tmp_raw_mrf_payer_rates`
Hospital MRF by `contract_methodology` × amount type	`hospital_fee_schedule_dollar`, `hospital_pct_of_total_billed_charges_pct`, `hospital_per_diem_rate`, `hospital_case_rate_dollar`, `hospital_*_allowed_amount`	`tmp_raw_mrf_hospital_rates`
Gross charges (6 levels)	`mrf_gross_charge_provider`, `mrf_gross_charge_cbsa_median`, `mrf_gross_charge_state_median`, `komodo_gross_charge_provider`, `komodo_gross_charge_cbsa_median`, `komodo_gross_charge_state_median`	`tmp_raw_gross_charges`
Komodo benchmarks	`komodo_allowed_npi`, `komodo_allowed_zip3`, `komodo_allowed_state`, `komodo_allowed_national`, `komodo_utilization_npi`	`tmp_raw_komodo_allowables`

info

What happens next: Transformations converts percentage/per-diem columns into dollar values. Benchmarks compute Medicare reference rates. Imputations fill ROIDs where all payer and hospital columns are NULL. Accuracy scoring evaluates each non-NULL value. Rate Selection picks one canonical rate per ROID from all these layers.

Overview​

The Four Data Sources​

What "Raw" Means​

Chunking Pattern​

Payer MRF​

Handling Multiple Rates per ROID​

PG Rates — 3-Step Path​

Hospital MRF​

The Network ID Problem​

Contract Methodologies​

Handling Multiple Rates per ROID​

Plan Bridge​

Keyword Matching​

Confidence Scoring​

Three Plan Bridge Types​

Multi-Network Payer Resolution​

Gross Charges​

Komodo Claims Data​

Combined Raw​

LEFT JOIN Semantics​

Wide Table Structure​

On this page:

Overview

The Four Data Sources

What "Raw" Means

Chunking Pattern

Payer MRF

Handling Multiple Rates per ROID

PG Rates — 3-Step Path

Hospital MRF

The Network ID Problem

Contract Methodologies

Handling Multiple Rates per ROID

Plan Bridge

Keyword Matching

Confidence Scoring

Three Plan Bridge Types

Multi-Network Payer Resolution

Gross Charges

Komodo Claims Data

Combined Raw

LEFT JOIN Semantics

Wide Table Structure