Version: 3.0

Production Output

The sub-DAG produces one combined table per month. The orchestrator stitches multiple months together, applies cross-month selection, and publishes the versioned production tables.

Overview

Sub-DAG vs. Orchestrator

Every CLD version covers several months of MRF data (typically 3–4 sub-versions). Each month runs as its own sub-DAG. The orchestrator is a separate DAG that triggers all sub-DAGs, waits for them to complete, then merges results.

Sub-DAG (one per month)

Runs the full pipeline — spines → ROS → raw → transforms → imputations → accuracy → rate selection.

Output: tmp_int_combined_no_whisp_{sub_version}

Then computes 8 whisper tables and joins them.

Output: tmp_int_combined_{sub_version}

Orchestrator (one per version)

Triggers N sub-DAGs in parallel. After all complete, merges all tmp_int_combined_{sub_version} tables.

Applies cross-month canonical selection (same score → rate type hierarchy).

Publishes: prod_combined_abridged, prod_combined_all, rollup views, traceability, external API table.

Sub-DAG Output Flow

Canonical Rate Selection (chunked by payer) — Rate arrays → array_max → canonical_rate per ROID. → tmp_int_combined_no_whisp_{sub_version}
Whisper Computation (8 parallel tasks) — Dimensional aggregations at provider, payer, network, code, and cross-dimensional levels. Reference data (scorecards, utilization, NASHP) joined in here. → tmp_whisper_provider, tmp_whisper_payer, tmp_whisper_code, …
Add Whispers to Main — All 8 whisper tables joined onto the no-whisp table. → tmp_int_combined_{sub_version}

Orchestrator Merge Flow

Trigger Sub-DAGs — One sub-DAG per month (sub-version). All run in parallel. Orchestrator waits for all to complete before proceeding. → tmp_int_combined_{sub_version} × N months
Merge Combined Chunks — All monthly tmp_int_combined tables merged and chunked (~35M rows/chunk) for manageability. Cross-month canonical selection: highest score wins; same-score ties use Posted > Real-World > Enhanced > Benchmark. → merged_combined_chunks
prod_combined_abridged — Lean, API-ready subset of columns. Used by external APIs (Clear Rates). Built from merged chunks.
prod_combined_all — Full column set. Used for internal analytics and QA. Built in parallel with abridged.
Rollup Views + Traceability — Pre-aggregated views for fast analytics queries. Traceability tables link canonical rates back to their MRF source. → prod_rollup_*, prod_traceability_*

prod_combined_abridged vs prod_combined_all

Feature	abridged	all
Row count	Same — one row per ROID across all merged sub-versions	Same
Columns	Subset: canonical fields + whisper enrichments + key metadata	Complete: every rate column, score column, and metadata field
Raw rate arrays	Not included	Included (full `rate_array`, `rate_score_array`, etc.)
Primary consumers	External APIs, customer-facing products	Internal QA, analytics, debugging

Key Fields

Column	Description	Values
`canonical_rate`	Best rate for this ROID	Dollar amount (NULL if no rate)
`canonical_rate_score`	Confidence score	0–5 (0=no rate, 5=validated)
`canonical_rate_source`	Where rate came from	payer, hospital, payer_hospital, imputation
`canonical_rate_type`	Specific column selected, with category prefix	`raw: payer_negotiated_rate`, `transform: …`, `impute: …`
`canonical_rate_subversion`	Which monthly sub-version the winning rate came from	e.g. 2024_11
`canonical_contract_methodology`	Hospital's reported methodology	Fee Schedule, Case Rate, Percent, Per Diem
`canonical_rate_class`	Rate processing level	Raw, Transform, Impute
`canonical_rate_validation_score`	Internal 0–7 accuracy score (full precision)	e.g. 7.00000125

warning

NULL handling: When canonical_rate IS NULL, metadata fields (source, type, methodology) are also NULL. canonical_rate_score = 0. Filter on canonical_rate_score > 1 for non-outlier rates, or canonical_rate_score > 0 for any rate.

Common Filters

Filter	SQL	Use Case
Non-outlier rates	`canonical_rate_score > 1`	Primary analysis — excludes outliers and NULLs
Any rate exists	`canonical_rate_score > 0`	Coverage counts (includes outliers)
No rate	`canonical_rate_score = 0`	Gap analysis
Validated only	`canonical_rate_score = 5`	Highest confidence subset
Raw rates only	`canonical_rate_class = 'Raw'`	Exclude imputations and transforms

Whispers

Dimensional aggregations computed after rate selection and joined onto each ROID row. Whispers add summary context — national averages, provider-level statistics, utilization counts — without affecting canonical rate selection itself.

Why Whispers?

A raw ROID row tells you: "Aetna PPO pays $18,500 for knee replacement at Mass General." A whisper adds: "The national median for knee replacement is$ 12,400, and Mass General's average rate across all payers is $19,200." That context makes the individual rate interpretable.

Whispers are computed after canonical selection and joined onto tmp_int_combined_no_whisp to produce tmp_int_combined. They do not feed into rate selection — no circular dependencies.

8 Whisper Dimensions

Whisper	Partition	Key Stats	Reference Data Joined
Provider	`provider_id`	Provider-level rate aggregates, coverage counts	Hospital scorecards, NASHP cost reports, CareJourney quality/cost scores
Payer	`payer_id`	Median rate, provider count, coverage breadth	Payer-level reference data
Network	`network_id`	Median rate, provider count per network	Network metadata
Code	`billing_code`, `bill_type`	National average rate, estimated national revenue	Komodo utilization (encounter counts by code)
Provider-Payer	`provider_id`, `payer_id`	How many codes this payer covers at this provider	—
Provider-Code	`provider_id`, `billing_code`	Provider's rate for this code across all payers	—
Payer-Code	`payer_id`, `billing_code`	Payer's rate distribution for this code	—
Provider-Payer-Network	`provider_id`, `payer_id`, `network_id`	Most granular multi-dimensional statistics	—

Provider Whisper — Reference Data Detail

The provider whisper is the richest. In addition to rate aggregations from tmp_int_combined_no_whisp, it joins several external datasets:

Hospital Scorecard: Composite, cash price, list price, and v2 scores per provider
NASHP Cost Report: Net profit margin, commercial breakeven, payer mix (Medicare, Medicaid, commercial, charity), drug charge ratios — from NASHP hospital cost reports
CareJourney Scores: Physician counts, quality scores, cost scores per hospital

These enrich every ROID row for that provider — so a query on a single rate at Mass General also exposes Mass General's commercial payer mix and quality score without a separate join.

Code Whisper — Utilization

The code whisper joins Komodo utilization: national_commercial_code_taxonomy_encounters — how many commercial claims were filed nationally for this code. Combined with the national average rate, this gives an estimated national revenue per code:

code_national_estimated_revenue = code_national_avg_canonical_rate × code_national_avg_number_of_claims

Rollup Views

Pre-aggregated views built by the orchestrator for fast analytics queries — without scanning billions of rows in the full production table.

View	Partition	Key Metrics	Typical Use
`prod_rollup_provider`	`provider_id`	Total ROIDs, rated ROIDs, avg rate, payer count	Provider rate landscape — how covered is this hospital?
`prod_rollup_payer`	`payer_id`	Total ROIDs, rated ROIDs, median rate, provider count	Payer rate behavior — how generous or stingy is this payer?
`prod_rollup_network`	`network_id`	Median rate, provider count per network	Network-level rate competitiveness
`prod_rollup_code`	`billing_code`, `bill_type`	Total ROIDs, rated ROIDs, avg rate, provider count	National range for a given service
`prod_rollup_provider_network`	`provider_id`, `network_id`	Per-contract rate statistics	How does a provider's rate vary across networks?
`prod_rollup_cbsa_code`	CBSA, `billing_code`	Regional pricing aggregates	Geographic pricing variation for a given code

Rollups are built from prod_combined_abridged after the merge step. Each rollup applies uniqueness and non-null validation checks before publishing.

info

When to use rollups vs the full table: Rollups are the right starting point for trend analysis, benchmarking, and coverage reporting — they're orders of magnitude faster. Use prod_combined_abridged when you need rate-level detail (e.g., what does Aetna PPO pay for 27447 at Mass General?). Use prod_combined_all when you need the full rate array or internal accuracy scores for debugging.

Traceability

Two traceability tables connect every final canonical rate back to its raw MRF source. Essential for auditing, debugging, and customer support.

Why Traceability?

A user sees canonical_rate = $18,500 for knee replacement at Mass General under Aetna PPO. Traceability answers: which MRF file did that come from? Which exact row in that file? When was it ingested? Was it a payer MRF or hospital MRF?

This matters for:

Customer disputes: "You're showing $18,500 but our contract says$ 16,000" → trace to the exact source record
Regulatory compliance: Transparency rules require published rates to be traceable to source files
Data debugging: When a rate looks wrong, the first step is finding which raw record produced it

Two Traceability Tables

Table	What It Contains	Built By
`prod_traceability_to_raw_data`	Maps `canonical_rate` → source MRF file, raw record ID, ingestion date, payer or hospital source	`build_traceability_to_raw_data()` in orchestrator
`prod_traceability_formulas`	Documents the computation formula for derived rates — e.g., "81% × $22,800 hospital gross charge" for a pct-to-dollar rate	`build_traceability_formulas_table()` in orchestrator

Walk-Through: Tracing a Rate

Canonical rate: $18,500 for MS-DRG 470 (knee replacement), Mass General, Aetna PPO — canonical_rate_type = "raw: payer_negotiated_rate", canonical_rate_subversion = "2024_11"
Step 1: Join prod_traceability_to_raw_data on (roid, canonical_rate_subversion) → find source record
Step 2: Result: Aetna PPO MRF file aetna_machine_readable_2024_11.json, row_id 4829301, ingested 2024-11-08

Result

The $18,500 rate came from Aetna's November 2024 MRF file, row 4829301. If the customer disputes it, you can point directly to the published file and the specific record.

Traceability for Derived Rates

For transformed rates (e.g., a pct-to-dollar conversion), prod_traceability_formulas stores the full formula so the rate is reproducible:

canonical_rate_type: "transform: hospital_perc_of_total_billed_charges_gc_hosp_perc_to_dol"
formula: 0.81 × 22800 = 18468
components: [payer_rate_pct=81%, gross_charge_source=hospital_mrf, gross_charge=22800]

info

Traceability runs in the orchestrator, not the sub-DAG. It needs the final merged canonical rates (across all sub-versions) before it can record which sub-version and raw record each canonical rate came from. Both tables cover all sub-versions in the CLD version.

Overview​

Sub-DAG vs. Orchestrator​

Sub-DAG (one per month)​

Orchestrator (one per version)​

Sub-DAG Output Flow​

Orchestrator Merge Flow​

prod_combined_abridged vs prod_combined_all​

Key Fields​

Common Filters​

Whispers​

Why Whispers?​

8 Whisper Dimensions​

Provider Whisper — Reference Data Detail​

Code Whisper — Utilization​

Rollup Views​

Traceability​

Why Traceability?​

Two Traceability Tables​

Walk-Through: Tracing a Rate​

Traceability for Derived Rates​

On this page: