Production Output
The sub-DAG produces one combined table per month. The orchestrator stitches multiple months together, applies cross-month selection, and publishes the versioned production tables.
Overview
Sub-DAG vs. Orchestrator
Every CLD version covers several months of MRF data (typically 3–4 sub-versions). Each month runs as its own sub-DAG. The orchestrator is a separate DAG that triggers all sub-DAGs, waits for them to complete, then merges results.
Sub-DAG (one per month)
Runs the full pipeline — spines → ROS → raw → transforms → imputations → accuracy → rate selection.
Output: tmp_int_combined_no_whisp_{sub_version}
Then computes 8 whisper tables and joins them.
Output: tmp_int_combined_{sub_version}
Orchestrator (one per version)
Triggers N sub-DAGs in parallel. After all complete, merges all tmp_int_combined_{sub_version} tables.
Applies cross-month canonical selection (same score → rate type hierarchy).
Publishes: prod_combined_abridged, prod_combined_all, rollup views, traceability, external API table.
Sub-DAG Output Flow
- Canonical Rate Selection (chunked by payer) — Rate arrays →
array_max→canonical_rateper ROID. →tmp_int_combined_no_whisp_{sub_version} - Whisper Computation (8 parallel tasks) — Dimensional aggregations at provider, payer, network, code, and cross-dimensional levels. Reference data (scorecards, utilization, NASHP) joined in here. →
tmp_whisper_provider,tmp_whisper_payer,tmp_whisper_code, … - Add Whispers to Main — All 8 whisper tables joined onto the no-whisp table. →
tmp_int_combined_{sub_version}
Orchestrator Merge Flow
- Trigger Sub-DAGs — One sub-DAG per month (sub-version). All run in parallel. Orchestrator waits for all to complete before proceeding. →
tmp_int_combined_{sub_version}× N months - Merge Combined Chunks — All monthly
tmp_int_combinedtables merged and chunked (~35M rows/chunk) for manageability. Cross-month canonical selection: highest score wins; same-score ties use Posted > Real-World > Enhanced > Benchmark. →merged_combined_chunks prod_combined_abridged— Lean, API-ready subset of columns. Used by external APIs (Clear Rates). Built from merged chunks.prod_combined_all— Full column set. Used for internal analytics and QA. Built in parallel with abridged.- Rollup Views + Traceability — Pre-aggregated views for fast analytics queries. Traceability tables link canonical rates back to their MRF source. →
prod_rollup_*,prod_traceability_*
prod_combined_abridged vs prod_combined_all
| Feature | abridged | all |
|---|---|---|
| Row count | Same — one row per ROID across all merged sub-versions | Same |
| Columns | Subset: canonical fields + whisper enrichments + key metadata | Complete: every rate column, score column, and metadata field |
| Raw rate arrays | Not included | Included (full rate_array, rate_score_array, etc.) |
| Primary consumers | External APIs, customer-facing products | Internal QA, analytics, debugging |
Key Fields
| Column | Description | Values |
|---|---|---|
canonical_rate | Best rate for this ROID | Dollar amount (NULL if no rate) |
canonical_rate_score | Confidence score | 0–5 (0=no rate, 5=validated) |
canonical_rate_source | Where rate came from | payer, hospital, payer_hospital, imputation |
canonical_rate_type | Specific column selected, with category prefix | raw: payer_negotiated_rate, transform: …, impute: … |
canonical_rate_subversion | Which monthly sub-version the winning rate came from | e.g. 2024_11 |
canonical_contract_methodology | Hospital's reported methodology | Fee Schedule, Case Rate, Percent, Per Diem |
canonical_rate_class | Rate processing level | Raw, Transform, Impute |
canonical_rate_validation_score | Internal 0–7 accuracy score (full precision) | e.g. 7.00000125 |
NULL handling: When canonical_rate IS NULL, metadata fields (source, type, methodology) are also NULL. canonical_rate_score = 0. Filter on canonical_rate_score > 1 for non-outlier rates, or canonical_rate_score > 0 for any rate.
Common Filters
| Filter | SQL | Use Case |
|---|---|---|
| Non-outlier rates | canonical_rate_score > 1 | Primary analysis — excludes outliers and NULLs |
| Any rate exists | canonical_rate_score > 0 | Coverage counts (includes outliers) |
| No rate | canonical_rate_score = 0 | Gap analysis |
| Validated only | canonical_rate_score = 5 | Highest confidence subset |
| Raw rates only | canonical_rate_class = 'Raw' | Exclude imputations and transforms |
Whispers
Dimensional aggregations computed after rate selection and joined onto each ROID row. Whispers add summary context — national averages, provider-level statistics, utilization counts — without affecting canonical rate selection itself.
Why Whispers?
A raw ROID row tells you: "Aetna PPO pays 12,400, and Mass General's average rate across all payers is $19,200." That context makes the individual rate interpretable.
Whispers are computed after canonical selection and joined onto tmp_int_combined_no_whisp to produce tmp_int_combined. They do not feed into rate selection — no circular dependencies.
8 Whisper Dimensions
| Whisper | Partition | Key Stats | Reference Data Joined |
|---|---|---|---|
| Provider | provider_id | Provider-level rate aggregates, coverage counts | Hospital scorecards, NASHP cost reports, CareJourney quality/cost scores |
| Payer | payer_id | Median rate, provider count, coverage breadth | Payer-level reference data |
| Network | network_id | Median rate, provider count per network | Network metadata |
| Code | billing_code, bill_type | National average rate, estimated national revenue | Komodo utilization (encounter counts by code) |
| Provider-Payer | provider_id, payer_id | How many codes this payer covers at this provider | — |
| Provider-Code | provider_id, billing_code | Provider's rate for this code across all payers | — |
| Payer-Code | payer_id, billing_code | Payer's rate distribution for this code | — |
| Provider-Payer-Network | provider_id, payer_id, network_id | Most granular multi-dimensional statistics | — |
Provider Whisper — Reference Data Detail
The provider whisper is the richest. In addition to rate aggregations from tmp_int_combined_no_whisp, it joins several external datasets:
- Hospital Scorecard: Composite, cash price, list price, and v2 scores per provider
- NASHP Cost Report: Net profit margin, commercial breakeven, payer mix (Medicare, Medicaid, commercial, charity), drug charge ratios — from NASHP hospital cost reports
- CareJourney Scores: Physician counts, quality scores, cost scores per hospital
These enrich every ROID row for that provider — so a query on a single rate at Mass General also exposes Mass General's commercial payer mix and quality score without a separate join.
Code Whisper — Utilization
The code whisper joins Komodo utilization: national_commercial_code_taxonomy_encounters — how many commercial claims were filed nationally for this code. Combined with the national average rate, this gives an estimated national revenue per code:
code_national_estimated_revenue = code_national_avg_canonical_rate × code_national_avg_number_of_claims
Rollup Views
Pre-aggregated views built by the orchestrator for fast analytics queries — without scanning billions of rows in the full production table.
| View | Partition | Key Metrics | Typical Use |
|---|---|---|---|
prod_rollup_provider | provider_id | Total ROIDs, rated ROIDs, avg rate, payer count | Provider rate landscape — how covered is this hospital? |
prod_rollup_payer | payer_id | Total ROIDs, rated ROIDs, median rate, provider count | Payer rate behavior — how generous or stingy is this payer? |
prod_rollup_network | network_id | Median rate, provider count per network | Network-level rate competitiveness |
prod_rollup_code | billing_code, bill_type | Total ROIDs, rated ROIDs, avg rate, provider count | National range for a given service |
prod_rollup_provider_network | provider_id, network_id | Per-contract rate statistics | How does a provider's rate vary across networks? |
prod_rollup_cbsa_code | CBSA, billing_code | Regional pricing aggregates | Geographic pricing variation for a given code |
Rollups are built from prod_combined_abridged after the merge step. Each rollup applies uniqueness and non-null validation checks before publishing.
When to use rollups vs the full table: Rollups are the right starting point for trend analysis, benchmarking, and coverage reporting — they're orders of magnitude faster. Use prod_combined_abridged when you need rate-level detail (e.g., what does Aetna PPO pay for 27447 at Mass General?). Use prod_combined_all when you need the full rate array or internal accuracy scores for debugging.
Traceability
Two traceability tables connect every final canonical rate back to its raw MRF source. Essential for auditing, debugging, and customer support.
Why Traceability?
A user sees canonical_rate = $18,500 for knee replacement at Mass General under Aetna PPO. Traceability answers: which MRF file did that come from? Which exact row in that file? When was it ingested? Was it a payer MRF or hospital MRF?
This matters for:
- Customer disputes: "You're showing 16,000" → trace to the exact source record
- Regulatory compliance: Transparency rules require published rates to be traceable to source files
- Data debugging: When a rate looks wrong, the first step is finding which raw record produced it
Two Traceability Tables
| Table | What It Contains | Built By |
|---|---|---|
prod_traceability_to_raw_data | Maps canonical_rate → source MRF file, raw record ID, ingestion date, payer or hospital source | build_traceability_to_raw_data() in orchestrator |
prod_traceability_formulas | Documents the computation formula for derived rates — e.g., "81% × $22,800 hospital gross charge" for a pct-to-dollar rate | build_traceability_formulas_table() in orchestrator |
Walk-Through: Tracing a Rate
- Canonical rate: $18,500 for MS-DRG 470 (knee replacement), Mass General, Aetna PPO —
canonical_rate_type = "raw: payer_negotiated_rate",canonical_rate_subversion = "2024_11" - Step 1: Join
prod_traceability_to_raw_dataon(roid, canonical_rate_subversion)→ find source record - Step 2: Result: Aetna PPO MRF file
aetna_machine_readable_2024_11.json,row_id 4829301, ingested 2024-11-08
The $18,500 rate came from Aetna's November 2024 MRF file, row 4829301. If the customer disputes it, you can point directly to the published file and the specific record.
Traceability for Derived Rates
For transformed rates (e.g., a pct-to-dollar conversion), prod_traceability_formulas stores the full formula so the rate is reproducible:
canonical_rate_type: "transform: hospital_perc_of_total_billed_charges_gc_hosp_perc_to_dol"
formula: 0.81 × 22800 = 18468
components: [payer_rate_pct=81%, gross_charge_source=hospital_mrf, gross_charge=22800]
Traceability runs in the orchestrator, not the sub-DAG. It needs the final merged canonical rates (across all sub-versions) before it can record which sub-version and raw record each canonical rate came from. Both tables cover all sub-versions in the CLD version.