Skip to main content
Version: 3.0

Production Output

The sub-DAG produces one combined table per month. The orchestrator stitches multiple months together, applies cross-month selection, and publishes the versioned production tables.

Overview

Sub-DAG vs. Orchestrator

Every CLD version covers several months of MRF data (typically 3–4 sub-versions). Each month runs as its own sub-DAG. The orchestrator is a separate DAG that triggers all sub-DAGs, waits for them to complete, then merges results.

Sub-DAG (one per month)

Runs the full pipeline — spines → ROS → raw → transforms → imputations → accuracy → rate selection.

Output: tmp_int_combined_no_whisp_{sub_version}

Then computes 8 whisper tables and joins them.

Output: tmp_int_combined_{sub_version}

Orchestrator (one per version)

Triggers N sub-DAGs in parallel. After all complete, merges all tmp_int_combined_{sub_version} tables.

Applies cross-month canonical selection (same score → rate type hierarchy).

Publishes: prod_combined_abridged, prod_combined_all, rollup views, traceability, external API table.

Sub-DAG Output Flow

  1. Canonical Rate Selection (chunked by payer) — Rate arrays → array_maxcanonical_rate per ROID. → tmp_int_combined_no_whisp_{sub_version}
  2. Whisper Computation (8 parallel tasks) — Dimensional aggregations at provider, payer, network, code, and cross-dimensional levels. Reference data (scorecards, utilization, NASHP) joined in here. → tmp_whisper_provider, tmp_whisper_payer, tmp_whisper_code, …
  3. Add Whispers to Main — All 8 whisper tables joined onto the no-whisp table. → tmp_int_combined_{sub_version}

Orchestrator Merge Flow

  1. Trigger Sub-DAGs — One sub-DAG per month (sub-version). All run in parallel. Orchestrator waits for all to complete before proceeding. → tmp_int_combined_{sub_version} × N months
  2. Merge Combined Chunks — All monthly tmp_int_combined tables merged and chunked (~35M rows/chunk) for manageability. Cross-month canonical selection: highest score wins; same-score ties use Posted > Real-World > Enhanced > Benchmark. → merged_combined_chunks
  3. prod_combined_abridged — Lean, API-ready subset of columns. Used by external APIs (Clear Rates). Built from merged chunks.
  4. prod_combined_all — Full column set. Used for internal analytics and QA. Built in parallel with abridged.
  5. Rollup Views + Traceability — Pre-aggregated views for fast analytics queries. Traceability tables link canonical rates back to their MRF source. → prod_rollup_*, prod_traceability_*

prod_combined_abridged vs prod_combined_all

Featureabridgedall
Row countSame — one row per ROID across all merged sub-versionsSame
ColumnsSubset: canonical fields + whisper enrichments + key metadataComplete: every rate column, score column, and metadata field
Raw rate arraysNot includedIncluded (full rate_array, rate_score_array, etc.)
Primary consumersExternal APIs, customer-facing productsInternal QA, analytics, debugging

Key Fields

ColumnDescriptionValues
canonical_rateBest rate for this ROIDDollar amount (NULL if no rate)
canonical_rate_scoreConfidence score0–5 (0=no rate, 5=validated)
canonical_rate_sourceWhere rate came frompayer, hospital, payer_hospital, imputation
canonical_rate_typeSpecific column selected, with category prefixraw: payer_negotiated_rate, transform: …, impute: …
canonical_rate_subversionWhich monthly sub-version the winning rate came frome.g. 2024_11
canonical_contract_methodologyHospital's reported methodologyFee Schedule, Case Rate, Percent, Per Diem
canonical_rate_classRate processing levelRaw, Transform, Impute
canonical_rate_validation_scoreInternal 0–7 accuracy score (full precision)e.g. 7.00000125
warning

NULL handling: When canonical_rate IS NULL, metadata fields (source, type, methodology) are also NULL. canonical_rate_score = 0. Filter on canonical_rate_score > 1 for non-outlier rates, or canonical_rate_score > 0 for any rate.

Common Filters

FilterSQLUse Case
Non-outlier ratescanonical_rate_score > 1Primary analysis — excludes outliers and NULLs
Any rate existscanonical_rate_score > 0Coverage counts (includes outliers)
No ratecanonical_rate_score = 0Gap analysis
Validated onlycanonical_rate_score = 5Highest confidence subset
Raw rates onlycanonical_rate_class = 'Raw'Exclude imputations and transforms

Whispers

Dimensional aggregations computed after rate selection and joined onto each ROID row. Whispers add summary context — national averages, provider-level statistics, utilization counts — without affecting canonical rate selection itself.

Why Whispers?

A raw ROID row tells you: "Aetna PPO pays 18,500forkneereplacementatMassGeneral."Awhisperadds:"Thenationalmedianforkneereplacementis18,500 for knee replacement at Mass General." A whisper adds: "The national median for knee replacement is 12,400, and Mass General's average rate across all payers is $19,200." That context makes the individual rate interpretable.

Whispers are computed after canonical selection and joined onto tmp_int_combined_no_whisp to produce tmp_int_combined. They do not feed into rate selection — no circular dependencies.

8 Whisper Dimensions

WhisperPartitionKey StatsReference Data Joined
Providerprovider_idProvider-level rate aggregates, coverage countsHospital scorecards, NASHP cost reports, CareJourney quality/cost scores
Payerpayer_idMedian rate, provider count, coverage breadthPayer-level reference data
Networknetwork_idMedian rate, provider count per networkNetwork metadata
Codebilling_code, bill_typeNational average rate, estimated national revenueKomodo utilization (encounter counts by code)
Provider-Payerprovider_id, payer_idHow many codes this payer covers at this provider
Provider-Codeprovider_id, billing_codeProvider's rate for this code across all payers
Payer-Codepayer_id, billing_codePayer's rate distribution for this code
Provider-Payer-Networkprovider_id, payer_id, network_idMost granular multi-dimensional statistics

Provider Whisper — Reference Data Detail

The provider whisper is the richest. In addition to rate aggregations from tmp_int_combined_no_whisp, it joins several external datasets:

  • Hospital Scorecard: Composite, cash price, list price, and v2 scores per provider
  • NASHP Cost Report: Net profit margin, commercial breakeven, payer mix (Medicare, Medicaid, commercial, charity), drug charge ratios — from NASHP hospital cost reports
  • CareJourney Scores: Physician counts, quality scores, cost scores per hospital

These enrich every ROID row for that provider — so a query on a single rate at Mass General also exposes Mass General's commercial payer mix and quality score without a separate join.

Code Whisper — Utilization

The code whisper joins Komodo utilization: national_commercial_code_taxonomy_encounters — how many commercial claims were filed nationally for this code. Combined with the national average rate, this gives an estimated national revenue per code:

code_national_estimated_revenue = code_national_avg_canonical_rate × code_national_avg_number_of_claims

Rollup Views

Pre-aggregated views built by the orchestrator for fast analytics queries — without scanning billions of rows in the full production table.

ViewPartitionKey MetricsTypical Use
prod_rollup_providerprovider_idTotal ROIDs, rated ROIDs, avg rate, payer countProvider rate landscape — how covered is this hospital?
prod_rollup_payerpayer_idTotal ROIDs, rated ROIDs, median rate, provider countPayer rate behavior — how generous or stingy is this payer?
prod_rollup_networknetwork_idMedian rate, provider count per networkNetwork-level rate competitiveness
prod_rollup_codebilling_code, bill_typeTotal ROIDs, rated ROIDs, avg rate, provider countNational range for a given service
prod_rollup_provider_networkprovider_id, network_idPer-contract rate statisticsHow does a provider's rate vary across networks?
prod_rollup_cbsa_codeCBSA, billing_codeRegional pricing aggregatesGeographic pricing variation for a given code

Rollups are built from prod_combined_abridged after the merge step. Each rollup applies uniqueness and non-null validation checks before publishing.

info

When to use rollups vs the full table: Rollups are the right starting point for trend analysis, benchmarking, and coverage reporting — they're orders of magnitude faster. Use prod_combined_abridged when you need rate-level detail (e.g., what does Aetna PPO pay for 27447 at Mass General?). Use prod_combined_all when you need the full rate array or internal accuracy scores for debugging.


Traceability

Two traceability tables connect every final canonical rate back to its raw MRF source. Essential for auditing, debugging, and customer support.

Why Traceability?

A user sees canonical_rate = $18,500 for knee replacement at Mass General under Aetna PPO. Traceability answers: which MRF file did that come from? Which exact row in that file? When was it ingested? Was it a payer MRF or hospital MRF?

This matters for:

  • Customer disputes: "You're showing 18,500butourcontractsays18,500 but our contract says 16,000" → trace to the exact source record
  • Regulatory compliance: Transparency rules require published rates to be traceable to source files
  • Data debugging: When a rate looks wrong, the first step is finding which raw record produced it

Two Traceability Tables

TableWhat It ContainsBuilt By
prod_traceability_to_raw_dataMaps canonical_rate → source MRF file, raw record ID, ingestion date, payer or hospital sourcebuild_traceability_to_raw_data() in orchestrator
prod_traceability_formulasDocuments the computation formula for derived rates — e.g., "81% × $22,800 hospital gross charge" for a pct-to-dollar ratebuild_traceability_formulas_table() in orchestrator

Walk-Through: Tracing a Rate

  1. Canonical rate: $18,500 for MS-DRG 470 (knee replacement), Mass General, Aetna PPO — canonical_rate_type = "raw: payer_negotiated_rate", canonical_rate_subversion = "2024_11"
  2. Step 1: Join prod_traceability_to_raw_data on (roid, canonical_rate_subversion) → find source record
  3. Step 2: Result: Aetna PPO MRF file aetna_machine_readable_2024_11.json, row_id 4829301, ingested 2024-11-08
Result

The $18,500 rate came from Aetna's November 2024 MRF file, row 4829301. If the customer disputes it, you can point directly to the published file and the specific record.

Traceability for Derived Rates

For transformed rates (e.g., a pct-to-dollar conversion), prod_traceability_formulas stores the full formula so the rate is reproducible:

canonical_rate_type: "transform: hospital_perc_of_total_billed_charges_gc_hosp_perc_to_dol"
formula: 0.81 × 22800 = 18468
components: [payer_rate_pct=81%, gross_charge_source=hospital_mrf, gross_charge=22800]
info

Traceability runs in the orchestrator, not the sub-DAG. It needs the final merged canonical rates (across all sub-versions) before it can record which sub-version and raw record each canonical rate came from. Both tables cover all sub-versions in the CLD version.