Spines
Build canonical reference tables defining which payers, networks, providers, and codes are in scope. All downstream phases join against these tables to place rates on the Rate Object Space.
Overview
Build sequence:
- Build Payer + Code + Network spines (parallel) — Payer spine from hardcoded list; code spine from 11-source UNION ALL; network spine from
network_mappings.py(7 steps) →tmp_ref_payer_spine,tmp_ref_code_spine,tmp_ref_network_spine - Build Provider Spine — Waits for payer + code spines. Pulls from
tq_production.spines.spines_provider_{type}tables. Filtered byprovider_typesparam. →tmp_ref_provider_spine - Plausibility Tables — ASC, Code (PG), and DME plausibility built from
core_ratesevidence. No validations — silently feed the Rate Object Space. →tmp_ref_asc_plausibility,tmp_ref_code_plausibility,tmp_ref_dme_plausibility - Provider Spine Manual Additions — Add DPR infusion centers, spine-matched PGs, and synthetic PGs from
pg_identifiers.sql→tmp_ref_provider_spine(updated) - Spine Validations (blocking) — Uniqueness and completeness tests.
bypass_validations=Trueskips this. Failures block the run.
Airflow dependency graph:
payer_spine ─┐
code_spine ─┤─→ asc_plausibility ─→ code_plausibility ─┐
network_spine─┘ │
└──────────────────────→ dme_plausibility ──┤
↓
provider_spine
↓
manual_additions
↓
spine_validations (blocking)
↓
rate_object_space
↓
ros_validations (blocking)
↓
trigger_qa (non-blocking)
The sub-DAG code has a comment "asc_plausibility only needs code_spines" — but the actual Airflow edge waits for all 3 spines (payer + code + network). Always check the actual task graph in __init__.py over comments.
Key concepts:
- Spines gate scope: Adding a payer to
payer.sqlmakes its rates flow through the full pipeline. Removing a provider type from theprovider_typesparam excludes that type from the entire run. Spines are where scope is defined. - Provider ID vs NPI: A
provider_idis CLD's internal identifier. A single provider can have multiple NPIs. The spine aggregates NPIs into an array:npi = ['1234567890', '0987654321']. Payer MRF data (keyed by NPI) is joined viaWHERE cr.provider_npi = ANY(ps.npi). - bypass_validations: When
True, all validation task groups are skipped — useful during development or when validations are known to fail for expected reasons. Never bypass validations in production runs without approval. - lookback_run: When
lookback_run=True, the provider spine excludes ASC, PG, Dialysis, DME, and Urgent Care. Lookback runs re-process historical months with updated payer MRF data only.
Payer Spine
Canonical list of payers CLD processes in this version. A hardcoded Jinja2 list defines the universe — add a payer ID here and its rates flow through the entire pipeline. (~90 payers)
How It Works
payer.sql builds the spine in 4 steps, all within a single SQL file:
- Jinja2 set (pre-SQL): The
{% set payer_ids = [...] %}block is a Jinja2 template directive — it runs before the SQL is sent to Trino. This expands to ~90 UNION ALL rows ofSELECT id AS payer_payer_id. - payer_ids CTE: Materializes the list as SQL rows.
- payers CTE: Maps each
payer_payer_idtodefinitive_payer_id(Definitive Healthcare entity IDs) andblue_payer(BCBS flag). Both are hardcoded CASE/IN expressions — not derived from any lookup table. - Final SELECT: Joins to
tq_production.spines.spines_payerto resolvepayer_name. If a payer ID isn't inspines_payer, it drops silently.
Worked Example: Aetna (payer_id = '42')
- Input:
'42'appears in the Jinja2payer_idslist - definitive_payer_id:
ARRAY[575957, 585334, 575927, 575914, 579862, 575931, 553230, 575890, 575915, 579863, 552470]— 11 Definitive entities because Aetna operates under multiple legal entities - blue_payer:
True— Aetna is in the BCBS IN list - payer_name:
'Aetna'— resolved fromtq_production.spines.spines_payervia JOIN
Final row: payer_id='42', payer_name='Aetna', definitive_payer_id=[575957,...], blue_payer=True
Worked Example: Multiplan (payer_id = '101')
- definitive_payer_id:
NULL— Multiplan is a network rental company, not tracked in Definitive Healthcare - blue_payer:
False
Final row: payer_id='101', payer_name='Multiplan', definitive_payer_id=NULL, blue_payer=False
Key Columns
| Column | Type | Description |
|---|---|---|
payer_id | VARCHAR | CLD's stable payer identifier. Cast from integer in the source. E.g., '42'=Aetna, '76'=Cigna, '643'=UHC, '169'=Humana. |
payer_name | STRING | Human-readable name resolved from tq_production.spines.spines_payer via JOIN. Not stored in payer.sql itself. |
definitive_payer_id | ARRAY[BIGINT] | Definitive Healthcare entity IDs for enrichment joins. NULL for ~10 payers. ARRAY because some payers (Aetna: 11, BCBS: 3, Kaiser: 2) span multiple entities. |
blue_payer | BOOLEAN | True if this is a Blue Cross Blue Shield plan. Hardcoded IN list of ~45 payer IDs. |
definitive_payer_id is an ARRAY because some payers operate under multiple Definitive Healthcare legal entities. When joining downstream (e.g., in whispers), use CROSS JOIN UNNEST(definitive_payer_id) AS t(def_id) to expand.
payer_id is VARCHAR, not INTEGER. The source column is cast with CAST(payer_payer_id AS VARCHAR). When joining to other tables that store payer_id as bigint (e.g., hospital_rates.payer_id), cast explicitly: CAST(hospital_rates.payer_id AS VARCHAR) = payer_spine.payer_id.
Provider Spine
Enumerate all providers by type, with NPI arrays, geographic attributes, and taxonomy groupings. Built from the Spines DAG's per-type source tables (tq_production.spines.spines_provider_{type}).
How It Works
provider.sql has two layers of CTEs per provider type:
- Array NPI CTE (e.g.,
hospital_array_npi): unions the primary NPI table with the additional NPIs table, then aggregates all NPIs into a sorted array withARRAY_SORT(ARRAY_AGG(DISTINCT CAST(npi AS VARCHAR))). - Final provider CTE (e.g.,
hospitals): joins the spine table to the NPI array, adds geo columns (state, cbsa, zip, carrier_locality), taxonomy columns, and a hardcodedprovider_typestring.
The CTEs are assembled at the bottom with a Jinja2-conditional UNION ALL — only types in the provider_types parameter are included:
{% if 'Hospital' in provider_types %} SELECT ... FROM hospitals
{% if 'ASC' in provider_types %} UNION ALL SELECT ... FROM ascs
{% if 'Physician Group' in provider_types %} UNION ALL SELECT ... FROM physician_groups
{% if 'Laboratory' in provider_types %} UNION ALL SELECT ... FROM labs
{% if 'Imaging Center' in provider_types %} UNION ALL SELECT ... FROM imaging_centers
{% if 'Dialysis Center' in provider_types %} UNION ALL SELECT ... FROM dialysis_centers
{% if 'DME' in provider_types %} UNION ALL SELECT ... FROM dme
{% if 'Urgent Care' in provider_types %} UNION ALL SELECT ... FROM urgent_care
Worked Example: Lab Filtering
- Stage 1 — core_rates filter:
labs_in_core_ratesjoins labs tocore_ratesvia NPI. Only labs that have actually appeared in payer MRF data pass through. - Stage 2 — national_payer_coverage override: Labs with
national_payer_coverage = TRUEare included regardless — Quest/LabCorp have national contracts that may not show up in every payer's MRF. - Quest/LabCorp dedup: These chains have many locations sharing the same
provider_idwith different states. The spine disambiguates by appending state:provider_name = 'Quest Diagnostics CA','Quest Diagnostics TX', etc.
Quest Diagnostics CA (payer_id=..., state='CA'), Quest Diagnostics TX (..., state='TX') — separate ROIDs per state.
carrier_locality column: Joined from REF_CMS_ZIP_CARRIER_LOCALITY by zip code. Used by the Benchmarks phase for Medicare Physician Fee Schedule (MPFS) pricing. MPFS rates vary by carrier locality.
Provider Types
| Type | Source Table | Special Handling | Scale |
|---|---|---|---|
| Hospital | spines_provider_hospitals + _additional_npis | Subtypes: Critical Access, Children's, Rehab, etc. CCN (medicare_provider_id). Health system hierarchy. | ~6,000 |
| ASC | spines_provider_ascs + _additional_npis | AmSurg specialty matching. Used for ASC code plausibility. | ~5,500 |
| Physician Group | spines_provider_physician_groups + TINs | Associated TINs array. Health system membership. Code plausibility required. | ~35,000 |
| Laboratory | spines_provider_labs + _additional_npis | Filtered to core_rates presence OR national_payer_coverage=True. Quest/LabCorp deduplicated by state. | ~500 |
| Imaging Center | spines_provider_imaging_centers (no additional NPIs) | Can share codes with PG (infusion center overlap). | ~3,000 |
| Dialysis Center | spines_provider_dialysis_centers + _additional_npis | Added in v2.4.3. Excluded when lookback_run=True. | ~7,000 |
| DME | spines_provider_dme + _additional_npis | DME plausibility required from core_rates evidence. Excluded when lookback_run=True. | ~2,000 |
| Urgent Care | spines_provider_urgent_care + _additional_npis | Added in v2.4.3. Excluded when lookback_run=True. | ~10,000 |
Default provider_types param: ['Hospital', 'ASC', 'Physician Group', 'Laboratory', 'Imaging Center', 'Dialysis Center', 'DME', 'Urgent Care']. When lookback_run=True, only Hospital is included.
Key gotcha: The npi column is an ARRAY[VARCHAR], not a scalar. When joining from core_rates (which has a single NPI per row as a bigint), use WHERE cr.provider_npi = ANY(CAST(ps.npi AS ARRAY(BIGINT))) — both the ARRAY element type and the type cast matter.
Network Spine
All payer-network combinations: PPO, HMO, Exchange, Narrow, Medicare Advantage. Built in 7 sequential SQL steps from manually curated network mappings.
How It Works
All 7 SQL files share the same {% macro network_row(...) %} Jinja2 macro. Each call computes:
network_id:from_big_endian_64(xxhash64(CAST(payer_id || network_name AS varbinary)))— a deterministic 64-bit integernetwork_state: An ARRAY of state strings, or a single-element ARRAY (many networks are national)payer_data_network: An ARRAY of exact MRF plan name stringsnetwork_class/network_type: Hardcoded per file (e.g., 'Commercial' / 'PPO')
The mappings come from network_mappings.py in the DAG config, passed as Airflow variables. The SQL files themselves contain no raw data — they're pure templates.
Worked Example: UHC Choice Plus PPO
network_mappings.pyentry:payer_id='643',network_name='UHC Choice Plus PPO',payer_data_network=['CHOICE PLUS', 'CHOICE PLUS NATIONAL'],states=[NULL](national)network_idcomputed:from_big_endian_64(xxhash64(CAST('643UHC Choice Plus PPO' AS varbinary)))→ same integer every runpayer_data_network:ARRAY['CHOICE PLUS', 'CHOICE PLUS NATIONAL']
Spine row: network_id=X, network_name='UHC Choice Plus PPO', payer_id='643', payer_data_network=['CHOICE PLUS', 'CHOICE PLUS NATIONAL'], network_class='Commercial', network_type='PPO'
Where to make changes: When adding or renaming a network, update two places:
network_mappings.py— the single source of truth for network definitions (payer_id,network_name,payer_data_networkstrings). The network spine SQL is generated directly from this file.variables_ppo.sql—ppo_payer_keywords— keyword strings used to match hospital MRFplan_name. These are matched against a normalizedplan_name:REPLACE(REPLACE(UPPER(plan_name),'_',' '),' ',''). Write keywords ALLCAPS with NO SPACES (e.g.,'CHOICEPLUS', not'Choice Plus').
Silent failure mode: If the payer_data_network string in network_mappings.py doesn't match the MRF's plan_name (for payer MRF) or keywords don't match the normalized plan_name (for hospital MRF), no error is raised — rates simply aren't assigned to that network. Always verify with a research query on hospital_rates.plan_name after adding a network.
Network Types
| SQL File | Network Type | network_class | Build Step | Notes |
|---|---|---|---|---|
network_base.sql | PPO | Commercial | CREATE OR REPLACE | Creates the table; all other steps INSERT into it |
network_hmo.sql | HMO | Commercial | INSERT INTO | Regional HMO plans (e.g., Kaiser CA) |
network_narrow.sql | Narrow | Commercial | INSERT INTO | Narrow/tiered networks (smaller provider panels) |
network_exchange.sql | Exchange | Exchange | INSERT INTO | ACA Marketplace plans — Hospital provider type only in ROS |
network_custom.sql | Custom PPO + HMO | Commercial | INSERT INTO | Custom commercial networks not in standard PPO/HMO lists |
network_medicare_advantage.sql | Medicare Advantage | Medicare Advantage | INSERT INTO | network_class AND network_type both = 'Medicare Advantage' |
network_dedup.sql | — | — | CREATE OR REPLACE | SELECT DISTINCT * removes any duplicate rows from prior INSERTs |
payer_data_network — the MRF join key:
The payer_data_network column is an ARRAY of exact plan name strings as they appear in payer MRF files. This is how the Raw Data phase assigns network_id to payer MRF rates.
- Network spine row:
network_name="UHC Choice Plus PPO",payer_data_network=['CHOICE PLUS', 'CHOICE PLUS NATIONAL'] - Payer MRF rate:
plan_name="CHOICE PLUS"for provider X, code 99213, rate $180 - Join condition:
hospital_rates.plan_name = ANY(network_spine.payer_data_network)
Rate is assigned: (provider X, payer 643, network=UHC Choice Plus PPO, code 99213, bill_type=Professional) → ROID → $180
payer_data_network is for payer MRF (core_rates) only — exact match. The join is hospital_rates.plan_name = ANY(payer_data_network) with no normalization. 'CHOICE PLUS' and 'Choice Plus' are different strings. Hospital MRF (hospital_rates) uses a separate keyword matching path with normalized plan_name (ALLCAPS, no spaces) — see ppo_payer_keywords in variables_ppo.sql.
Deterministic network_id:
Because the hash is deterministic, the same (payer_id, network_name) pair always produces the same network_id — even across different CLD versions, different months, or re-runs. This makes network_id a stable foreign key you can compare across versions.
Stability cuts both ways. If you rename a network (e.g., "Aetna Choice PPO" → "Aetna Choice Plus"), the network_id changes. All historical rates keyed by the old network_id become orphaned. Rename networks only intentionally, and be aware of cross-version comparison implications.
You can compute network_id offline. SELECT from_big_endian_64(xxhash64(CAST('643' || 'UHC Choice Plus PPO' AS varbinary))) and use that value directly in your WHERE clause.
Code Spine
All billing codes in scope with type classification, bill type assignment, and code flags. Built from 11 source-specific codesets via a UNION ALL in code.sql.
How It Works
code.sql is structured in 3 layers:
- Prep CTEs: Compute revenue, filter drug codes, build lab revenue ranking, pull SSP line codes.
all_codesCTE: The 11-source UNION ALL — assembles(bill_type, billing_code_type, billing_code, facility)from every source.- Final SELECT: Deduplicates
all_codes, then LEFT JOINs in metadata:service_description,service_line,therapeutic_area,drug_name_array,is_surg_code,is_drug_code,is_lab_code,is_device_dependent.
facility flag — PG codes are duplicated:
Physician Group codes appear twice in the spine — once with facility=True and once with facility=False. This reflects the Medicare two-rate system:
facility=True: The "facility rate" — lower, because the facility covers overhead. Used when a PG physician sees a patient in a hospital.facility=False: The "non-facility rate" — higher, covering physician's overhead. Used for office visits.
Same HCPCS code 99214 appears twice: (Professional, HCPCS, 99214, facility=True) and (Professional, HCPCS, 99214, facility=False) — each gets its own ROID.
is_surg_code — 3 sources of truth:
A code is flagged surgical if it appears in ANY of:
- OPG (Outpatient Procedure Grouper) schedules from UHC, Aetna, Cigna, or BCBS CA
- HCPCS codes whose first digit is 1–6 (CMS numeric prefix convention for surgical procedures)
- MS-DRGs marked as
type = 'SURG'in CMS MS-DRG weights table
APR-DRG — the SOI suffix pattern:
- Source: APR-DRG base codes (e.g., '302') from SPINES_SERVICES where
service_type='APR-DRG' - CROSS JOIN: Each code is crossed with
['-1', '-2', '-3', '-4']to produce 4 rows - Result: '302' → '302-1' (minor), '302-2' (moderate), '302-3' (major), '302-4' (extreme)
~600 APR-DRG base codes × 4 SOI levels = ~2,400 APR-DRG rows in the code spine.
11 Code Sources
| # | Source | Bill Type | Provider Types | Managed By |
|---|---|---|---|---|
| 1 | All MS-DRG codes (excluding transplant + CAR-T/ECMO) | Inpatient | Hospitals | sub-DAG (SPINES_SERVICES) |
| 2 | APR-DRG × SOI suffix (-1 to -4) | Inpatient | Hospitals (NY/NJ markets) | sub-DAG (SPINES_SERVICES) |
| 3 | OUTPATIENT_CODESET (top ~5K HCPCS by Medicare revenue) | Outpatient | Hospitals | cld-utils DAG (date-stamped table) |
| 4 | ASC_CODESET (ASC-payable procedure codes) | Outpatient | ASCs | cld-utils DAG (date-stamped table) |
| 5 | SSP line codes (Shoppable Service Packages) | Outpatient + Professional | Hospitals | sub-DAG (SSP_SUB_PACKAGES) |
| 6 | J1 HCPCS (OPPS status indicator J1) | Outpatient | Hospitals | sub-DAG (TMP_OPPS_MEDICARE_REFERENCE_TABLE) |
| 7 | Life Sciences drug codes (ASP + OPPS G/K) | Outpatient + Professional | All | sub-DAG (ASP_PRICING_TABLE, TMP_OPPS) |
| 8a | PG codeset — facility=True | Professional | Physician Groups | cld-utils DAG (date-stamped table) |
| 8b | PG codeset — facility=False | Professional | Physician Groups | cld-utils DAG (date-stamped table) |
| 9 | Lab codeset (Labs + PGs) | Professional | Labs + Hospitals + PGs | cld-utils DAG (LAB_HOSPITAL_AND_PGS_CODESET) |
| 10 | Top 750 lab codes by Komodo revenue | Professional | Labs | sub-DAG (CLAIMS_MEDICAL_SERVICE_LINES) |
| 11 | DME codeset | Professional | DME Suppliers | cld-utils DAG (date-stamped table) |
| — | Dialysis + Urgent Care codesets | Professional | Dialysis, Urgent Care | cld-utils DAG (date-stamped tables) |
cld-utils vs sub-DAG: Sources 3, 4, 8, 9, 11 are date-stamped tables (e.g., outpatient_codeset_2026_03_24) built by the cld-utils DAG. After a cld-utils rebuild, update the pointer in params.py. Sources 1, 2, 5, 6, 7, 10 are computed fresh each run from live reference tables.
Code Flags
| Flag | How it's set | Pipeline impact |
|---|---|---|
is_drug_code | billing_code appears in dpr_codes (ASP pricing table UNION OPPS G/K status) | Drug dosage standardization in Transformations; ASP-based outlier bounds (0.8×–10× ASP); separate BRIT accuracy scoring path |
is_lab_code | billing_code in LAB_HOSPITAL_AND_PGS_CODESET OR CLINICAL_LAB_REFERENCE_PRICING | Lab outlier bounds (0.2×–4.5× Medicare fee schedule); lab-specific BRIT accuracy |
is_surg_code | In OPG schedules OR HCPCS prefix 1-6 OR MS-DRG type='SURG' | OPG-based imputations in Imputations phase |
is_device_dependent | billing_code in DEVICE_DEPENDENT_CODES table | Signals that rate may include implantable device cost |
MS-DRG exclusions: Transplant codes (001–017, 019) and CAR-T/ECMO codes (650–652) are excluded — too specialized, too few covered lives. MS-DRG 018 (CAR-T) IS included but restricted to designated CAR-T treatment centers via plausibility filter.
Drug codes are inserted twice: Life Sciences codes appear as both Outpatient and Professional bill_type. PG drug codes (in dpr_codes) are EXCLUDED from source 8 (PG codeset) to avoid double-counting — they're covered by source 7.