Skip to main content
Version: 3.0

Spines

Build canonical reference tables defining which payers, networks, providers, and codes are in scope. All downstream phases join against these tables to place rates on the Rate Object Space.

Overview

Build sequence:

  1. Build Payer + Code + Network spines (parallel) — Payer spine from hardcoded list; code spine from 11-source UNION ALL; network spine from network_mappings.py (7 steps) → tmp_ref_payer_spine, tmp_ref_code_spine, tmp_ref_network_spine
  2. Build Provider Spine — Waits for payer + code spines. Pulls from tq_production.spines.spines_provider_{type} tables. Filtered by provider_types param. → tmp_ref_provider_spine
  3. Plausibility Tables — ASC, Code (PG), and DME plausibility built from core_rates evidence. No validations — silently feed the Rate Object Space. → tmp_ref_asc_plausibility, tmp_ref_code_plausibility, tmp_ref_dme_plausibility
  4. Provider Spine Manual Additions — Add DPR infusion centers, spine-matched PGs, and synthetic PGs from pg_identifiers.sqltmp_ref_provider_spine (updated)
  5. Spine Validations (blocking) — Uniqueness and completeness tests. bypass_validations=True skips this. Failures block the run.

Airflow dependency graph:

payer_spine  ─┐
code_spine ─┤─→ asc_plausibility ─→ code_plausibility ─┐
network_spine─┘ │
└──────────────────────→ dme_plausibility ──┤

provider_spine

manual_additions

spine_validations (blocking)

rate_object_space

ros_validations (blocking)

trigger_qa (non-blocking)
warning

The sub-DAG code has a comment "asc_plausibility only needs code_spines" — but the actual Airflow edge waits for all 3 spines (payer + code + network). Always check the actual task graph in __init__.py over comments.

Key concepts:

  • Spines gate scope: Adding a payer to payer.sql makes its rates flow through the full pipeline. Removing a provider type from the provider_types param excludes that type from the entire run. Spines are where scope is defined.
  • Provider ID vs NPI: A provider_id is CLD's internal identifier. A single provider can have multiple NPIs. The spine aggregates NPIs into an array: npi = ['1234567890', '0987654321']. Payer MRF data (keyed by NPI) is joined via WHERE cr.provider_npi = ANY(ps.npi).
  • bypass_validations: When True, all validation task groups are skipped — useful during development or when validations are known to fail for expected reasons. Never bypass validations in production runs without approval.
  • lookback_run: When lookback_run=True, the provider spine excludes ASC, PG, Dialysis, DME, and Urgent Care. Lookback runs re-process historical months with updated payer MRF data only.

Payer Spine

Canonical list of payers CLD processes in this version. A hardcoded Jinja2 list defines the universe — add a payer ID here and its rates flow through the entire pipeline. (~90 payers)

How It Works

payer.sql builds the spine in 4 steps, all within a single SQL file:

  1. Jinja2 set (pre-SQL): The {% set payer_ids = [...] %} block is a Jinja2 template directive — it runs before the SQL is sent to Trino. This expands to ~90 UNION ALL rows of SELECT id AS payer_payer_id.
  2. payer_ids CTE: Materializes the list as SQL rows.
  3. payers CTE: Maps each payer_payer_id to definitive_payer_id (Definitive Healthcare entity IDs) and blue_payer (BCBS flag). Both are hardcoded CASE/IN expressions — not derived from any lookup table.
  4. Final SELECT: Joins to tq_production.spines.spines_payer to resolve payer_name. If a payer ID isn't in spines_payer, it drops silently.

Worked Example: Aetna (payer_id = '42')

  1. Input: '42' appears in the Jinja2 payer_ids list
  2. definitive_payer_id: ARRAY[575957, 585334, 575927, 575914, 579862, 575931, 553230, 575890, 575915, 579863, 552470] — 11 Definitive entities because Aetna operates under multiple legal entities
  3. blue_payer: True — Aetna is in the BCBS IN list
  4. payer_name: 'Aetna' — resolved from tq_production.spines.spines_payer via JOIN
Result

Final row: payer_id='42', payer_name='Aetna', definitive_payer_id=[575957,...], blue_payer=True

Worked Example: Multiplan (payer_id = '101')

  • definitive_payer_id: NULL — Multiplan is a network rental company, not tracked in Definitive Healthcare
  • blue_payer: False
Result

Final row: payer_id='101', payer_name='Multiplan', definitive_payer_id=NULL, blue_payer=False

Key Columns

ColumnTypeDescription
payer_idVARCHARCLD's stable payer identifier. Cast from integer in the source. E.g., '42'=Aetna, '76'=Cigna, '643'=UHC, '169'=Humana.
payer_nameSTRINGHuman-readable name resolved from tq_production.spines.spines_payer via JOIN. Not stored in payer.sql itself.
definitive_payer_idARRAY[BIGINT]Definitive Healthcare entity IDs for enrichment joins. NULL for ~10 payers. ARRAY because some payers (Aetna: 11, BCBS: 3, Kaiser: 2) span multiple entities.
blue_payerBOOLEANTrue if this is a Blue Cross Blue Shield plan. Hardcoded IN list of ~45 payer IDs.
info

definitive_payer_id is an ARRAY because some payers operate under multiple Definitive Healthcare legal entities. When joining downstream (e.g., in whispers), use CROSS JOIN UNNEST(definitive_payer_id) AS t(def_id) to expand.

warning

payer_id is VARCHAR, not INTEGER. The source column is cast with CAST(payer_payer_id AS VARCHAR). When joining to other tables that store payer_id as bigint (e.g., hospital_rates.payer_id), cast explicitly: CAST(hospital_rates.payer_id AS VARCHAR) = payer_spine.payer_id.


Provider Spine

Enumerate all providers by type, with NPI arrays, geographic attributes, and taxonomy groupings. Built from the Spines DAG's per-type source tables (tq_production.spines.spines_provider_{type}).

How It Works

provider.sql has two layers of CTEs per provider type:

  1. Array NPI CTE (e.g., hospital_array_npi): unions the primary NPI table with the additional NPIs table, then aggregates all NPIs into a sorted array with ARRAY_SORT(ARRAY_AGG(DISTINCT CAST(npi AS VARCHAR))).
  2. Final provider CTE (e.g., hospitals): joins the spine table to the NPI array, adds geo columns (state, cbsa, zip, carrier_locality), taxonomy columns, and a hardcoded provider_type string.

The CTEs are assembled at the bottom with a Jinja2-conditional UNION ALL — only types in the provider_types parameter are included:

{% if 'Hospital' in provider_types %}        SELECT ... FROM hospitals
{% if 'ASC' in provider_types %} UNION ALL SELECT ... FROM ascs
{% if 'Physician Group' in provider_types %} UNION ALL SELECT ... FROM physician_groups
{% if 'Laboratory' in provider_types %} UNION ALL SELECT ... FROM labs
{% if 'Imaging Center' in provider_types %} UNION ALL SELECT ... FROM imaging_centers
{% if 'Dialysis Center' in provider_types %} UNION ALL SELECT ... FROM dialysis_centers
{% if 'DME' in provider_types %} UNION ALL SELECT ... FROM dme
{% if 'Urgent Care' in provider_types %} UNION ALL SELECT ... FROM urgent_care

Worked Example: Lab Filtering

  1. Stage 1 — core_rates filter: labs_in_core_rates joins labs to core_rates via NPI. Only labs that have actually appeared in payer MRF data pass through.
  2. Stage 2 — national_payer_coverage override: Labs with national_payer_coverage = TRUE are included regardless — Quest/LabCorp have national contracts that may not show up in every payer's MRF.
  3. Quest/LabCorp dedup: These chains have many locations sharing the same provider_id with different states. The spine disambiguates by appending state: provider_name = 'Quest Diagnostics CA', 'Quest Diagnostics TX', etc.
Result

Quest Diagnostics CA (payer_id=..., state='CA'), Quest Diagnostics TX (..., state='TX') — separate ROIDs per state.

carrier_locality column: Joined from REF_CMS_ZIP_CARRIER_LOCALITY by zip code. Used by the Benchmarks phase for Medicare Physician Fee Schedule (MPFS) pricing. MPFS rates vary by carrier locality.

Provider Types

TypeSource TableSpecial HandlingScale
Hospitalspines_provider_hospitals + _additional_npisSubtypes: Critical Access, Children's, Rehab, etc. CCN (medicare_provider_id). Health system hierarchy.~6,000
ASCspines_provider_ascs + _additional_npisAmSurg specialty matching. Used for ASC code plausibility.~5,500
Physician Groupspines_provider_physician_groups + TINsAssociated TINs array. Health system membership. Code plausibility required.~35,000
Laboratoryspines_provider_labs + _additional_npisFiltered to core_rates presence OR national_payer_coverage=True. Quest/LabCorp deduplicated by state.~500
Imaging Centerspines_provider_imaging_centers (no additional NPIs)Can share codes with PG (infusion center overlap).~3,000
Dialysis Centerspines_provider_dialysis_centers + _additional_npisAdded in v2.4.3. Excluded when lookback_run=True.~7,000
DMEspines_provider_dme + _additional_npisDME plausibility required from core_rates evidence. Excluded when lookback_run=True.~2,000
Urgent Carespines_provider_urgent_care + _additional_npisAdded in v2.4.3. Excluded when lookback_run=True.~10,000
info

Default provider_types param: ['Hospital', 'ASC', 'Physician Group', 'Laboratory', 'Imaging Center', 'Dialysis Center', 'DME', 'Urgent Care']. When lookback_run=True, only Hospital is included.

warning

Key gotcha: The npi column is an ARRAY[VARCHAR], not a scalar. When joining from core_rates (which has a single NPI per row as a bigint), use WHERE cr.provider_npi = ANY(CAST(ps.npi AS ARRAY(BIGINT))) — both the ARRAY element type and the type cast matter.


Network Spine

All payer-network combinations: PPO, HMO, Exchange, Narrow, Medicare Advantage. Built in 7 sequential SQL steps from manually curated network mappings.

How It Works

All 7 SQL files share the same {% macro network_row(...) %} Jinja2 macro. Each call computes:

  • network_id: from_big_endian_64(xxhash64(CAST(payer_id || network_name AS varbinary))) — a deterministic 64-bit integer
  • network_state: An ARRAY of state strings, or a single-element ARRAY (many networks are national)
  • payer_data_network: An ARRAY of exact MRF plan name strings
  • network_class / network_type: Hardcoded per file (e.g., 'Commercial' / 'PPO')

The mappings come from network_mappings.py in the DAG config, passed as Airflow variables. The SQL files themselves contain no raw data — they're pure templates.

Worked Example: UHC Choice Plus PPO

  1. network_mappings.py entry: payer_id='643', network_name='UHC Choice Plus PPO', payer_data_network=['CHOICE PLUS', 'CHOICE PLUS NATIONAL'], states=[NULL] (national)
  2. network_id computed: from_big_endian_64(xxhash64(CAST('643UHC Choice Plus PPO' AS varbinary))) → same integer every run
  3. payer_data_network: ARRAY['CHOICE PLUS', 'CHOICE PLUS NATIONAL']
Result

Spine row: network_id=X, network_name='UHC Choice Plus PPO', payer_id='643', payer_data_network=['CHOICE PLUS', 'CHOICE PLUS NATIONAL'], network_class='Commercial', network_type='PPO'

Where to make changes: When adding or renaming a network, update two places:

  1. network_mappings.py — the single source of truth for network definitions (payer_id, network_name, payer_data_network strings). The network spine SQL is generated directly from this file.
  2. variables_ppo.sqlppo_payer_keywords — keyword strings used to match hospital MRF plan_name. These are matched against a normalized plan_name: REPLACE(REPLACE(UPPER(plan_name),'_',' '),' ',''). Write keywords ALLCAPS with NO SPACES (e.g., 'CHOICEPLUS', not 'Choice Plus').
warning

Silent failure mode: If the payer_data_network string in network_mappings.py doesn't match the MRF's plan_name (for payer MRF) or keywords don't match the normalized plan_name (for hospital MRF), no error is raised — rates simply aren't assigned to that network. Always verify with a research query on hospital_rates.plan_name after adding a network.

Network Types

SQL FileNetwork Typenetwork_classBuild StepNotes
network_base.sqlPPOCommercialCREATE OR REPLACECreates the table; all other steps INSERT into it
network_hmo.sqlHMOCommercialINSERT INTORegional HMO plans (e.g., Kaiser CA)
network_narrow.sqlNarrowCommercialINSERT INTONarrow/tiered networks (smaller provider panels)
network_exchange.sqlExchangeExchangeINSERT INTOACA Marketplace plans — Hospital provider type only in ROS
network_custom.sqlCustom PPO + HMOCommercialINSERT INTOCustom commercial networks not in standard PPO/HMO lists
network_medicare_advantage.sqlMedicare AdvantageMedicare AdvantageINSERT INTOnetwork_class AND network_type both = 'Medicare Advantage'
network_dedup.sqlCREATE OR REPLACESELECT DISTINCT * removes any duplicate rows from prior INSERTs

payer_data_network — the MRF join key:

The payer_data_network column is an ARRAY of exact plan name strings as they appear in payer MRF files. This is how the Raw Data phase assigns network_id to payer MRF rates.

  1. Network spine row: network_name="UHC Choice Plus PPO", payer_data_network=['CHOICE PLUS', 'CHOICE PLUS NATIONAL']
  2. Payer MRF rate: plan_name="CHOICE PLUS" for provider X, code 99213, rate $180
  3. Join condition: hospital_rates.plan_name = ANY(network_spine.payer_data_network)
Result

Rate is assigned: (provider X, payer 643, network=UHC Choice Plus PPO, code 99213, bill_type=Professional) → ROID → $180

warning

payer_data_network is for payer MRF (core_rates) only — exact match. The join is hospital_rates.plan_name = ANY(payer_data_network) with no normalization. 'CHOICE PLUS' and 'Choice Plus' are different strings. Hospital MRF (hospital_rates) uses a separate keyword matching path with normalized plan_name (ALLCAPS, no spaces) — see ppo_payer_keywords in variables_ppo.sql.

Deterministic network_id:

Because the hash is deterministic, the same (payer_id, network_name) pair always produces the same network_id — even across different CLD versions, different months, or re-runs. This makes network_id a stable foreign key you can compare across versions.

warning

Stability cuts both ways. If you rename a network (e.g., "Aetna Choice PPO" → "Aetna Choice Plus"), the network_id changes. All historical rates keyed by the old network_id become orphaned. Rename networks only intentionally, and be aware of cross-version comparison implications.

info

You can compute network_id offline. SELECT from_big_endian_64(xxhash64(CAST('643' || 'UHC Choice Plus PPO' AS varbinary))) and use that value directly in your WHERE clause.


Code Spine

All billing codes in scope with type classification, bill type assignment, and code flags. Built from 11 source-specific codesets via a UNION ALL in code.sql.

How It Works

code.sql is structured in 3 layers:

  1. Prep CTEs: Compute revenue, filter drug codes, build lab revenue ranking, pull SSP line codes.
  2. all_codes CTE: The 11-source UNION ALL — assembles (bill_type, billing_code_type, billing_code, facility) from every source.
  3. Final SELECT: Deduplicates all_codes, then LEFT JOINs in metadata: service_description, service_line, therapeutic_area, drug_name_array, is_surg_code, is_drug_code, is_lab_code, is_device_dependent.

facility flag — PG codes are duplicated:

Physician Group codes appear twice in the spine — once with facility=True and once with facility=False. This reflects the Medicare two-rate system:

  1. facility=True: The "facility rate" — lower, because the facility covers overhead. Used when a PG physician sees a patient in a hospital.
  2. facility=False: The "non-facility rate" — higher, covering physician's overhead. Used for office visits.
Result

Same HCPCS code 99214 appears twice: (Professional, HCPCS, 99214, facility=True) and (Professional, HCPCS, 99214, facility=False) — each gets its own ROID.

is_surg_code — 3 sources of truth:

A code is flagged surgical if it appears in ANY of:

  1. OPG (Outpatient Procedure Grouper) schedules from UHC, Aetna, Cigna, or BCBS CA
  2. HCPCS codes whose first digit is 1–6 (CMS numeric prefix convention for surgical procedures)
  3. MS-DRGs marked as type = 'SURG' in CMS MS-DRG weights table

APR-DRG — the SOI suffix pattern:

  1. Source: APR-DRG base codes (e.g., '302') from SPINES_SERVICES where service_type='APR-DRG'
  2. CROSS JOIN: Each code is crossed with ['-1', '-2', '-3', '-4'] to produce 4 rows
  3. Result: '302' → '302-1' (minor), '302-2' (moderate), '302-3' (major), '302-4' (extreme)
Result

~600 APR-DRG base codes × 4 SOI levels = ~2,400 APR-DRG rows in the code spine.

11 Code Sources

#SourceBill TypeProvider TypesManaged By
1All MS-DRG codes (excluding transplant + CAR-T/ECMO)InpatientHospitalssub-DAG (SPINES_SERVICES)
2APR-DRG × SOI suffix (-1 to -4)InpatientHospitals (NY/NJ markets)sub-DAG (SPINES_SERVICES)
3OUTPATIENT_CODESET (top ~5K HCPCS by Medicare revenue)OutpatientHospitalscld-utils DAG (date-stamped table)
4ASC_CODESET (ASC-payable procedure codes)OutpatientASCscld-utils DAG (date-stamped table)
5SSP line codes (Shoppable Service Packages)Outpatient + ProfessionalHospitalssub-DAG (SSP_SUB_PACKAGES)
6J1 HCPCS (OPPS status indicator J1)OutpatientHospitalssub-DAG (TMP_OPPS_MEDICARE_REFERENCE_TABLE)
7Life Sciences drug codes (ASP + OPPS G/K)Outpatient + ProfessionalAllsub-DAG (ASP_PRICING_TABLE, TMP_OPPS)
8aPG codeset — facility=TrueProfessionalPhysician Groupscld-utils DAG (date-stamped table)
8bPG codeset — facility=FalseProfessionalPhysician Groupscld-utils DAG (date-stamped table)
9Lab codeset (Labs + PGs)ProfessionalLabs + Hospitals + PGscld-utils DAG (LAB_HOSPITAL_AND_PGS_CODESET)
10Top 750 lab codes by Komodo revenueProfessionalLabssub-DAG (CLAIMS_MEDICAL_SERVICE_LINES)
11DME codesetProfessionalDME Supplierscld-utils DAG (date-stamped table)
Dialysis + Urgent Care codesetsProfessionalDialysis, Urgent Carecld-utils DAG (date-stamped tables)
info

cld-utils vs sub-DAG: Sources 3, 4, 8, 9, 11 are date-stamped tables (e.g., outpatient_codeset_2026_03_24) built by the cld-utils DAG. After a cld-utils rebuild, update the pointer in params.py. Sources 1, 2, 5, 6, 7, 10 are computed fresh each run from live reference tables.

Code Flags

FlagHow it's setPipeline impact
is_drug_codebilling_code appears in dpr_codes (ASP pricing table UNION OPPS G/K status)Drug dosage standardization in Transformations; ASP-based outlier bounds (0.8×–10× ASP); separate BRIT accuracy scoring path
is_lab_codebilling_code in LAB_HOSPITAL_AND_PGS_CODESET OR CLINICAL_LAB_REFERENCE_PRICINGLab outlier bounds (0.2×–4.5× Medicare fee schedule); lab-specific BRIT accuracy
is_surg_codeIn OPG schedules OR HCPCS prefix 1-6 OR MS-DRG type='SURG'OPG-based imputations in Imputations phase
is_device_dependentbilling_code in DEVICE_DEPENDENT_CODES tableSignals that rate may include implantable device cost
warning

MS-DRG exclusions: Transplant codes (001–017, 019) and CAR-T/ECMO codes (650–652) are excluded — too specialized, too few covered lives. MS-DRG 018 (CAR-T) IS included but restricted to designated CAR-T treatment centers via plausibility filter.

info

Drug codes are inserted twice: Life Sciences codes appear as both Outpatient and Professional bill_type. PG drug codes (in dpr_codes) are EXCLUDED from source 8 (PG codeset) to avoid double-counting — they're covered by source 7.