Version: 3.0

QA DAG

The core_licensable_data_qa DAG runs automated QA checks at each stage of the core licensable data (CLD) sub-DAG build. The sub-DAG triggers it via TriggerDagRunOperator(wait_for_completion=False), so QA runs in parallel with the build and never blocks it.

Architecture

The sub-DAG triggers the QA DAG five times during a build, once per checkpoint. Each trigger passes a conf dict containing the checkpoint name, version, schema, and filter parameters. The QA DAG uses @task.short_circuit gates so that each run executes only the checks relevant to its active checkpoint.
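The gating decision can be sketched in plain Python. This is a minimal illustration rather than the DAG's actual code: the `GATES` mapping and the `gate_passes` and `active_gates` helpers are hypothetical names, and the grouping of checkpoints per gate is assumed from the Checks section. In the DAG itself, a function like `gate_passes` would presumably be wrapped with @task.short_circuit, so that a False return skips the downstream checks.

```python
# Hypothetical mapping from gate name to the checkpoints it serves.
# The grouping is an assumption based on the Checks section.
GATES = {
    "rate_object_space": {"rate_object_space"},
    "int_combined_raw": {"int_combined_raw"},
    "int_combined_brit": {"int_combined_brit"},
    "int_combined": {"int_combined_no_whisp", "int_combined"},
}


def gate_passes(gate: str, conf: dict) -> bool:
    """Return True when the gate's checks apply to the active checkpoint."""
    return conf.get("checkpoint") in GATES[gate]


def active_gates(conf: dict) -> list:
    """List the gates that would let their downstream checks run."""
    return [gate for gate in GATES if gate_passes(gate, conf)]
```

For a conf of `{"checkpoint": "int_combined_no_whisp"}`, only the `int_combined` gate passes, so every other group of checks is skipped for that run.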

Each checkpoint's report is uploaded to S3 as a Markdown file.

Checkpoints

| Checkpoint | Triggered After | Table Checked |
| --- | --- | --- |
| `rate_object_space` | `rate_object_space_validations` | `tmp_rate_object_space_{sv}` |
| `int_combined_raw` | `combined_raw_validations` | `tmp_int_combined_raw_{sv}` |
| `int_combined_brit` | `accuracy_brit_validations` | `tmp_int_combined_brit_{sv}` |
| `int_combined_no_whisp` | `main_no_whisp_combined_validations` | `tmp_int_combined_no_whisp_{sv}` |
| `int_combined` | `main_combined_validations` | `tmp_int_combined_{sv}` |
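Every checked table follows the same naming pattern, which the sketch below captures. It assumes that `{sv}` stands for the `sub_version` parameter (for example `2026_01`); the `checked_table` helper is a hypothetical name, and the schema prefix is deliberately left out because the table above does not show it.

```python
CHECKPOINTS = (
    "rate_object_space",
    "int_combined_raw",
    "int_combined_brit",
    "int_combined_no_whisp",
    "int_combined",
)


def checked_table(checkpoint: str, sv: str) -> str:
    """Return the temp table name for a checkpoint (schema prefix not included)."""
    if checkpoint not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {checkpoint!r}")
    # Assumption: sv is the sub_version value used as the table suffix.
    return f"tmp_{checkpoint}_{sv}"
```

Rejecting unknown checkpoint names early keeps a typo in the conf dict from turning into a query against a table that does not exist.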

Checks

rate_object_space

Row distribution — row counts by provider_type and bill_type, with distinct provider/payer/network/code counts per segment.
Spine coverage — total distinct providers, payers, networks, and codes in the rate object space.
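To make the row distribution check concrete, here is an in-memory sketch of the same aggregation. The real check runs as SQL against the checkpoint table; this Python version only illustrates the grouping. The column names `provider_id`, `payer_id`, `network_id`, and `billing_code` are assumptions, as is the `row_distribution` helper.

```python
from collections import defaultdict

# Hypothetical column names for the four spine entities.
ENTITY_COLUMNS = {
    "providers": "provider_id",
    "payers": "payer_id",
    "networks": "network_id",
    "codes": "billing_code",
}


def row_distribution(rows):
    """Count rows and distinct entities per (provider_type, bill_type) segment."""
    row_counts = defaultdict(int)
    distinct = defaultdict(lambda: {name: set() for name in ENTITY_COLUMNS})
    for row in rows:
        segment = (row["provider_type"], row["bill_type"])
        row_counts[segment] += 1
        for name, column in ENTITY_COLUMNS.items():
            distinct[segment][name].add(row[column])
    return {
        segment: {
            "rows": row_counts[segment],
            **{name: len(values) for name, values in distinct[segment].items()},
        }
        for segment in row_counts
    }
```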

int_combined_raw

Overview — total rows vs. rated rows by provider_type. A row is "rated" if it has at least one raw rate value (payer negotiated, hospital fee schedule, or hospital case rate).
Spine coverage — distinct entities with at least one rated row, broken down by provider/payer/network/code.
Rate family fill — percentage of rows with at least one non-null value per rate family: hospital_raw, payer_raw, nontransformed.
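The rate family fill calculation can be sketched as follows. The column names in `RATE_FAMILIES` are invented for illustration (per the Code section, the real lists live in utils/columns.py), and `rate_family_fill` is a hypothetical helper that works on rows held in memory rather than on the Trino table.

```python
# Hypothetical column lists per rate family; the real lists live in
# utils/columns.py and may differ.
RATE_FAMILIES = {
    "hospital_raw": ["hospital_fee_schedule", "hospital_case_rate"],
    "payer_raw": ["payer_negotiated_rate"],
}


def rate_family_fill(rows, families):
    """Percentage of rows with at least one non-null column per rate family."""
    total = len(rows)
    fill = {}
    for family, columns in families.items():
        filled = sum(
            1 for row in rows
            if any(row.get(column) is not None for column in columns)
        )
        # An empty table reports 0% fill instead of dividing by zero.
        fill[family] = 100.0 * filled / total if total else 0.0
    return fill
```

A row counts toward a family as soon as any one of that family's columns is non-null, which matches the "at least one non-null value" wording above.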

int_combined_brit

The same checks as int_combined_raw, with the rate family fill check extended to four additional rate families: hospital_transformation, payer_transformation, derived_imputation, cstm_imputation.

int_combined_no_whisp / int_combined

Score distribution — count and percentage of rows per canonical_rate_score. Flags the checkpoint if rows with score = 0 (no rate) exceed 80% of rows, or if non-outlier rated rows (score >= 4) fall below 10% of rows.
Network coverage — networks ranked by non-outlier rated row count. Flags networks with 0 non-outlier rated rows.
Rate source distribution — breakdown by canonical_method_rate_type for rated rows.
Provider & code coverage — distinct providers and codes with non-outlier rates, broken down by provider_type and billing_code_type.
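The score distribution flags can be illustrated with a short sketch. The 80% and 10% thresholds come from the description above; the constant names, the flag labels, and the assumption that both shares are measured against all rows in the checkpoint table are illustrative only.

```python
from collections import Counter

NO_RATE_MAX_SHARE = 0.80      # flag when the score == 0 share exceeds this
NON_OUTLIER_MIN_SHARE = 0.10  # flag when the score >= 4 share falls below this
NON_OUTLIER_MIN_SCORE = 4


def score_distribution(scores):
    """Map each score to (row count, percentage of all rows)."""
    total = len(scores)
    counts = Counter(scores)
    return {
        score: (count, 100.0 * count / total)
        for score, count in sorted(counts.items())
    }


def score_flags(scores):
    """Return hypothetical flag labels for a list of canonical_rate_score values."""
    total = len(scores)
    if total == 0:
        return []  # nothing to evaluate
    no_rate_share = sum(1 for s in scores if s == 0) / total
    non_outlier_share = sum(1 for s in scores if s >= NON_OUTLIER_MIN_SCORE) / total
    flags = []
    if no_rate_share > NO_RATE_MAX_SHARE:
        flags.append("no_rate_share_high")
    if non_outlier_share < NON_OUTLIER_MIN_SHARE:
        flags.append("non_outlier_share_low")
    return flags
```

Both comparisons are strict, so a table sitting exactly at 80% unrated or exactly at 10% non-outlier rows is not flagged in this sketch.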

S3 Reports

Reports are uploaded to:

s3://airflow-internal-teams-assets/data-science/chansoo/clear_rates_qa_reports/
  cld_{version}_{payer_filter}_{timestamp}/
    {checkpoint}_report.md

Each report includes a header with version, schema, checkpoint, provider types, and payer filter metadata.
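A sketch of how the report location and header might be assembled is shown below. The path layout follows the pattern above; the rendering of `payer_filter` and `timestamp` is not specified here, so both are taken as ready-made strings. The `report_uri` and `report_header` helpers and the header layout are hypothetical.

```python
REPORT_ROOT = (
    "s3://airflow-internal-teams-assets/data-science/chansoo/clear_rates_qa_reports"
)


def report_uri(version: str, payer_filter: str, timestamp: str, checkpoint: str) -> str:
    """Build the S3 URI for one checkpoint report."""
    run_dir = f"cld_{version}_{payer_filter}_{timestamp}"
    return f"{REPORT_ROOT}/{run_dir}/{checkpoint}_report.md"


def report_header(conf: dict) -> str:
    """Render the metadata header as Markdown (layout is an assumption)."""
    payers = conf.get("filter_payer_ids") or []
    lines = [
        f"# QA report: {conf['checkpoint']}",
        "",
        f"- version: {conf['version']}",
        f"- schema: {conf['schema_name']}",
        f"- checkpoint: {conf['checkpoint']}",
        f"- provider types: {', '.join(conf['provider_types'])}",
        f"- payer filter: {', '.join(str(p) for p in payers) or 'none'}",
    ]
    return "\n".join(lines)
```

All five reports from one build land in the same run directory as long as the trigger passes the same version, payer filter, and timestamp to each QA run.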

Sub-DAG Integration

The sub-DAG triggers the QA DAG from leaf tasks attached to each validation TaskGroup, so no build task depends on the trigger:

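from airflow.operators.trigger_dagrun import TriggerDagRunOperator  # Airflow 2.x import path

# `checkpoint` is the checkpoint name, e.g. "int_combined_raw".
# Build the QA conf dict for this checkpoint.
qa_conf = build_qa_conf.override(task_id=f"qa_conf_{checkpoint}")(checkpoint)
# Fire-and-forget trigger: the build does not wait for the QA run.
trigger_qa = TriggerDagRunOperator(
    task_id=f"trigger_qa_{checkpoint}",
    trigger_dag_id="core_licensable_data_qa",
    wait_for_completion=False,
    conf=qa_conf,
)
# Leaf nodes: nothing downstream in the build depends on these tasks.
validation_task_group >> qa_conf >> trigger_qa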

The build_qa_conf task extracts parameters from the sub-DAG's dag_run.conf and builds the conf dict for the QA DAG.
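The extraction step might look like the plain-Python sketch below. It assumes that the sub-DAG's dag_run.conf uses the same key names as the QA DAG parameters, which is not confirmed here; `QA_CONF_KEYS` is a hypothetical constant, and the real build_qa_conf is an Airflow task that reads dag_run.conf from the task context rather than taking it as an argument.

```python
# Hypothetical list of keys forwarded to the QA DAG; assumes the sub-DAG
# conf uses the same names as the QA DAG parameters.
QA_CONF_KEYS = (
    "version",
    "sub_version",
    "schema_name",
    "db_conn_name",
    "provider_types",
    "filter_payer_ids",
)


def build_qa_conf(checkpoint: str, parent_conf: dict) -> dict:
    """Build the conf dict passed to the QA DAG for one checkpoint."""
    conf = {"checkpoint": checkpoint}
    for key in QA_CONF_KEYS:
        # Keys missing from the parent conf are omitted so that the QA DAG's
        # own parameter defaults apply.
        if key in parent_conf:
            conf[key] = parent_conf[key]
    return conf
```

Forwarding only a fixed set of keys keeps unrelated build settings out of the QA run's conf.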

DAG Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `checkpoint` | Which checkpoint to run | `rate_object_space` |
| `version` | CLD version suffix | `vX_X` |
| `sub_version` | Month suffix for table names | `2026_01` |
| `schema_name` | Schema prefix | `tq_dev.internal_dev_csong_cld_` |
| `db_conn_name` | Trino connection | `trino_default` |
| `provider_types` | List of provider types | all 8 types |
| `filter_payer_ids` | Payer ID filter | `[]` (no filter) |

Code

core_licensable_data_qa/
├── __init__.py          # DAG definition + checkpoint gates
├── tasks/
│   ├── qa_checks.py     # Check functions (one per checkpoint)
│   └── report.py        # Assemble markdown + upload to S3
├── sql/                 # Jinja2 SQL templates per checkpoint
│   ├── rate_object_space/
│   ├── int_combined_raw/
│   ├── int_combined_brit/
│   ├── int_combined/
│   └── networks/
└── utils/
    ├── columns.py       # Rate column lists per checkpoint
    └── thresholds.py    # Flag thresholds
