QA DAG
The core_licensable_data_qa DAG runs automated QA checks at each stage of the CLD sub-DAG build. It is triggered by the sub-DAG via TriggerDagRunOperator(wait_for_completion=False), so it runs in parallel without blocking the build.
Architecture
The sub-DAG triggers the QA DAG five times during a build, once per checkpoint. Each trigger passes a conf dict containing the checkpoint name, version, schema, and filter parameters. The QA DAG uses @task.short_circuit gates so that only the checks relevant to the active checkpoint run.
Reports are uploaded to S3 as markdown files.
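The gating described above can be sketched as a plain predicate. This is a minimal sketch, not the DAG's actual code: the function name should_run and the FINAL_GATE set are hypothetical. In the DAG, a function like this would presumably be wrapped with @task.short_circuit, which skips downstream tasks when the function returns a falsy value.

```python
def should_run(gate_checkpoints, conf):
    """Return True when the active checkpoint belongs to this gate.

    A falsy return value is what makes a short-circuit task skip
    everything downstream of it.
    """
    active = conf.get("checkpoint")
    return active in gate_checkpoints


# Hypothetical gate: the last two checkpoints share one set of checks.
FINAL_GATE = {"int_combined_no_whisp", "int_combined"}

conf = {"checkpoint": "int_combined_raw", "version": "vX_X"}
print(should_run({"int_combined_raw"}, conf))  # True
print(should_run(FINAL_GATE, conf))            # False
```

Keeping the predicate free of Airflow imports makes it easy to unit test outside a scheduler.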
Checkpoints
| Checkpoint | Triggered After | Table Checked |
|---|---|---|
| rate_object_space | rate_object_space_validations | tmp_rate_object_space_{sv} |
| int_combined_raw | combined_raw_validations | tmp_int_combined_raw_{sv} |
| int_combined_brit | accuracy_brit_validations | tmp_int_combined_brit_{sv} |
| int_combined_no_whisp | main_no_whisp_combined_validations | tmp_int_combined_no_whisp_{sv} |
| int_combined | main_combined_validations | tmp_int_combined_{sv} |
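As an illustration of how a checkpoint name might resolve to its table, here is a small sketch. The table templates are copied from the table above; the helper name table_for is hypothetical, and treating {sv} as the sub_version parameter is an assumption.

```python
# Checkpoint -> table name template, copied from the Checkpoints table.
CHECKPOINT_TABLES = {
    "rate_object_space": "tmp_rate_object_space_{sv}",
    "int_combined_raw": "tmp_int_combined_raw_{sv}",
    "int_combined_brit": "tmp_int_combined_brit_{sv}",
    "int_combined_no_whisp": "tmp_int_combined_no_whisp_{sv}",
    "int_combined": "tmp_int_combined_{sv}",
}


def table_for(checkpoint, sub_version):
    """Resolve the table checked at a checkpoint.

    Assumes {sv} stands for the sub_version parameter.
    """
    try:
        template = CHECKPOINT_TABLES[checkpoint]
    except KeyError:
        raise ValueError(f"unknown checkpoint: {checkpoint!r}") from None
    return template.format(sv=sub_version)


print(table_for("int_combined_brit", "2026_01"))
# tmp_int_combined_brit_2026_01
```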
Checks
rate_object_space
- Row distribution: row counts by provider_type and bill_type, with distinct provider/payer/network/code counts per segment.
- Spine coverage: total distinct providers, payers, networks, and codes in the rate object space.
int_combined_raw
- Overview: total rows vs. rated rows by provider_type. A row is "rated" if it has at least one raw rate value (payer negotiated, hospital fee schedule, or hospital case rate).
- Spine coverage: distinct entities with at least one rated row, broken down by provider/payer/network/code.
- Rate family fill: percentage of rows with at least one non-null value in each rate family (hospital_raw, payer_raw, nontransformed).
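The rate family fill metric can be illustrated in memory. This is only a sketch of the arithmetic: the real checks run as Jinja2 SQL templates, and the column names and the column-to-family assignment below are hypothetical (only the family names come from this document).

```python
# Hypothetical column lists per rate family. The family names appear in
# this document; the column names and groupings are made up.
RATE_FAMILIES = {
    "hospital_raw": ["hospital_fee_schedule_rate", "hospital_case_rate"],
    "payer_raw": ["payer_negotiated_rate"],
}


def rate_family_fill(rows, families):
    """Percent of rows with at least one non-null column per family."""
    total = len(rows)
    result = {}
    for family, columns in families.items():
        filled = sum(
            1 for row in rows
            if any(row.get(col) is not None for col in columns)
        )
        result[family] = 100.0 * filled / total if total else 0.0
    return result


rows = [
    {"payer_negotiated_rate": 120.0},
    {"hospital_fee_schedule_rate": 95.0},
    {},  # unrated row: every rate column is missing
    {"payer_negotiated_rate": 80.0, "hospital_fee_schedule_rate": 70.0},
]
print(rate_family_fill(rows, RATE_FAMILIES))
# {'hospital_raw': 50.0, 'payer_raw': 50.0}
```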
int_combined_brit
Same checks as int_combined_raw, with the rate family fill check also covering these rate families: hospital_transformation, payer_transformation, derived_imputation, cstm_imputation.
int_combined_no_whisp / int_combined
- Score distribution: count and percentage of rows per canonical_rate_score. Flags if score=0 (no rate) exceeds 80% or if non-outlier rated rows (score >= 4) fall below 10%.
- Network coverage: networks ranked by non-outlier rated row count. Flags networks with 0 non-outlier rated rows.
- Rate source distribution: breakdown by canonical_method_rate_type for rated rows.
- Provider & code coverage: distinct providers and codes with non-outlier rates, broken down by provider_type and billing_code_type.
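The score distribution flags can be sketched as follows. The 80% and 10% thresholds and the score >= 4 cutoff come from the description above; the function name, constant names, and flag labels are hypothetical, and the actual values presumably live in utils/thresholds.py.

```python
from collections import Counter

NO_RATE_MAX_PCT = 80.0      # flag if the score=0 share exceeds this
NON_OUTLIER_MIN_PCT = 10.0  # flag if the score>=4 share falls below this
NON_OUTLIER_MIN_SCORE = 4


def score_distribution(scores):
    """Return (percent of rows per score, list of flag labels)."""
    total = len(scores)
    counts = Counter(scores)
    pct = {s: 100.0 * n / total for s, n in sorted(counts.items())}
    no_rate = pct.get(0, 0.0)
    non_outlier = sum(p for s, p in pct.items() if s >= NON_OUTLIER_MIN_SCORE)
    flags = []
    if no_rate > NO_RATE_MAX_PCT:
        flags.append("no_rate_above_threshold")
    if non_outlier < NON_OUTLIER_MIN_PCT:
        flags.append("non_outlier_below_threshold")
    return pct, flags


# Nine unrated rows and one non-outlier rated row.
pct, flags = score_distribution([0] * 9 + [5])
print(pct)    # {0: 90.0, 5: 10.0}
print(flags)  # ['no_rate_above_threshold']
```

Both comparisons are strict, so a share of exactly 80% or exactly 10% does not raise a flag in this sketch; whether the real checks treat the boundary the same way is not stated here.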
S3 Reports
Reports are uploaded to:
s3://airflow-internal-teams-assets/data-science/chansoo/clear_rates_qa_reports/
    cld_{version}_{payer_filter}_{timestamp}/
        {checkpoint}_report.md
Each report includes a header with version, schema, checkpoint, provider types, and payer filter metadata.
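The report location can be assembled as in the sketch below. The prefix and path layout are copied from above; the function name report_uri, the timestamp format, and the "all_payers" label used for an empty payer filter are assumptions for illustration.

```python
from datetime import datetime, timezone

# Prefix copied from this document.
REPORT_PREFIX = (
    "s3://airflow-internal-teams-assets/data-science/chansoo/"
    "clear_rates_qa_reports/"
)


def report_uri(version, payer_filter, checkpoint, now=None):
    """Build the S3 URI for one checkpoint report."""
    now = now or datetime.now(timezone.utc)
    timestamp = now.strftime("%Y%m%dT%H%M%S")  # assumed timestamp format
    run_dir = f"cld_{version}_{payer_filter}_{timestamp}"
    return f"{REPORT_PREFIX}{run_dir}/{checkpoint}_report.md"


fixed = datetime(2026, 1, 15, 12, 0, 0, tzinfo=timezone.utc)
print(report_uri("vX_X", "all_payers", "int_combined", now=fixed))
# s3://airflow-internal-teams-assets/data-science/chansoo/clear_rates_qa_reports/cld_vX_X_all_payers_20260115T120000/int_combined_report.md
```

Accepting now as an argument keeps the function deterministic in tests.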
Sub-DAG Integration
The sub-DAG attaches the QA trigger tasks as leaf nodes downstream of each validation TaskGroup:
qa_conf = build_qa_conf.override(task_id="qa_conf_{checkpoint}")("{checkpoint}")
trigger_qa = TriggerDagRunOperator(
    task_id="trigger_qa_{checkpoint}",
    trigger_dag_id="core_licensable_data_qa",
    wait_for_completion=False,
    conf=qa_conf,
)
validation_task_group >> qa_conf >> trigger_qa
The build_qa_conf task extracts parameters from the sub-DAG's dag_run.conf and builds the conf dict for the QA DAG.
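A minimal sketch of the conf-building step, written as a plain function so it runs without Airflow. The key names follow the DAG Parameters section, but which keys are actually forwarded, and the name build_qa_conf_dict, are assumptions.

```python
# Keys copied from the sub-DAG conf into the QA conf. The names follow
# the DAG Parameters section; the exact forwarded set is assumed.
FORWARDED_KEYS = (
    "version",
    "sub_version",
    "schema_name",
    "db_conn_name",
    "provider_types",
    "filter_payer_ids",
)


def build_qa_conf_dict(checkpoint, parent_conf):
    """Build the conf passed to the QA DAG for one checkpoint."""
    conf = {k: parent_conf[k] for k in FORWARDED_KEYS if k in parent_conf}
    conf["checkpoint"] = checkpoint
    return conf


parent_conf = {"version": "vX_X", "sub_version": "2026_01", "unrelated": 1}
print(build_qa_conf_dict("int_combined_raw", parent_conf))
# {'version': 'vX_X', 'sub_version': '2026_01', 'checkpoint': 'int_combined_raw'}
```

Keys missing from the parent conf are left out of the result in this sketch, which would let the QA DAG's own parameter defaults apply.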
DAG Parameters
| Parameter | Description | Default |
|---|---|---|
| checkpoint | Which checkpoint to run | rate_object_space |
| version | CLD version suffix | vX_X |
| sub_version | Month suffix for table names | 2026_01 |
| schema_name | Schema prefix | tq_dev.internal_dev_csong_cld_ |
| db_conn_name | Trino connection | trino_default |
| provider_types | List of provider types | all 8 types |
| filter_payer_ids | Payer ID filter | [] (no filter) |
Code
core_licensable_data_qa/
├── __init__.py # DAG definition + checkpoint gates
├── tasks/
│ ├── qa_checks.py # Check functions (one per checkpoint)
│ └── report.py # Assemble markdown + upload to S3
├── sql/ # Jinja2 SQL templates per checkpoint
│ ├── rate_object_space/
│ ├── int_combined_raw/
│ ├── int_combined_brit/
│ ├── int_combined/
│ └── networks/
└── utils/
    ├── columns.py # Rate column lists per checkpoint
    └── thresholds.py # Flag thresholds