Version: 3.0

QA DAG

The core_licensable_data_qa DAG runs automated QA checks at each stage of the CLD sub-DAG build. It is triggered by the sub-DAG via TriggerDagRunOperator(wait_for_completion=False), so it runs in parallel without blocking the build.

Architecture

The sub-DAG triggers the QA DAG 5 times during a build, once per checkpoint. Each trigger passes a conf dict containing the checkpoint name, version, schema, and filter parameters. The QA DAG uses @task.short_circuit gates to run only the checks relevant to the active checkpoint.
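The gating logic can be sketched in plain Python. This is a minimal illustration, not the DAG's actual code: the function name `is_active_checkpoint` and the `CHECKPOINTS` tuple are hypothetical, and in the DAG a function like this would presumably be decorated with `@task.short_circuit` so that a False return skips the downstream check tasks.

```python
# Checkpoint names as listed in the Checkpoints section; the tuple is illustrative.
CHECKPOINTS = (
    "rate_object_space",
    "int_combined_raw",
    "int_combined_brit",
    "int_combined_no_whisp",
    "int_combined",
)


def is_active_checkpoint(gate_checkpoint: str, conf: dict) -> bool:
    """Return True only when the triggering run asked for this gate's checkpoint.

    Hypothetical helper: wrapped with Airflow's @task.short_circuit, a False
    return value would cause the tasks downstream of the gate to be skipped.
    """
    active = conf.get("checkpoint")
    if active not in CHECKPOINTS:
        raise ValueError(f"Unknown checkpoint: {active!r}")
    return active == gate_checkpoint


# Example conf as it might arrive from the sub-DAG trigger (values are made up).
conf = {"checkpoint": "int_combined_brit", "version": "vX_X"}
gates = {name: is_active_checkpoint(name, conf) for name in CHECKPOINTS}
```

With this conf, only the `int_combined_brit` gate evaluates to True, so only that checkpoint's checks would run.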

Reports are uploaded to S3 as markdown files.

Checkpoints

| Checkpoint | Triggered After | Table Checked |
|---|---|---|
| rate_object_space | rate_object_space_validations | tmp_rate_object_space_{sv} |
| int_combined_raw | combined_raw_validations | tmp_int_combined_raw_{sv} |
| int_combined_brit | accuracy_brit_validations | tmp_int_combined_brit_{sv} |
| int_combined_no_whisp | main_no_whisp_combined_validations | tmp_int_combined_no_whisp_{sv} |
| int_combined | main_combined_validations | tmp_int_combined_{sv} |
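The checked table names are templates with an `{sv}` placeholder. A minimal sketch of resolving them is shown here; `TABLE_TEMPLATES` and `resolve_table` are hypothetical names, and filling `{sv}` with a month-style suffix such as `2026_01` is an assumption rather than something this page states.

```python
# Table templates copied from the Checkpoints table above.
TABLE_TEMPLATES = {
    "rate_object_space": "tmp_rate_object_space_{sv}",
    "int_combined_raw": "tmp_int_combined_raw_{sv}",
    "int_combined_brit": "tmp_int_combined_brit_{sv}",
    "int_combined_no_whisp": "tmp_int_combined_no_whisp_{sv}",
    "int_combined": "tmp_int_combined_{sv}",
}


def resolve_table(checkpoint: str, sv: str) -> str:
    """Fill the {sv} placeholder for the given checkpoint's table (hypothetical helper)."""
    try:
        template = TABLE_TEMPLATES[checkpoint]
    except KeyError:
        raise ValueError(f"Unknown checkpoint: {checkpoint!r}") from None
    return template.format(sv=sv)


# The suffix value here is assumed for illustration.
table = resolve_table("int_combined_raw", "2026_01")
```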

Checks

rate_object_space

  • Row distribution — row counts by provider_type and bill_type, with distinct provider/payer/network/code counts per segment.
  • Spine coverage — total distinct providers, payers, networks, and codes in the rate object space.
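The two aggregations above can be illustrated in memory. The real checks run as SQL templates against the database; this sketch only mirrors their shape. The entity column names (`provider_id`, `payer_id`, `network_id`, `billing_code`) and the sample rows are assumptions made for the example.

```python
from collections import defaultdict

# Assumed entity column names; the actual table columns may differ.
ENTITY_COLUMNS = ("provider_id", "payer_id", "network_id", "billing_code")


def row_distribution(rows):
    """Row count and distinct entity counts per (provider_type, bill_type) segment."""
    counts = defaultdict(int)
    distinct = defaultdict(lambda: {col: set() for col in ENTITY_COLUMNS})
    for row in rows:
        segment = (row["provider_type"], row["bill_type"])
        counts[segment] += 1
        for col in ENTITY_COLUMNS:
            distinct[segment][col].add(row[col])
    return {
        segment: {
            "rows": counts[segment],
            **{col: len(values) for col, values in distinct[segment].items()},
        }
        for segment in counts
    }


def spine_coverage(rows):
    """Total distinct entities across the whole table."""
    return {col: len({row[col] for row in rows}) for col in ENTITY_COLUMNS}


# Made-up sample data.
rows = [
    {"provider_type": "Hospital", "bill_type": "Inpatient",
     "provider_id": 1, "payer_id": 10, "network_id": 100, "billing_code": "470"},
    {"provider_type": "Hospital", "bill_type": "Inpatient",
     "provider_id": 2, "payer_id": 10, "network_id": 100, "billing_code": "470"},
    {"provider_type": "ASC", "bill_type": "Outpatient",
     "provider_id": 3, "payer_id": 11, "network_id": 101, "billing_code": "99213"},
]
dist = row_distribution(rows)
coverage = spine_coverage(rows)
```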

int_combined_raw

  • Overview — total rows vs. rated rows by provider_type. A row is "rated" if it has at least one raw rate value (payer negotiated, hospital fee schedule, or hospital case rate).
  • Spine coverage — distinct entities with at least one rated row, broken down by provider/payer/network/code.
  • Rate family fill — percentage of rows with at least one non-null value per rate family: hospital_raw, payer_raw, nontransformed.
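The "rated" definition and the rate family fill percentage can be sketched as follows. The column names and the grouping of columns into families are hypothetical placeholders (per the Code section, the real lists live in `utils/columns.py`); only the family names and the definition of "rated" come from the text above.

```python
# Hypothetical column names grouped by rate family.
RATE_FAMILIES = {
    "hospital_raw": ("hospital_fee_schedule_rate", "hospital_case_rate"),
    "payer_raw": ("payer_negotiated_rate",),
    "nontransformed": ("nontransformed_rate",),
}

# A row is "rated" if any raw rate value is present (assumed column names).
RAW_RATE_COLUMNS = RATE_FAMILIES["hospital_raw"] + RATE_FAMILIES["payer_raw"]


def is_rated(row):
    """True if the row has at least one non-null raw rate value."""
    return any(row.get(col) is not None for col in RAW_RATE_COLUMNS)


def family_fill(rows, families):
    """Percent of rows with at least one non-null column, per rate family."""
    total = len(rows)
    if total == 0:
        return {name: 0.0 for name in families}
    return {
        name: 100.0 * sum(any(row.get(col) is not None for col in cols) for row in rows) / total
        for name, cols in families.items()
    }


# Made-up sample rows; missing keys are treated as null.
rows = [
    {"payer_negotiated_rate": 100.0},
    {"hospital_fee_schedule_rate": 50.0},
    {},
    {"hospital_case_rate": 200.0, "payer_negotiated_rate": 150.0},
]
rated_rows = sum(is_rated(row) for row in rows)
fill = family_fill(rows, RATE_FAMILIES)
```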

int_combined_brit

Runs the same checks as int_combined_raw, with the rate family fill check extended to these additional rate families: hospital_transformation, payer_transformation, derived_imputation, cstm_imputation.

int_combined_no_whisp / int_combined

  • Score distribution — count and percentage of rows per canonical_rate_score. Flags if score=0 (no rate) exceeds 80% or if non-outlier rated rows (score >= 4) fall below 10%.
  • Network coverage — networks ranked by non-outlier rated row count. Flags networks with 0 non-outlier rated rows.
  • Rate source distribution — breakdown by canonical_method_rate_type for rated rows.
  • Provider & code coverage — distinct providers and codes with non-outlier rates, broken down by provider_type and billing_code_type.
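The score distribution and network coverage flags can be sketched as below, reading the 80% and 10% thresholds as shares of all rows. The constant and function names are hypothetical (the actual thresholds are kept in `utils/thresholds.py`), and the sample data is made up.

```python
from collections import Counter

MAX_NO_RATE_PCT = 80.0      # flag if the share of score=0 rows exceeds this
MIN_NON_OUTLIER_PCT = 10.0  # flag if the share of score>=4 rows falls below this


def score_distribution(scores):
    """Map each canonical_rate_score to (row count, percentage of rows)."""
    total = len(scores)
    counts = Counter(scores)
    return {score: (n, 100.0 * n / total) for score, n in sorted(counts.items())}


def score_flags(scores):
    """Return the names of any threshold flags raised by the distribution."""
    total = len(scores)
    if total == 0:
        return ["empty_table"]
    no_rate_pct = 100.0 * sum(s == 0 for s in scores) / total
    non_outlier_pct = 100.0 * sum(s >= 4 for s in scores) / total
    flags = []
    if no_rate_pct > MAX_NO_RATE_PCT:
        flags.append("no_rate_share_high")
    if non_outlier_pct < MIN_NON_OUTLIER_PCT:
        flags.append("non_outlier_share_low")
    return flags


def networks_without_non_outlier_rates(rows):
    """rows: iterable of (network, score) pairs; returns networks with 0 rows at score >= 4."""
    non_outlier = Counter()
    for network, score in rows:
        non_outlier[network] += int(score >= 4)
    return sorted(name for name, count in non_outlier.items() if count == 0)


scores = [0] * 9 + [5]  # 90% unrated, 10% non-outlier rated
distribution = score_distribution(scores)
flags = score_flags(scores)
empty_networks = networks_without_non_outlier_rates(
    [("A", 5), ("A", 0), ("B", 0), ("B", 2)]
)
```

In this example the unrated share (90%) trips the first flag, while the non-outlier share sits exactly at 10% and therefore does not fall below the second threshold.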

S3 Reports

Reports are uploaded to:

s3://airflow-internal-teams-assets/data-science/chansoo/clear_rates_qa_reports/
  cld_{version}_{payer_filter}_{timestamp}/
    {checkpoint}_report.md

Each report includes a header with version, schema, checkpoint, provider types, and payer filter metadata.
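A minimal sketch of assembling the report location and header follows. The path layout comes from the template above; the function names, the timestamp format, the `"all"` payer filter label, and the exact header wording are assumptions for illustration only.

```python
from datetime import datetime, timezone

REPORT_ROOT = (
    "s3://airflow-internal-teams-assets/data-science/chansoo/clear_rates_qa_reports"
)


def report_uri(version, payer_filter, timestamp, checkpoint):
    """Build the S3 URI following the documented path template."""
    return f"{REPORT_ROOT}/cld_{version}_{payer_filter}_{timestamp}/{checkpoint}_report.md"


def report_header(version, schema, checkpoint, provider_types, payer_filter):
    """Render the metadata header as markdown (layout is an assumption)."""
    lines = [
        f"# QA report: {checkpoint}",
        "",
        f"- Version: {version}",
        f"- Schema: {schema}",
        f"- Checkpoint: {checkpoint}",
        f"- Provider types: {', '.join(provider_types)}",
        f"- Payer filter: {payer_filter}",
    ]
    return "\n".join(lines)


# Assumed timestamp format; the real DAG may use a different one.
timestamp = datetime(2026, 1, 15, 12, 0, 0, tzinfo=timezone.utc).strftime("%Y%m%dT%H%M%S")
uri = report_uri("vX_X", "all", timestamp, "int_combined")
header = report_header("vX_X", "example_schema", "int_combined", ["Hospital", "ASC"], "all")
```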

Sub-DAG Integration

The sub-DAG triggers the QA DAG as leaf nodes off each validation TaskGroup:

qa_conf = build_qa_conf.override(task_id="qa_conf_{checkpoint}")("{checkpoint}")
trigger_qa = TriggerDagRunOperator(
    task_id="trigger_qa_{checkpoint}",
    trigger_dag_id="core_licensable_data_qa",
    wait_for_completion=False,
    conf=qa_conf,
)
validation_task_group >> qa_conf >> trigger_qa

The build_qa_conf task extracts parameters from the sub-DAG's dag_run.conf and builds the conf dict for the QA DAG.
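What the body of such a task might do can be sketched in plain Python. This is an illustration under assumptions, not the actual implementation: it assumes the sub-DAG's conf uses the same key names as the QA DAG parameters, and that omitted keys fall back to the QA DAG's own defaults. The `QA_CONF_KEYS` tuple is hypothetical.

```python
# Assumed to match the QA DAG parameter names.
QA_CONF_KEYS = (
    "version",
    "sub_version",
    "schema_name",
    "db_conn_name",
    "provider_types",
    "filter_payer_ids",
)


def build_qa_conf(checkpoint: str, parent_conf: dict) -> dict:
    """Copy the QA-relevant keys from the sub-DAG conf and add the checkpoint.

    Keys missing from parent_conf are left out so the receiving DAG can
    apply its own defaults (an assumption about the intended behavior).
    """
    conf = {key: parent_conf[key] for key in QA_CONF_KEYS if key in parent_conf}
    conf["checkpoint"] = checkpoint
    return conf


# Made-up sub-DAG conf; "unrelated" stands in for build-only settings.
parent_conf = {
    "version": "vX_X",
    "sub_version": "2026_01",
    "filter_payer_ids": [],
    "unrelated": True,
}
qa_conf = build_qa_conf("int_combined_raw", parent_conf)
```

Passing only a fixed set of keys keeps build-only settings out of the QA run's conf.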

DAG Parameters

| Parameter | Description | Default |
|---|---|---|
| checkpoint | Which checkpoint to run | rate_object_space |
| version | CLD version suffix | vX_X |
| sub_version | Month suffix for table names | 2026_01 |
| schema_name | Schema prefix | tq_dev.internal_dev_csong_cld_ |
| db_conn_name | Trino connection | trino_default |
| provider_types | List of provider types | all 8 types |
| filter_payer_ids | Payer ID filter | [] (no filter) |

Code

core_licensable_data_qa/
├── __init__.py           # DAG definition + checkpoint gates
├── tasks/
│   ├── qa_checks.py      # Check functions (one per checkpoint)
│   └── report.py         # Assemble markdown + upload to S3
├── sql/                  # Jinja2 SQL templates per checkpoint
│   ├── rate_object_space/
│   ├── int_combined_raw/
│   ├── int_combined_brit/
│   ├── int_combined/
│   └── networks/
└── utils/
    ├── columns.py        # Rate column lists per checkpoint
    └── thresholds.py     # Flag thresholds