Automated QA & Validation
Clear Rates' Airflow DAGs run automated validation tasks between pipeline
stages, not just on the final output. Each stage writes to a tmp_ table,
then a dedicated QA task group asserts invariants on that table before the
next stage reads it. If a validation fails, the downstream stage does not
run.
This page is the methodology-oriented companion to the reference list of checks at QA → Automated Validation Tests. That page lists every check; this page explains why they exist, where in the pipeline they run, and how the bypass mechanism works.
Three Tiers of Validations
Validations are organized into three tiers of strictness. The tier controls
whether the bypass_validations=true DAG conf flag can skip them.
| Tier | Purpose | Behavior under bypass_validations=true |
|---|---|---|
| Structural | Filter-tolerant integrity invariants (non-empty, ROID uniqueness). Always satisfiable under any valid filter. | Always runs (bypass_exempt=True). |
| Coverage | Checks that specific entities (payers, networks, providers, bill types) have rates. Assumes prod-scale data. | Skipped. These are for prod builds. |
| Semantic | Value-range checks, cross-table reconciliation, trace-based rate reproduction. Assumes prod-scale data. | Skipped. These are for prod builds. |
The tier split is what makes narrow-filter test runs safe: a test run with
10 providers and 10 codes can set bypass_validations=true to quiet the
coverage and semantic assertions that would legitimately fail at that scale,
while still getting the structural tier as a backstop. If a PR change
silently empties a table or introduces duplicates, the structural tier
catches it regardless of filter shape.
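The selection rule in the table can be sketched in a few lines. This is a minimal illustration, not the project's code: the `Tier` enum, the `Check` dataclass, and `checks_to_run` are hypothetical names; only `bypass_exempt`, `bypass_validations`, and the check names come from this page.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    STRUCTURAL = "structural"
    COVERAGE = "coverage"
    SEMANTIC = "semantic"


@dataclass(frozen=True)
class Check:
    name: str
    tier: Tier
    bypass_exempt: bool = False  # True only for structural-tier checks


def checks_to_run(checks, bypass_validations):
    """Return the checks that still run for a given bypass_validations value."""
    if not bypass_validations:
        return list(checks)
    # Under bypass, only bypass-exempt (structural) checks survive.
    return [c for c in checks if c.bypass_exempt]


CHECKS = [
    Check("validation_table_not_empty", Tier.STRUCTURAL, bypass_exempt=True),
    Check("validation_roid_unique", Tier.STRUCTURAL, bypass_exempt=True),
    Check("validation_all_payers_have_rates", Tier.COVERAGE),
    Check("no_negative_rates", Tier.SEMANTIC),
]

bypassed = [c.name for c in checks_to_run(CHECKS, bypass_validations=True)]
full = [c.name for c in checks_to_run(CHECKS, bypass_validations=False)]
```

With the flag set, `bypassed` holds only the two structural checks; without it, `full` holds all four.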
Where Each Tier Runs
Each arrow label names the list of validation tests that run before
the downstream table is built. The lists live in qa/__init__.py and are
wired into the DAG as explicit task dependencies.
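The gating effect of those dependencies can be approximated without Airflow. The sketch below is a plain-Python stand-in, assuming each stage is a build function plus a list of checks; `run_stages`, `make_stage`, and the `final` stage name are hypothetical, and the real wiring uses Airflow task dependencies rather than a loop.

```python
def run_stages(stages):
    """Run (name, build, checks) triples in order.

    Returns (passed_stage_names, failure), where failure is None or a
    (stage_name, check_name) pair for the first check that failed.
    """
    passed = []
    for name, build, checks in stages:
        table = build()
        for check in checks:
            if not check(table):
                # Stop here: stages after a failed check are never built.
                return passed, (name, check.__name__)
        passed.append(name)
    return passed, None


def not_empty(rows):
    return len(rows) > 0


built = []  # records which stages actually ran


def make_stage(name, rows):
    def build():
        built.append(name)
        return rows
    return build


stages = [
    ("raw", make_stage("raw", [{"roid": 1}]), [not_empty]),
    ("main", make_stage("main", []), [not_empty]),  # simulated regression
    ("final", make_stage("final", [{"roid": 1}]), [not_empty]),
]
passed, failure = run_stages(stages)
```

Here `main` produces an empty table, so `final` is never built: `built` ends as `["raw", "main"]` and `failure` names the stage and check that stopped the run.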
Structural Tier (bypass-exempt)
These checks are cheap, run at every stage, and should always pass regardless of how narrow the filters are.
| Check | Asserts | Applied to |
|---|---|---|
| validation_table_not_empty | COUNT(*) > 0 | Every stage |
| validation_roid_unique | COUNT(roid) == COUNT(DISTINCT roid) | Every stage whose table has a roid column (not spines) |
These catch the most common regression classes:
- A SQL change that silently produces zero rows (not_empty).
- A join that accidentally fans rows out, or a dedup step that drops the DISTINCT (roid_unique).
Because they're always-on, they also act as canaries for narrow-filter
test runs — a PR can still fail the sub_dag if it breaks these even with
bypass_validations=true.
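Both structural assertions translate directly into SQL. The sketch below uses an in-memory SQLite database as a stand-in for the warehouse, which is an assumption; the table name `tmp_example` and the helper functions are hypothetical, and only the two asserted expressions come from the table above.

```python
import sqlite3


def table_not_empty(conn, table):
    """Assert COUNT(*) > 0. The table name is trusted, not user input."""
    (n,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return n > 0


def roid_unique(conn, table):
    """Assert COUNT(roid) == COUNT(DISTINCT roid)."""
    total, distinct = conn.execute(
        f"SELECT COUNT(roid), COUNT(DISTINCT roid) FROM {table}"
    ).fetchone()
    return total == distinct


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp_example (roid INTEGER, rate REAL)")
conn.executemany(
    "INSERT INTO tmp_example VALUES (?, ?)",
    [(1, 100.0), (2, 250.0), (2, 250.0)],  # roid 2 duplicated, as after a fan-out join
)

not_empty_ok = table_not_empty(conn, "tmp_example")
unique_ok = roid_unique(conn, "tmp_example")
```

The table has rows, so the non-empty check passes, while the duplicated roid makes the uniqueness check fail. Neither result depends on how many payers or codes the filter admitted.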
Coverage Tier
Coverage checks assert that specific entities appear with non-NULL rates. They assume prod-scale input:
- validation_all_payers_have_rates: every payer in the spine has at least one non-NULL rate.
- validation_all_networks_have_rates: same, for networks.
- validation_most_hospitals_have_rates[_non_commercial]: at least 95% of hospitals have a non-NULL rate.
- validation_{billing_code_types,bill_types,provider_types}_have_rates: each dimension's expected values are represented.
- ros_{billing_code_types,bill_types,payers,networks,providers,provider_types}: the rate object space itself contains the expected entities.
These run at raw and main stages (see
automated_validation_tests.md for
the full placement matrix).
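As a rough illustration of how a coverage assertion works, the sketch below computes the share of spine entities that have at least one non-NULL rate and compares it to a threshold. The function names and the `(entity_id, rate)` pair representation are assumptions made for the example; the 95% hospital threshold is the one quoted above.

```python
def share_with_rates(spine_ids, rates):
    """Fraction of spine entities with at least one non-NULL rate.

    `rates` is an iterable of (entity_id, rate) pairs; rate may be None.
    """
    spine = set(spine_ids)
    if not spine:
        return 0.0
    covered = {eid for eid, rate in rates if rate is not None and eid in spine}
    return len(covered) / len(spine)


def most_hospitals_have_rates(hospital_ids, rates, threshold=0.95):
    return share_with_rates(hospital_ids, rates) >= threshold


def all_payers_have_rates(payer_ids, rates):
    return share_with_rates(payer_ids, rates) == 1.0


hospitals = list(range(20))
rates_19 = [(h, 1500.0) for h in range(19)] + [(19, None)]  # 19 of 20 covered
rates_18 = [(h, 1500.0) for h in range(18)]                 # 18 of 20 covered

ok_at_95 = most_hospitals_have_rates(hospitals, rates_19)
fails_at_90 = most_hospitals_have_rates(hospitals, rates_18)
```

A row whose rate is NULL does not count toward coverage, which is why hospital 19 is excluded in the first case.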
Semantic Tier
Semantic checks reconcile values across tables or reproduce rate computations from traceability metadata:
- non_outlier_median_canonical_rate: median of non-outlier canonical rates falls within a sanity band (currently $1,000 – $4,000).
- non_outlier_median_canonical_percentage_of_state_avg_medicare: same for percent of state-avg Medicare (currently 1.0 – 1.7).
- no_negative_rates, no_rates_gtr_20m, non_outlier_coverage_gtr_30_pct.
- raw_rate_types, transform_rate_types, impute_rate_types: pull trace_raw_id from prod_combined_all, look up the source row in core_rates or hospital_rates, and re-apply the rate's transformation or imputation formula. The recomputed value must match canonical_rate within tolerance.
The *_rate_types checks are the most comprehensive — they're the reason
traceability metadata is tracked at all.
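The reproduction idea can be sketched as follows. Everything beyond the column names `trace_raw_id` and `canonical_rate` is an assumption: the formulas in `FORMULAS`, the `rate_type`, `rate`, and `multiplier` fields, and the tolerance are hypothetical placeholders, since the real formulas are not shown on this page.

```python
import math

# Hypothetical formulas keyed by rate type (placeholders, not the real ones).
FORMULAS = {
    "raw": lambda src: src["rate"],
    "transform": lambda src: src["rate"] * src["multiplier"],
}


def reproduce_mismatches(output_rows, source_by_id, rel_tol=1e-6):
    """Return trace_raw_ids whose recomputed rate does not match canonical_rate."""
    mismatches = []
    for row in output_rows:
        src = source_by_id.get(row["trace_raw_id"])
        if src is None:
            mismatches.append(row["trace_raw_id"])  # trace points at no source row
            continue
        recomputed = FORMULAS[row["rate_type"]](src)
        if not math.isclose(recomputed, row["canonical_rate"], rel_tol=rel_tol):
            mismatches.append(row["trace_raw_id"])
    return mismatches


source_by_id = {
    "a": {"rate": 1200.0},
    "b": {"rate": 100.0, "multiplier": 1.5},
}
output_rows = [
    {"trace_raw_id": "a", "rate_type": "raw", "canonical_rate": 1200.0},
    {"trace_raw_id": "b", "rate_type": "transform", "canonical_rate": 150.0},
    {"trace_raw_id": "b", "rate_type": "transform", "canonical_rate": 175.0},
    {"trace_raw_id": "c", "rate_type": "raw", "canonical_rate": 10.0},
]
mismatches = reproduce_mismatches(output_rows, source_by_id)
```

The first two rows reproduce exactly; the third disagrees with its source, and the fourth has a trace id with no source row, so both are reported.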
The bypass_validations Flag
Pass bypass_validations: true in the DAG conf to skip the coverage and
semantic tiers. Typical uses:
- PR testing: run the sub_dag with filter_payer_ids=[76, 42], filter_codes=[...10 codes...], bypass_validations=true to exercise the critical path without the narrow filter legitimately failing coverage checks.
- Debugging: temporarily skip a known-slow check while iterating.
Never set bypass_validations=true in prod runs. Structural-tier checks
still run regardless of the flag, so the table-level invariants are
always enforced.
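One way to make the prod rule hard to violate is to assemble the conf through a small helper that rejects the combination. This is a sketch under assumptions: `build_conf` and its `is_prod` argument are hypothetical, and the billing codes are placeholders rather than the ten codes a real test run would use. Only the conf keys come from this page.

```python
import json


def build_conf(filter_payer_ids=None, filter_codes=None,
               bypass_validations=False, is_prod=False):
    """Assemble a DAG conf dict, refusing to bypass validations in prod."""
    if is_prod and bypass_validations:
        raise ValueError("bypass_validations must not be set for prod runs")
    conf = {"bypass_validations": bypass_validations}
    if filter_payer_ids is not None:
        conf["filter_payer_ids"] = list(filter_payer_ids)
    if filter_codes is not None:
        conf["filter_codes"] = list(filter_codes)
    return conf


pr_conf = build_conf(
    filter_payer_ids=[76, 42],
    filter_codes=["99213", "99214"],  # placeholder codes, not from this page
    bypass_validations=True,
)
payload = json.dumps(pr_conf, sort_keys=True)  # JSON form of the DAG run conf
```

Calling `build_conf(bypass_validations=True, is_prod=True)` raises `ValueError`, so a prod trigger script built on the helper cannot silently skip the coverage and semantic tiers.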
See Also
- QA → Automated Validation Tests — the reference list of every check, keyed by table.
- Accuracy → Score Hierarchy — how accuracy scores interact with the rate-selection invariant the QA checks assume.
- Output → Traceability — the trace columns that the
*_rate_types reproduction checks rely on.