Version: 3.0

Automated QA & Validation

Clear Rates' Airflow DAGs run automated validation tasks between pipeline stages, not just on the final output. Each stage writes to a tmp_ table, then a dedicated QA task group asserts invariants on that table before the next stage reads it. If a validation fails, the downstream stage does not run.
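The gating behavior can be sketched in plain Python, without Airflow operators. Everything in this block is a hypothetical stand-in (`run_pipeline`, the stage names, the single `not_empty` check); it only illustrates the rule that a failed QA group stops the stages behind it. In the real DAG the same effect presumably comes from task dependencies, since Airflow's default trigger rule runs a task only when all of its upstream tasks succeeded.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a "stage" builds the rows of its tmp_ table, and a
# "check" returns True when its invariant holds on those rows.
Stage = Tuple[str, Callable[[Dict[str, list]], list]]
Check = Callable[[list], bool]


def run_pipeline(stages: List[Stage], qa: Dict[str, List[Check]]) -> List[str]:
    """Run stages in order and stop after the first stage whose QA fails.

    Returns the names of the stages that were built.
    """
    tables: Dict[str, list] = {}
    built: List[str] = []
    for name, build in stages:
        tables[name] = build(tables)  # the stage writes its tmp_ table
        built.append(name)
        if not all(check(tables[name]) for check in qa.get(name, [])):
            break  # QA failed: downstream stages never run
    return built


def not_empty(rows: list) -> bool:
    return len(rows) > 0


stages: List[Stage] = [
    ("tmp_raw", lambda t: [{"roid": 1}, {"roid": 2}]),
    ("tmp_main", lambda t: []),  # simulated regression: zero rows
    ("tmp_final", lambda t: t["tmp_main"]),
]
qa = {name: [not_empty] for name, _ in stages}

built = run_pipeline(stages, qa)
print(built)  # ['tmp_raw', 'tmp_main']
```

The empty `tmp_main` table is built, fails its check, and `tmp_final` is never attempted.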

This page is the methodology-oriented companion to the reference list of checks at QA → Automated Validation Tests. That page lists every check; this page explains why they exist, where in the pipeline they run, and how the bypass mechanism works.

Three Tiers of Validations

Validations are organized into three tiers of strictness. The tier controls whether the bypass_validations=true DAG conf flag can skip them.

| Tier | Purpose | Behavior under bypass_validations=true |
| --- | --- | --- |
| Structural | Filter-tolerant integrity invariants (non-empty, ROID uniqueness). Always satisfiable under any valid filter. | Always runs (bypass_exempt=True). |
| Coverage | Checks that specific entities (payers, networks, providers, bill types) have rates. Assumes prod-scale data. | Skipped. These are for prod builds. |
| Semantic | Value-range checks, cross-table reconciliation, trace-based rate reproduction. Assumes prod-scale data. | Skipped. These are for prod builds. |

The tier split is what makes narrow-filter test runs safe: a test run with 10 providers and 10 codes can set bypass_validations=true to quiet the coverage and semantic assertions that would legitimately fail at that scale, while still getting the structural tier as a backstop. If a PR change silently empties a table or introduces duplicates, the structural tier catches it regardless of filter shape.
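As an illustration of the tier split, the following sketch filters a list of checks by the flag. The `Validation` dataclass and `select_validations` function are hypothetical; only the check names, the `bypass_exempt` attribute, and the `bypass_validations` conf key come from this page. It also assumes the conf value arrives as a real boolean.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Validation:
    name: str
    tier: str  # "structural", "coverage", or "semantic"
    bypass_exempt: bool = False  # True only for the structural tier


def select_validations(all_checks: List[Validation], conf: dict) -> List[Validation]:
    """Return the checks that should run for the given DAG conf."""
    if conf.get("bypass_validations", False):
        # Bypass requested: keep only the checks that are exempt from it.
        return [c for c in all_checks if c.bypass_exempt]
    return list(all_checks)


CHECKS = [
    Validation("validation_table_not_empty", "structural", bypass_exempt=True),
    Validation("validation_roid_unique", "structural", bypass_exempt=True),
    Validation("validation_all_payers_have_rates", "coverage"),
    Validation("no_negative_rates", "semantic"),
]

narrow = select_validations(CHECKS, {"bypass_validations": True})
full = select_validations(CHECKS, {})
print([c.name for c in narrow])
print(len(full))
```

With the flag set, only the two structural checks remain; with an empty conf, all four run.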

Where Each Tier Runs

In the pipeline diagram, each arrow label names the list of validation tests that run before the downstream table is built. The lists live in qa/__init__.py and are wired into the DAG as explicit task dependencies.

Structural Tier (bypass-exempt)

These checks are cheap, run at every stage, and should always pass regardless of how narrow the filters are.

| Check | Asserts | Applied to |
| --- | --- | --- |
| validation_table_not_empty | COUNT(*) > 0 | Every stage |
| validation_roid_unique | COUNT(roid) == COUNT(DISTINCT roid) | Every stage whose table has a roid column (not spines) |

These catch the most common regression classes:

  • A SQL change that silently produces zero rows (not_empty).
  • A join that accidentally fans rows out, or a dedup step that drops the DISTINCT (roid_unique).

Because they're always-on, they also act as canaries for narrow-filter test runs — a PR can still fail the sub_dag if it breaks these even with bypass_validations=true.
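The two structural assertions can be tried against an in-memory SQLite table. This is a sketch under assumptions: the production checks run against the warehouse rather than SQLite, and the helper names and the `tmp_example` table are hypothetical. Only the two SQL predicates are taken from the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp_example (roid INTEGER, rate REAL)")
conn.executemany(
    "INSERT INTO tmp_example VALUES (?, ?)",
    [(1, 100.0), (2, 250.0), (2, 250.0)],  # roid 2 appears twice (fan-out)
)


def table_not_empty(conn: sqlite3.Connection, table: str) -> bool:
    """COUNT(*) > 0. The table name is interpolated, so pass trusted names only."""
    (n,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return n > 0


def roid_unique(conn: sqlite3.Connection, table: str) -> bool:
    """COUNT(roid) == COUNT(DISTINCT roid). Both counts ignore NULL roids."""
    total, distinct = conn.execute(
        f"SELECT COUNT(roid), COUNT(DISTINCT roid) FROM {table}"
    ).fetchone()
    return total == distinct


not_empty_ok = table_not_empty(conn, "tmp_example")
unique_ok = roid_unique(conn, "tmp_example")
print(not_empty_ok, unique_ok)  # True False
```

The duplicated roid makes the uniqueness check fail while the table still passes the non-empty check, which is the fan-out case described above.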

Coverage Tier

Coverage checks assert that specific entities appear with non-NULL rates. They assume prod-scale input:

  • validation_all_payers_have_rates — every payer in the spine has at least one non-NULL rate.
  • validation_all_networks_have_rates — same, for networks.
  • validation_most_hospitals_have_rates[_non_commercial] — at least 95% of hospitals have a non-NULL rate.
  • validation_{billing_code_types,bill_types,provider_types}_have_rates — each dimension's expected values are represented.
  • ros_{billing_code_types,bill_types,payers,networks,providers,provider_types} — the rate object space itself contains the expected entities.

These run at raw and main stages (see automated_validation_tests.md for the full placement matrix).
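A coverage check reduces to "what share of the spine's entities have at least one non-NULL rate". The sketch below shows that computation in plain Python. The `share_with_rates` helper and the sample data are hypothetical; the 95% threshold is the one stated for validation_most_hospitals_have_rates.

```python
from typing import Iterable, Optional, Set, Tuple


def share_with_rates(
    spine_ids: Set[str], rates: Iterable[Tuple[str, Optional[float]]]
) -> float:
    """Fraction of spine entities that have at least one non-NULL rate."""
    covered = {eid for eid, rate in rates if rate is not None and eid in spine_ids}
    return len(covered) / len(spine_ids) if spine_ids else 0.0


# Hypothetical data: 20 hospitals in the spine, 18 of them with a usable rate.
hospitals = {f"h{i}" for i in range(20)}
rates = [(f"h{i}", 1500.0) for i in range(18)] + [("h18", None)]

share = share_with_rates(hospitals, rates)
most_hospitals_ok = share >= 0.95  # threshold from the check description
print(round(share, 2), most_hospitals_ok)  # 0.9 False
```

A narrow-filter run can fail this kind of check even though the filtered tables are correct, because the spine and the rates may be filtered differently. That is the situation the bypass flag exists for.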

Semantic Tier

Semantic checks reconcile values across tables or reproduce rate computations from traceability metadata:

  • non_outlier_median_canonical_rate — median of non-outlier canonical rates falls within a sanity band (currently $1,000 – $4,000).
  • non_outlier_median_canonical_percentage_of_state_avg_medicare — same for percent of state-avg Medicare (currently 1.0 – 1.7).
  • no_negative_rates, no_rates_gtr_20m, non_outlier_coverage_gtr_30_pct (threshold guards; each name states the invariant it enforces).
  • raw_rate_types, transform_rate_types, impute_rate_types — pull trace_raw_id from prod_combined_all, look up the source row in core_rates or hospital_rates, and re-apply the rate's transformation or imputation formula. The recomputed value must match canonical_rate within tolerance.

The *_rate_types checks are the most comprehensive — they're the reason traceability metadata is tracked at all.
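The two kinds of semantic check can be sketched as follows. The actual transformation and imputation formulas are not given on this page, so `recompute` uses a purely hypothetical formula (`source_value * factor`), and the source rows, field names, and tolerance are illustrative assumptions. The median check assumes the input is already restricted to non-outlier rates and treats the band as inclusive.

```python
import math
from statistics import median
from typing import Dict, List


def median_in_band(rates: List[float], low: float = 1000.0, high: float = 4000.0) -> bool:
    """Sanity band on the median; bounds taken from the documented range."""
    return low <= median(rates) <= high


def recompute(source: Dict[str, float]) -> float:
    """HYPOTHETICAL formula standing in for a rate transformation."""
    return source["source_value"] * source["factor"]


# Hypothetical source rows keyed by trace_raw_id (core_rates / hospital_rates).
source_rows = {
    "r1": {"source_value": 3000.0, "factor": 0.5},
    "r2": {"source_value": 2000.0, "factor": 1.0},
}
# Hypothetical rows from prod_combined_all.
combined = [
    {"trace_raw_id": "r1", "canonical_rate": 1500.0},
    {"trace_raw_id": "r2", "canonical_rate": 2100.0},  # does not reproduce
]

mismatches = [
    row["trace_raw_id"]
    for row in combined
    if not math.isclose(
        recompute(source_rows[row["trace_raw_id"]]),
        row["canonical_rate"],
        rel_tol=1e-6,  # assumed tolerance
    )
]
band_ok = median_in_band([row["canonical_rate"] for row in combined])
print(mismatches, band_ok)  # ['r2'] True
```

The example shows why both checks are needed: the median sits inside the band, yet one rate cannot be reproduced from its source row.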

The bypass_validations Flag

Pass bypass_validations: true in the DAG conf to skip the coverage and semantic tiers. Typical uses:

  • PR testing: run the sub_dag with filter_payer_ids=[76, 42], filter_codes=[...10 codes...], bypass_validations=true to exercise the critical path without tripping coverage checks that a narrow filter cannot satisfy.
  • Debugging: temporarily skip a known-slow check while iterating.

Never set bypass_validations=true in prod runs. Structural-tier checks still run regardless of the flag, so the table-level invariants are always enforced.
