Data Quality Checks

Guarding the pipeline

Silent bad data is worse than a loud failure: a failed job gets noticed and fixed, while a wrong dashboard goes unnoticed and erodes trust. Data quality checks are automated assertions that run inside the pipeline and either stop it or raise an alert when data violates expectations.

What to check

Freshness: did the latest partition arrive on time?
Completeness: are row counts within an expected range, with no missing days?
Validity: do values match the expected types, ranges, and allowed sets (for example, no negative prices)?
Uniqueness: are primary keys actually unique?
Referential integrity: do foreign keys point to existing rows?
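
The five checks above can be sketched as small predicate functions. This is a minimal illustration in plain Python, not a reference to any particular data quality library. The table shape (a list of dicts), the column names, and the thresholds are all assumptions made for the example.

```python
from datetime import date, datetime, timedelta, timezone

def check_freshness(latest_partition_time, now, max_lag):
    """Freshness: the newest partition is no older than max_lag."""
    return now - latest_partition_time <= max_lag

def check_row_count(row_count, min_rows, max_rows):
    """Completeness: the row count falls inside the expected range."""
    return min_rows <= row_count <= max_rows

def missing_days(present_days, start, end):
    """Completeness: calendar days in [start, end] that have no data."""
    n = (end - start).days + 1
    expected = (start + timedelta(days=i) for i in range(n))
    return [d for d in expected if d not in present_days]

def check_validity(rows):
    """Validity: every price is present and non-negative."""
    return all(r["price"] is not None and r["price"] >= 0 for r in rows)

def check_uniqueness(rows, key):
    """Uniqueness: no two rows share the same key value."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))

def check_referential_integrity(rows, fk, parent_keys):
    """Referential integrity: every foreign key exists in the parent table."""
    return all(r[fk] in parent_keys for r in rows)

# Hypothetical sample data: a duplicate order_id, a negative price,
# and a customer_id that does not exist in the parent table.
orders = [
    {"order_id": 1, "customer_id": 10, "price": 19.99},
    {"order_id": 2, "customer_id": 11, "price": 5.00},
    {"order_id": 2, "customer_id": 99, "price": -3.50},
]
customer_ids = {10, 11}
now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
latest = datetime(2024, 1, 2, 1, 0, tzinfo=timezone.utc)

results = {
    "freshness": check_freshness(latest, now, timedelta(hours=24)),
    "completeness": check_row_count(len(orders), 1, 1000),
    "validity": check_validity(orders),
    "uniqueness": check_uniqueness(orders, "order_id"),
    "referential_integrity": check_referential_integrity(
        orders, "customer_id", customer_ids
    ),
}
gaps = missing_days(
    {date(2024, 1, 1), date(2024, 1, 3)}, date(2024, 1, 1), date(2024, 1, 3)
)
```

Each function returns a plain boolean (or a list of gaps), so the results can be logged, stored, or used to decide whether the pipeline continues.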

Where to enforce

Place checks at layer boundaries, such as between silver and gold, so bad data does not flow downstream. A common pattern is write-audit-publish: write new data to a staging area, run the audits against it, and publish to the consumer-facing table only if they pass.
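
The staged flow can be sketched in a few lines. This is a simplified in-memory illustration: the lists standing in for the staging and published tables, the function name write_audit_publish, and the two audits are hypothetical, and a real pipeline would use table snapshots, branches, or partition swaps instead of list copies.

```python
def write_audit_publish(new_rows, staging, published, audits):
    """Write to staging, audit it, and publish only if every audit passes.

    Returns the names of failed audits (an empty list means published).
    """
    # Write: new data lands in staging, never directly in the published table.
    staging.clear()
    staging.extend(new_rows)

    # Audit: run every named check against the staged data.
    failures = [name for name, audit in audits.items() if not audit(staging)]
    if failures:
        return failures  # published table is left untouched

    # Publish: replace the consumer-facing data with the audited batch.
    published.clear()
    published.extend(staging)
    return []

# Hypothetical audits for a price table.
audits = {
    "non_empty": lambda rows: len(rows) > 0,
    "no_negative_prices": lambda rows: all(r["price"] >= 0 for r in rows),
}

staging = []
gold = [{"sku": "A", "price": 1.0}]

bad = write_audit_publish([{"sku": "B", "price": -2.0}], staging, gold, audits)
gold_after_bad = list(gold)  # still the previously published rows

good = write_audit_publish([{"sku": "C", "price": 3.0}], staging, gold, audits)
```

The point of the pattern is that consumers only ever read the published table, so a failed batch stays in staging where it can be inspected without affecting anyone.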

Acting on failures

Block the publish for critical checks.
Warn and continue for soft anomalies.
Track results over time to spot slow drift.
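
One way to express these three responses in code is to attach a severity to each check result. The sketch below is an assumption-laden example: the CheckResult type, the severity labels "critical" and "warn", and the drift_ratio helper are invented for illustration, and comparing the latest value to the mean of earlier runs is only one simple way to look for drift.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class CheckResult:
    name: str
    passed: bool
    severity: str  # "critical" blocks the publish, "warn" only alerts

def decide(results, history):
    """Record results, then decide whether the publish may proceed."""
    history.extend(results)  # keep every result so trends can be reviewed later
    failed = [r for r in results if not r.passed]
    blocking = [r.name for r in failed if r.severity == "critical"]
    warnings = [r.name for r in failed if r.severity == "warn"]
    return {"publish": not blocking, "blocking": blocking, "warnings": warnings}

def drift_ratio(values):
    """Latest value of a tracked metric divided by the mean of earlier runs."""
    return values[-1] / mean(values[:-1])

history = []
run = [
    CheckResult("unique_order_id", False, "critical"),
    CheckResult("row_count_in_range", False, "warn"),
    CheckResult("partition_is_fresh", True, "critical"),
]
decision = decide(run, history)

# Hypothetical daily row counts: each run looks fine alone, the trend does not.
ratio = drift_ratio([1000, 990, 980, 900])
```

Here the failed critical check blocks the publish, the failed soft check is reported as a warning only, and the stored history makes a gradual decline visible even when no single run crosses a threshold.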

Key idea

Automated freshness, completeness, and validity checks at layer boundaries catch bad data before it reaches consumers.
