Guarding the pipeline
Silent bad data is worse than a loud failure: a failed job gets noticed and fixed, while wrong numbers on a dashboard go unnoticed and erode trust. Data quality checks are automated assertions that run in the pipeline and stop or alert when data violates expectations.
What to check
- Freshness: did the latest partition arrive on time?
- Completeness: are row counts within an expected range, with no missing days?
- Validity: do values match the expected types, ranges, and allowed sets (for example, no negative prices)?
- Uniqueness: are primary keys actually unique?
- Referential integrity: do foreign keys point to existing rows?
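The five checks above can be sketched as small predicate functions. This is a minimal illustration over in-memory rows, not a reference implementation: the function names, the `price` and `currency` fields, and the thresholds passed in are all assumptions made for the example.

```python
from datetime import date, datetime, timedelta


def check_freshness(latest_partition: datetime, now: datetime, max_lag: timedelta) -> bool:
    """Freshness: the newest partition is no older than the allowed lag."""
    return now - latest_partition <= max_lag


def check_completeness(rows_per_day: dict, min_rows: int, max_rows: int) -> bool:
    """Completeness: no gaps between the first and last day, and every
    daily row count falls inside [min_rows, max_rows]."""
    if not rows_per_day:
        return False
    days = sorted(rows_per_day)
    expected_days = (days[-1] - days[0]).days + 1
    no_gaps = len(days) == expected_days
    in_range = all(min_rows <= n <= max_rows for n in rows_per_day.values())
    return no_gaps and in_range


def check_validity(rows: list, allowed_currencies: set) -> bool:
    """Validity: price is a non-negative number and currency is in the allowed set.
    The 'price' and 'currency' field names are hypothetical."""
    return all(
        isinstance(r["price"], (int, float))
        and r["price"] >= 0
        and r["currency"] in allowed_currencies
        for r in rows
    )


def check_uniqueness(rows: list, key: str) -> bool:
    """Uniqueness: no two rows share the same primary key value."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))


def check_referential_integrity(rows: list, fk: str, parent_keys: set) -> bool:
    """Referential integrity: every foreign key exists in the parent table."""
    return all(r[fk] in parent_keys for r in rows)
```

In a real pipeline these would more likely run as SQL queries or through a data quality framework, but the logic of each assertion stays the same.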
Where to enforce
Place checks at layer boundaries, such as between silver and gold, so bad data does not flow downstream. A common pattern is write-audit-publish: write new data to a staging area, run the audits against it, and publish to the consumer-facing table only if they all pass.
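The write, audit, publish sequence can be sketched as follows. This is a toy version under stated assumptions: the staging area and the published table are plain Python lists, and `write_audit_publish` and the audit names are hypothetical. In practice the publish step would typically be an atomic operation in the storage layer, such as a table swap or a partition commit.

```python
def write_audit_publish(rows: list, audits: dict, staging: list, published: list) -> list:
    """Run the three steps and return the names of failed audits.

    audits maps a check name to a function that takes the staged rows
    and returns True when the check passes.
    """
    # 1. Write: land the new data in staging, invisible to consumers.
    staging.clear()
    staging.extend(rows)

    # 2. Audit: run every check against the staged data.
    failed = [name for name, audit in audits.items() if not audit(staging)]

    # 3. Publish: replace the consumer-facing data only if nothing failed.
    if not failed:
        published[:] = staging
    return failed


# Example audits (hypothetical names and fields).
example_audits = {
    "non_empty": lambda rows: len(rows) > 0,
    "no_negative_prices": lambda rows: all(r["price"] >= 0 for r in rows),
}
```

When an audit fails, the published list keeps its previous contents, so consumers continue to read the last good version while the staged data is investigated.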
Acting on failures
- Block the publish for critical checks.
- Warn and continue for soft anomalies.
- Track results over time to spot slow drift.
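One way to express these three responses in code is to attach a severity to each check result. The sketch below is illustrative only: `Severity`, `CheckResult`, `decide`, and `failure_rate` are hypothetical names, and the history is an in-memory list standing in for whatever store would hold check results in a real pipeline.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"  # a failure blocks the publish
    SOFT = "soft"          # a failure only produces a warning


@dataclass(frozen=True)
class CheckResult:
    name: str
    severity: Severity
    passed: bool


def decide(results: list, history: list) -> tuple:
    """Record the results, then return (may_publish, warnings)."""
    # Track every result, passing or failing, so drift can be analysed later.
    history.extend(results)
    warnings = [
        r.name for r in results
        if not r.passed and r.severity is Severity.SOFT
    ]
    blocked = any(
        not r.passed and r.severity is Severity.CRITICAL for r in results
    )
    return (not blocked, warnings)


def failure_rate(history: list, name: str) -> float:
    """Share of recorded runs in which the named check failed."""
    runs = [r for r in history if r.name == name]
    if not runs:
        return 0.0
    return sum(not r.passed for r in runs) / len(runs)
```

A failure rate that creeps upward for a soft check is one signal of slow drift, and can justify tightening the check or promoting it to critical.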
Key idea
Automated freshness, completeness, and validity checks at layer boundaries catch bad data before it reaches consumers.