System Design

Data Quality Checks

Asserting freshness, completeness, and validity so bad data is caught early.

5 min read · advanced

Guarding the pipeline

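Silent bad data is worse than a loud failure: a failed job gets noticed and fixed, while a wrong dashboard keeps being read until someone catches the error, and trust erodes. Data quality checks are automated assertions that run inside the pipeline and either stop the run or raise an alert when data violates expectations.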

What to check

  • Freshness: did the latest partition arrive on time?
  • Completeness: are row counts within an expected range, with no missing days?
  • Validity: do values match the expected types, ranges, and allowed sets (for example, no negative prices)?
  • Uniqueness: are primary keys actually unique?
  • Referential integrity: do foreign keys point to existing rows?
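
The five checks above can be sketched as small predicate functions. This is a minimal illustration in plain Python, not a specific library's API: the function names, the `orders` rows, the allowed currency set, and the six-hour lag threshold are all assumptions made up for the example.

```python
from datetime import date, datetime, timedelta, timezone

def check_freshness(latest_partition_ts, now, max_lag):
    """Freshness: the newest partition is no older than max_lag."""
    return now - latest_partition_ts <= max_lag

def check_completeness(row_count, min_rows, max_rows):
    """Completeness: the row count falls inside the expected range."""
    return min_rows <= row_count <= max_rows

def missing_days(partition_dates, start, end):
    """Completeness: list the days in [start, end] that have no partition."""
    expected = {start + timedelta(days=i) for i in range((end - start).days + 1)}
    return sorted(expected - set(partition_dates))

def check_validity(rows, allowed_currencies):
    """Validity: price is a non-negative number, currency is in the allowed set."""
    return all(
        isinstance(r["price"], (int, float))
        and r["price"] >= 0
        and r["currency"] in allowed_currencies
        for r in rows
    )

def check_uniqueness(rows, key):
    """Uniqueness: no two rows share the same key value."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))

def check_referential_integrity(rows, foreign_key, parent_keys):
    """Referential integrity: every foreign key exists in the parent table."""
    return all(r[foreign_key] in parent_keys for r in rows)

# Hypothetical sample batch; the third row is deliberately broken.
orders = [
    {"order_id": 1, "customer_id": 10, "price": 19.99, "currency": "USD"},
    {"order_id": 2, "customer_id": 11, "price": 5.00, "currency": "EUR"},
    {"order_id": 2, "customer_id": 99, "price": -3.50, "currency": "USD"},
]
customer_ids = {10, 11}
now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
latest_partition = datetime(2024, 1, 2, 1, 0, tzinfo=timezone.utc)

results = {
    "freshness": check_freshness(latest_partition, now, timedelta(hours=6)),
    "completeness": check_completeness(len(orders), 1, 1000),
    "validity": check_validity(orders, {"USD", "EUR"}),
    "uniqueness": check_uniqueness(orders, "order_id"),
    "referential_integrity": check_referential_integrity(
        orders, "customer_id", customer_ids
    ),
}
gaps = missing_days([date(2024, 1, 1), date(2024, 1, 3)],
                    date(2024, 1, 1), date(2024, 1, 3))
```

In this sample the batch is fresh and within its row-count range, but the third row fails validity (negative price), uniqueness (duplicate `order_id`), and referential integrity (unknown `customer_id`). In a real pipeline these predicates would more likely be expressed as SQL queries or rules in a data quality tool; the logic stays the same.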

Where to enforce

Place checks at layer boundaries, such as between the silver and gold layers, so bad data does not flow downstream. A common pattern is write-audit-publish: write the new data to a staging area that consumers cannot read, run the audits against it, and publish only if they pass.
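
The write, audit, and publish steps can be sketched as follows. This is a simplified in-memory illustration: the `tables` dict stands in for a warehouse, and `write_audit_publish`, `AuditFailed`, the `__staging` suffix, and the two audits are hypothetical names chosen for the example.

```python
class AuditFailed(Exception):
    """Raised when one or more audits reject the staged batch."""

def write_audit_publish(tables, target, rows, audits):
    """Stage rows, audit them, and swap them into target only if all audits pass."""
    staging_name = f"{target}__staging"

    # 1. Write: land the batch where consumers do not read from.
    tables[staging_name] = list(rows)

    # 2. Audit: run every check against the staged copy.
    failures = [name for name, audit in audits.items()
                if not audit(tables[staging_name])]
    if failures:
        # The staged batch is left in place for debugging; target is untouched.
        raise AuditFailed(", ".join(failures))

    # 3. Publish: replace the consumer-facing table with the audited batch.
    tables[target] = tables.pop(staging_name)

# Hypothetical audits: each takes the staged rows and returns True or False.
audits = {
    "non_empty": lambda rows: len(rows) > 0,
    "no_negative_prices": lambda rows: all(r["price"] >= 0 for r in rows),
}

tables = {"gold.orders": [{"order_id": 1, "price": 10.0}]}

try:
    write_audit_publish(tables, "gold.orders",
                        [{"order_id": 2, "price": -5.0}], audits)
except AuditFailed as exc:
    rejected_by = str(exc)  # the published table still holds the old rows

write_audit_publish(tables, "gold.orders",
                    [{"order_id": 2, "price": 5.0}], audits)
```

The point of the pattern is that consumers only ever see data that has already passed its audits. In a real system the publish step would typically be an atomic operation such as a partition swap or a table rename, rather than a dict assignment.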

Acting on failures

  • Block the publish for critical checks.
  • Warn and continue for soft anomalies.
  • Track results over time to spot slow drift.
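
One possible way to wire these three responses together is sketched below. The severity labels (`"critical"`, `"warn"`), the `CheckResult` type, and the drift heuristic (comparing the mean of the oldest and newest values in a stored history) are assumptions for illustration, not a standard.

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    passed: bool
    severity: str  # "critical" blocks the publish, "warn" only alerts

def decide(results):
    """Return (publish, blocked_by, warnings) for one pipeline run."""
    blocked_by = [r.name for r in results
                  if not r.passed and r.severity == "critical"]
    warnings = [r.name for r in results
                if not r.passed and r.severity == "warn"]
    return (not blocked_by, blocked_by, warnings)

def relative_drift(history, window):
    """Relative change between the mean of the first and last `window` values."""
    if len(history) < 2 * window:
        return 0.0  # not enough history to compare two separate windows
    old = sum(history[:window]) / window
    new = sum(history[-window:]) / window
    return (new - old) / old if old else 0.0

results = [
    CheckResult("primary_key_unique", passed=False, severity="critical"),
    CheckResult("row_count_in_range", passed=False, severity="warn"),
    CheckResult("partition_is_fresh", passed=True, severity="critical"),
]
publish, blocked_by, warnings = decide(results)

# Daily row counts that shrink about 2% per day: no single day looks alarming,
# but the change across the stored history is larger.
row_count_history = [1000, 980, 960, 940, 920, 900]
drift = relative_drift(row_count_history, window=2)
drifted = abs(drift) > 0.05  # hypothetical 5% tolerance
```

Here the failed critical check blocks the publish, the failed soft check only produces a warning, and the stored row counts reveal a decline that a day-over-day comparison with the same tolerance would miss.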

Key idea

Automated freshness, completeness, and validity checks at layer boundaries catch bad data before it reaches consumers.

Check yourself

1. What does a freshness check verify?

2. What does the write-audit-publish pattern do?

3. Why catch bad data at layer boundaries?