Machine Learning

Data Validation and Schemas

Catching bad data with expectations before it reaches the model.

Garbage data produces garbage predictions, often with no error message. Data validation checks incoming data against expectations before that data is used to train or serve a model.

What a schema declares

A schema describes the expected shape of each feature:

  • The type, such as integer or string.
  • The allowed range or set of valid categories.
  • Whether a value may be missing.
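
As a rough illustration, the three properties above could be written down as plain Python data. This is only a sketch: the FeatureSpec class, its field names, and the example features (age, country, referrer) are hypothetical and not taken from any particular validation library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical per-feature declaration: type, allowed values, nullability."""
    dtype: type                              # expected Python type of the value
    min_value: Optional[float] = None        # lower bound, if the feature is numeric
    max_value: Optional[float] = None        # upper bound, if the feature is numeric
    categories: Optional[frozenset] = None   # allowed categories, if categorical
    nullable: bool = False                   # whether a missing value is acceptable


# Illustrative schema for three made-up features.
SCHEMA = {
    "age": FeatureSpec(dtype=int, min_value=0, max_value=120),
    "country": FeatureSpec(dtype=str, categories=frozenset({"US", "DE", "JP"})),
    "referrer": FeatureSpec(dtype=str, nullable=True),
}
```

Keeping the declaration as data rather than scattered if-statements means the same schema can be read by people, checked by code, and compared between versions.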

Validation in action

At each pipeline run the data is checked against the schema. Violations raise alerts:

  • A numeric feature suddenly full of nulls signals an upstream outage.
  • A new, previously unseen category may mean the source system changed.
  • A value drifting far outside its historical range hints at a unit change or bug.
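
The checks above can be sketched in a few lines of plain Python. The schema layout, the validate_row function, and the feature names (temperature_c, device) are assumptions made up for this example; a real pipeline would more likely rely on a dedicated validation library and route the violations to an alerting system.

```python
# Hypothetical schema: each feature maps to its expected type and constraints.
SCHEMA = {
    "temperature_c": {"type": float, "range": (-50.0, 60.0), "nullable": False},
    "device": {"type": str, "categories": {"ios", "android"}, "nullable": False},
}


def validate_row(row, schema):
    """Return a list of human-readable violations for one record."""
    violations = []
    for name, spec in schema.items():
        value = row.get(name)
        if value is None:
            # Missing values are only a problem for non-nullable features.
            if not spec["nullable"]:
                violations.append(f"{name}: unexpected null")
            continue
        if not isinstance(value, spec["type"]):
            violations.append(f"{name}: wrong type {type(value).__name__}")
            continue
        if "range" in spec:
            low, high = spec["range"]
            if not low <= value <= high:
                violations.append(f"{name}: {value} outside [{low}, {high}]")
        if "categories" in spec and value not in spec["categories"]:
            violations.append(f"{name}: unseen category {value!r}")
    return violations


rows = [
    {"temperature_c": 21.5, "device": "ios"},      # clean record
    {"temperature_c": None, "device": "android"},  # null in a required feature
    {"temperature_c": 70.7, "device": "web"},      # out of range and new category
]
report = [validate_row(row, SCHEMA) for row in rows]
```

Here report holds one list per row: empty for the clean record, one violation for the null, and two for the last row. Returning violations instead of raising on the first one lets a single run surface every problem at once.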

Schema evolution

Schemas are not frozen. As products change, features legitimately gain new categories or shift ranges. The goal is to distinguish expected evolution from real breakage, so teams review and update schemas deliberately rather than silently widening them to make alerts disappear.
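
One way to make that review deliberate is to diff a proposed schema against the current one and list every place it becomes more permissive. The sketch below is a hypothetical helper, not a standard tool: the widening_changes function and the plan and price features are invented for illustration, and it only covers added categories and enlarged ranges.

```python
def widening_changes(old, new):
    """List the ways `new` accepts data that `old` would have rejected."""
    changes = []
    # Only features present in both versions are compared in this sketch.
    for name in sorted(old.keys() & new.keys()):
        old_spec, new_spec = old[name], new[name]
        if "categories" in old_spec and "categories" in new_spec:
            added = new_spec["categories"] - old_spec["categories"]
            if added:
                changes.append(f"{name}: new categories {sorted(added)}")
        if "range" in old_spec and "range" in new_spec:
            old_range, new_range = old_spec["range"], new_spec["range"]
            # A lower minimum or a higher maximum both widen the range.
            if new_range[0] < old_range[0] or new_range[1] > old_range[1]:
                changes.append(f"{name}: range {old_range} -> {new_range}")
    return changes


# Hypothetical current and proposed schema versions.
V1 = {"plan": {"categories": {"free", "pro"}}, "price": {"range": (0.0, 100.0)}}
V2 = {"plan": {"categories": {"free", "pro", "team"}}, "price": {"range": (0.0, 250.0)}}

changes = widening_changes(V1, V2)
for change in changes:
    print("needs review:", change)
```

A non-empty result could block an automated merge until someone confirms that the new category or range reflects a real product change rather than a bug upstream.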

Key idea

A schema encodes expected types, ranges, and categories so validation catches broken data before it silently corrupts a model.

Check yourself

1. What does a feature schema typically declare?

2. Why review schema changes instead of auto-widening them?