Data Validation and Schemas
Garbage data produces garbage predictions, often without any error message. Data validation checks incoming data against explicit expectations before that data is used to train or serve a model.
What a schema declares
A schema describes the expected shape of each feature:
- The type, such as integer or string.
- The allowed range or set of valid categories.
- Whether a value may be missing.
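As a minimal sketch, the three declarations above can be written down as a small Python structure. The names FeatureSpec and SCHEMA, and the example features age, country, and referrer, are hypothetical and chosen only for illustration; real validation libraries use their own schema formats.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical per-feature declaration: type, range or categories, nullability."""
    dtype: type                              # the expected type, e.g. int or str
    min_value: Optional[float] = None        # lower bound of the allowed range, if any
    max_value: Optional[float] = None        # upper bound of the allowed range, if any
    categories: Optional[frozenset] = None   # set of valid categories, if any
    nullable: bool = False                   # whether a value may be missing


# Illustrative schema; the feature names and bounds are assumptions, not real data.
SCHEMA = {
    "age": FeatureSpec(dtype=int, min_value=0, max_value=120),
    "country": FeatureSpec(dtype=str, categories=frozenset({"US", "DE", "JP"})),
    "referrer": FeatureSpec(dtype=str, nullable=True),
}
```

Keeping the schema as data rather than as scattered if-statements makes it possible to review, version, and compare it like any other artifact.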
Validation in action
At each pipeline run, the data is checked against the schema, and violations raise alerts:
- A numeric feature suddenly full of nulls signals an upstream outage.
- A new, previously unseen category may mean the source system changed.
- A value drifting far outside its historical range hints at a unit change or bug.
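The three kinds of violation listed above can be illustrated with a small, self-contained checker. This is a sketch under stated assumptions, not a production validator: FeatureSpec, validate_batch, and the sensor data are all hypothetical, and a real pipeline would typically send the violations to an alerting system rather than return strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical per-feature declaration used only for this example."""
    dtype: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    categories: Optional[frozenset] = None
    nullable: bool = False


def validate_batch(rows, schema):
    """Check every row against the schema and return a list of violation messages."""
    violations = []
    for i, row in enumerate(rows):
        for name, spec in schema.items():
            value = row.get(name)
            if value is None:
                # Missing value: only a violation if the feature is not nullable.
                if not spec.nullable:
                    violations.append(f"row {i}: {name} is missing")
                continue
            if not isinstance(value, spec.dtype):
                violations.append(
                    f"row {i}: {name} has type {type(value).__name__}, "
                    f"expected {spec.dtype.__name__}"
                )
                continue
            if spec.categories is not None and value not in spec.categories:
                violations.append(f"row {i}: {name} has unseen category {value!r}")
            if spec.min_value is not None and value < spec.min_value:
                violations.append(f"row {i}: {name}={value} below minimum {spec.min_value}")
            if spec.max_value is not None and value > spec.max_value:
                violations.append(f"row {i}: {name}={value} above maximum {spec.max_value}")
    return violations


# Illustrative schema and batch; all names and values are made up.
schema = {
    "temperature_c": FeatureSpec(float, min_value=-50.0, max_value=60.0),
    "device": FeatureSpec(str, categories=frozenset({"sensor_a", "sensor_b"})),
}
rows = [
    {"temperature_c": 21.5, "device": "sensor_a"},   # valid
    {"temperature_c": None, "device": "sensor_a"},   # null in a non-nullable feature
    {"temperature_c": 20.0, "device": "sensor_c"},   # category not in the schema
    {"temperature_c": 70.7, "device": "sensor_b"},   # out of range, e.g. sent in Fahrenheit
]
violations = validate_batch(rows, schema)
for message in violations:
    print(message)
```

In this example the first row passes and each of the other three rows produces one violation, matching the null, unseen-category, and out-of-range cases in turn.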
Schema evolution
Schemas are not frozen. As products change, features legitimately gain new categories or shift ranges. The goal is to distinguish expected evolution from real breakage, so teams review and update schemas deliberately rather than silently widening them to make alerts disappear.
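One way to keep schema updates deliberate is to compare the old and new schema versions and put the difference in front of a reviewer. The sketch below does this for category sets only; diff_categories, the version names, and the payment_method feature are hypothetical, and a fuller version would also compare types and ranges.

```python
def diff_categories(old, new):
    """Return, per feature, the categories added or removed between two schema versions."""
    changes = {}
    for name in old.keys() | new.keys():
        before = old.get(name, frozenset())
        after = new.get(name, frozenset())
        added, removed = after - before, before - after
        if added or removed:
            changes[name] = {"added": sorted(added), "removed": sorted(removed)}
    return changes


# Illustrative schema versions mapping feature name to allowed categories.
schema_v1 = {"payment_method": frozenset({"card", "bank_transfer"})}
schema_v2 = {"payment_method": frozenset({"card", "bank_transfer", "wallet"})}

changes = diff_categories(schema_v1, schema_v2)
print(changes)
```

A diff like this turns "widen the schema until the alert goes away" into an explicit change that someone has to read and approve.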
Key idea
A schema encodes expected types, ranges, and categories so validation catches broken data before it silently corrupts a model.