The Feature Pipeline
Models do not eat raw logs. A feature pipeline is the code that turns raw records into the clean numeric and categorical inputs a model expects.
Typical stages
- Cleaning removes duplicates, fixes types, and handles missing values.
- Transformation derives features such as ratios, counts in a time window, or text embeddings.
- Encoding converts categories into numbers the model can use.
- Assembly joins everything into a single feature vector per example.
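The four stages above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the record fields (user, clicks, views, country), the fixed country vocabulary, and every function name are assumptions made up for the example.

```python
# Hypothetical raw records; the field names are illustrative only.
RAW = [
    {"user": "a", "clicks": "3", "views": "10", "country": "US"},
    {"user": "a", "clicks": "3", "views": "10", "country": "US"},  # duplicate
    {"user": "b", "clicks": None, "views": "4", "country": "DE"},
]

# Assumed fixed vocabulary, so the encoding has a stable width and order.
COUNTRIES = ["DE", "US"]


def clean(records):
    """Cleaning: drop exact duplicates, cast strings to int, fill missing with 0."""
    seen, out = set(), []
    for r in records:
        key = (r["user"], r["clicks"], r["views"], r["country"])
        if key in seen:
            continue
        seen.add(key)
        out.append({
            "user": r["user"],
            "clicks": int(r["clicks"]) if r["clicks"] is not None else 0,
            "views": int(r["views"]) if r["views"] is not None else 0,
            "country": r["country"],
        })
    return out


def transform(record):
    """Transformation: derive a click-through ratio, guarding against zero views."""
    views = record["views"]
    ratio = record["clicks"] / views if views else 0.0
    return {**record, "ctr": ratio}


def encode(record):
    """Encoding: one-hot encode the country against the fixed vocabulary."""
    return [1.0 if record["country"] == c else 0.0 for c in COUNTRIES]


def assemble(record):
    """Assembly: join numeric and encoded features into one vector per example."""
    numeric = [float(record["clicks"]), float(record["views"]), record["ctr"]]
    return numeric + encode(record)


def run_pipeline(records):
    return [assemble(transform(r)) for r in clean(records)]


features = run_pipeline(RAW)
```

Each stage is a small pure function, so the stages can be tested separately and composed in one place.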
Reuse and consistency
The same pipeline must run during training and during serving. If the two paths diverge, the model sees inputs at serving time that differ from the ones it learned from, a problem called training-serving skew. Many teams package the transformation logic so that the identical code runs in both places.
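One way to keep the two paths aligned is to route both through a single function. The sketch below assumes hypothetical fields (purchase_count, is_member, label) and uses an identity function in place of a real model. It only shows the shape of the idea.

```python
import math


def featurize(record):
    """Single source of truth for feature logic (hypothetical fields)."""
    return [
        math.log1p(record["purchase_count"]),
        1.0 if record["is_member"] else 0.0,
    ]


def build_training_rows(history):
    """Training path: featurize each logged record and pair it with its label."""
    return [(featurize(r), r["label"]) for r in history]


def handle_request(request, predict):
    """Serving path: featurize the live request with the same function."""
    return predict(featurize(request))


history = [{"purchase_count": 3, "is_member": True, "label": 1}]
rows = build_training_rows(history)

live = {"purchase_count": 3, "is_member": True}
# The identity function stands in for a trained model here (an assumption).
served = handle_request(live, predict=lambda x: x)
```

Because neither path reimplements the feature logic, a change to featurize reaches training and serving together.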
Pipelines as first-class artifacts
A good pipeline is deterministic and versioned, just like a model: given the same input, it always produces the same features. Treating it as ad hoc notebook code is a common source of bugs that surface only in production, where feature values are rarely inspected.
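Both properties can be checked in code. The sketch below is one possible approach under stated assumptions: the config layout, the version string, and the helper names are all hypothetical. It fingerprints the pipeline configuration with a hash and re-runs the pipeline on a fixed input to confirm the output does not change.

```python
import hashlib
import json

# Hypothetical pipeline configuration; the keys and values are illustrative.
PIPELINE_CONFIG = {
    "version": "1.2.0",
    "steps": ["clean", "ratio", "one_hot"],
    "vocabulary": ["DE", "US"],
}


def config_fingerprint(config):
    """Hash the canonical JSON form so any config change yields a new ID."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def featurize(record, config):
    """Toy pipeline: a ratio plus a one-hot encoding driven by the config."""
    ratio = record["clicks"] / record["views"] if record["views"] else 0.0
    one_hot = [1.0 if record["country"] == c else 0.0 for c in config["vocabulary"]]
    return [ratio] + one_hot


def check_deterministic(fn, record, config, runs=3):
    """Re-run the pipeline on one input and confirm identical output each time."""
    first = fn(record, config)
    return all(fn(record, config) == first for _ in range(runs - 1))


record = {"clicks": 1, "views": 4, "country": "DE"}
fingerprint = config_fingerprint(PIPELINE_CONFIG)
is_deterministic = check_deterministic(featurize, record, PIPELINE_CONFIG)
```

Storing the fingerprint next to a trained model makes it possible to confirm later that serving uses the pipeline version the model was trained with.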
Key idea
A feature pipeline deterministically converts raw data into model inputs and must run identically in training and serving.