Feature Pipeline Design

The skew problem

One of the most common production ML bugs is training-serving skew: a feature is computed one way in training and a different way at serving time. The model then receives inputs at serving time that do not match the distribution it was trained on.
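A small illustration may help make the failure concrete. In the sketch below, the two functions, the spend feature, and the constant TRAINING_MEAN_SPEND are all hypothetical; the only point is that two separately written code paths can disagree on the same raw input, here on how a missing value is handled.

```python
from typing import List, Optional

# Hypothetical mean of the spend column, computed over the training set.
TRAINING_MEAN_SPEND = 40.0


def spend_feature_training(spend: Optional[float]) -> float:
    """Training path: a missing spend is imputed with the training mean."""
    return TRAINING_MEAN_SPEND if spend is None else spend


def spend_feature_serving(spend: Optional[float]) -> float:
    """Serving path, written separately: a missing spend becomes 0.0."""
    return 0.0 if spend is None else spend


def find_skew(values: List[Optional[float]]) -> List[Optional[float]]:
    """Return the raw inputs on which the two paths produce different features."""
    return [
        v for v in values
        if spend_feature_training(v) != spend_feature_serving(v)
    ]


# Only the missing value is handled differently by the two paths.
skewed_inputs = find_skew([10.0, None, 25.5])
print(skewed_inputs)
```

A comparison like find_skew, run over a sample of real requests, is one simple way to detect this kind of disagreement before it reaches a model.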

The feature store idea

A feature store computes each feature once and serves it to both the training path and the serving path.

Offline store: large volumes of historical feature values, used for training
Online store: low-latency lookups, used for serving
Shared definitions: the same code or logic produces both
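The three parts above can be sketched in a few lines. This is a minimal illustration, not the API of any particular feature store: compute_features, build_offline_rows, OnlineStore, and the avg_order_value feature are all hypothetical names, and an in-memory dict stands in for a real low-latency store.

```python
from typing import Dict, List


def compute_features(raw: Dict[str, float]) -> Dict[str, float]:
    """The single shared definition that both paths call."""
    orders = raw["order_count"]
    return {
        # Guard against division by zero for entities with no orders.
        "avg_order_value": raw["total_spend"] / orders if orders else 0.0,
    }


def build_offline_rows(history: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Offline path: apply the shared definition to every historical record."""
    return [compute_features(raw) for raw in history]


class OnlineStore:
    """Online path: a dict standing in for a low-latency key-value store."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, float]] = {}

    def put(self, entity_id: str, raw: Dict[str, float]) -> None:
        # Features are computed with the same function as the offline path.
        self._rows[entity_id] = compute_features(raw)

    def get(self, entity_id: str) -> Dict[str, float]:
        return self._rows[entity_id]


raw_record = {"total_spend": 120.0, "order_count": 4}
offline_rows = build_offline_rows([raw_record])

store = OnlineStore()
store.put("user_1", raw_record)
online_row = store.get("user_1")
```

Because both paths call compute_features, a change to the feature logic reaches training and serving together, which is the property the shared definition is meant to provide.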

Point-in-time correctness

When building training data, each feature value must reflect only what was known at the time of the event it is joined to. Joining current feature values onto past events leaks information from the future into training.
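A point-in-time lookup can be sketched with a binary search over a timestamped history. The function value_as_of, the integer timestamps, and the sample data are assumptions made for illustration. The sketch also assumes that a value recorded at exactly the event time counts as known; a stricter pipeline might exclude it.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def value_as_of(history: List[Tuple[int, float]], event_time: int) -> Optional[float]:
    """Return the latest feature value with timestamp <= event_time.

    `history` must be sorted by timestamp. Returns None when nothing
    was known yet at `event_time`.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    return history[idx - 1][1] if idx > 0 else None


# (timestamp, value) pairs for one entity's spend feature, oldest first.
spend_history = [(1, 10.0), (5, 50.0), (9, 90.0)]
event_times = [0, 3, 5, 7]

# Correct: each event sees only the value known at its own time.
point_in_time = [value_as_of(spend_history, t) for t in event_times]

# Leaky: every past event is joined to the current (latest) value.
leaky = [spend_history[-1][1] for _ in event_times]
```

Here every event in the leaky join receives 90.0, a value first recorded at time 9, after all four events occurred.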

Streaming versus batch features

Batch features: computed periodically, such as spend over the last 30 days
Streaming features: updated in near real time, such as clicks in the last minute
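The two examples above can be contrasted in code. This is a rough sketch under stated assumptions: spend_last_30_days and ClicksLastMinute are hypothetical names, days and seconds are plain numbers, and click timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque
from typing import Deque, Iterable, Tuple


def spend_last_30_days(purchases: Iterable[Tuple[int, float]], as_of_day: int) -> float:
    """Batch feature: recomputed from stored history each time the job runs."""
    return sum(
        amount for day, amount in purchases
        if as_of_day - 30 < day <= as_of_day
    )


class ClicksLastMinute:
    """Streaming feature: updated incrementally as each click arrives."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window = window_seconds
        self._times: Deque[float] = deque()

    def record(self, ts: float) -> None:
        self._times.append(ts)
        self._evict(ts)

    def count(self, now: float) -> int:
        self._evict(now)
        return len(self._times)

    def _evict(self, now: float) -> None:
        # Drop clicks that fell out of the window (now - window, now].
        while self._times and self._times[0] <= now - self.window:
            self._times.popleft()


# Batch: one pass over the full purchase history, as (day, amount) pairs.
purchases = [(1, 20.0), (40, 5.0), (60, 7.5)]
batch_value = spend_last_30_days(purchases, as_of_day=60)

# Streaming: state changes with every event and every read.
clicks = ClicksLastMinute()
for ts in (0.0, 30.0, 59.0):
    clicks.record(ts)
count_at_59 = clicks.count(59.0)
count_at_61 = clicks.count(61.0)
```

The batch function is stateless and can be rerun over any historical date, while the streaming counter keeps only the events still inside its window.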

Key idea

Define each feature once and serve it to training and inference from the same logic to prevent skew, and build training data with point-in-time correctness to prevent leakage.

Check yourself