Machine Learning

The Data Drift Detection Deep

Spotting when incoming inputs no longer resemble the data the model trained on.

5 min read · intro

What data drift means

Data drift is a change in the distribution of inputs the model receives. The relationship between input and label may be unchanged, but the inputs themselves move away from the training data, so predictions land in unfamiliar territory.

Measuring the shift

We compare a recent window of production data against a reference window from training.

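  • Population Stability Index (PSI) buckets a feature and sums, across buckets, the difference between the production and reference shares multiplied by the log of their ratio.
  • The Kolmogorov-Smirnov (KS) test compares two continuous distributions by the largest gap between their empirical cumulative distribution functions.
  • The chi-squared test compares categorical frequency counts.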

A common rule of thumb treats a PSI above roughly 0.2 as a warning worth investigating.
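
As a rough illustration of the three measures listed above, the sketch below implements each one in plain Python. The function names, the ten quantile buckets, and the small eps floor for empty buckets are assumptions made for this example, not part of any particular library. The KS and chi-squared functions return only the test statistic, not a p-value.

```python
import math
from bisect import bisect_right
from collections import Counter


def psi(reference, production, n_buckets=10, eps=1e-6):
    """Population Stability Index for one numeric feature."""
    # Bucket edges come from reference quantiles, so each bucket holds
    # roughly the same share of the reference data.
    ref_sorted = sorted(reference)
    edges = [ref_sorted[(len(ref_sorted) * k) // n_buckets] for k in range(1, n_buckets)]

    def shares(values):
        counts = [0] * n_buckets
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # eps (an assumed floor) keeps empty buckets from producing log(0).
        return [max(c / len(values), eps) for c in counts]

    ref_share, prod_share = shares(reference), shares(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref_share, prod_share))


def ks_statistic(reference, production):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    ref_sorted, prod_sorted = sorted(reference), sorted(production)
    largest_gap = 0.0
    for x in ref_sorted + prod_sorted:
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        prod_cdf = bisect_right(prod_sorted, x) / len(prod_sorted)
        largest_gap = max(largest_gap, abs(ref_cdf - prod_cdf))
    return largest_gap


def chi_squared_statistic(reference, production):
    """Chi-squared statistic comparing category counts in two samples."""
    ref_counts, prod_counts = Counter(reference), Counter(production)
    n_ref, n_prod = len(reference), len(production)
    stat = 0.0
    for category in set(ref_counts) | set(prod_counts):
        # Expected counts assume both samples share one pooled distribution.
        pooled = (ref_counts[category] + prod_counts[category]) / (n_ref + n_prod)
        for observed, n in ((ref_counts[category], n_ref), (prod_counts[category], n_prod)):
            expected = pooled * n
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up example data: production values are the reference shifted upward.
reference = list(range(1000))
shifted = [x + 500 for x in reference]
print(psi(reference, shifted), ks_statistic(reference, shifted))
```

Taking the bucket edges from the reference window keeps the buckets fixed between checks, so scores computed at different times remain comparable.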

Univariate versus multivariate

Single features can each look stable while their joint distribution shifts. Embedding-based or model-based detectors catch correlated drift that per-feature tests miss.
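
A toy example can make this concrete. In the sketch below, the data is made up for illustration: both features cover exactly the same values in the reference and production windows, so a per-feature KS check sees no gap, yet the relationship between the two features flips. Comparing Pearson correlations is only a simple stand-in here for the embedding-based or model-based detectors mentioned above, and the helper names are hypothetical.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Reference window: feature 2 rises with feature 1.
values = list(range(-100, 101))
ref_f1, ref_f2 = values, values
# Production window: same values per feature, but feature 2 now falls
# as feature 1 rises.
prod_f1, prod_f2 = values, [-v for v in values]

# Per-feature view: each marginal distribution is unchanged.
marginal_gaps = [ks_statistic(ref_f1, prod_f1), ks_statistic(ref_f2, prod_f2)]

# Joint view: the correlation between the features has reversed.
ref_corr = pearson(ref_f1, ref_f2)
prod_corr = pearson(prod_f1, prod_f2)

print("per-feature KS gaps:", marginal_gaps)
print("correlation, reference vs production:", ref_corr, prod_corr)
```

A correlation check only catches linear pairwise changes, so it should be read as an illustration of the gap in per-feature tests rather than a general multivariate detector.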

Acting on drift

  • Drift is a warning, not proof of broken predictions. Confirm with quality metrics when labels exist.
  • Tune window size to balance sensitivity against noise.
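
The window-size trade-off can be sketched with simulated data. The example below assumes a stream drawn from the same distribution as the reference, so no real drift is present, and scores consecutive windows with a KS statistic. The window sizes of 50 and 500, the sample sizes, and the function names are arbitrary choices for this illustration.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )


def window_scores(reference, stream, window_size):
    """KS statistic for each consecutive, non-overlapping window of the stream."""
    return [
        ks_statistic(reference, stream[start:start + window_size])
        for start in range(0, len(stream) - window_size + 1, window_size)
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
reference = [rng.gauss(0, 1) for _ in range(2000)]
# Simulated production stream from the same distribution: no real drift.
stream = [rng.gauss(0, 1) for _ in range(1000)]

small = window_scores(reference, stream, window_size=50)
large = window_scores(reference, stream, window_size=500)

print("largest score, windows of 50: ", max(small))
print("largest score, windows of 500:", max(large))
```

Small windows react sooner but produce larger scores from sampling noise alone, so a threshold tuned on large windows would raise false alarms on small ones. Larger windows give steadier scores at the cost of slower detection.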

Key idea

Data drift is a change in the input distribution, measured by comparing production windows to a reference with measures like PSI or the KS test, and it signals a need for investigation rather than guaranteed failure.

Check yourself

1. What does data drift specifically refer to?

2. Why can per-feature drift tests miss real problems?

3. A PSI value well above 0.2 most directly suggests what?