The Data Drift Detection Deep Dive

What data drift means

Data drift is a change in the distribution of inputs the model receives. The relationship between input and label may be unchanged, but the inputs themselves move away from the training data, so predictions land in unfamiliar territory.

Measuring the shift

We compare a recent window of production data against a reference window from training.

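The Population Stability Index (PSI) buckets a feature and sums, across buckets, the difference between recent and reference frequencies weighted by the log of their ratio.
The Kolmogorov-Smirnov test compares two continuous distributions by the largest gap between their empirical cumulative distribution functions.
The chi-squared test compares categorical frequency counts.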
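
As a rough illustration of the last two tests, the sketch below computes only the test statistics with the Python standard library. The function names are hypothetical, and no p-values are produced; in practice a statistics library such as SciPy (scipy.stats.ks_2samp, scipy.stats.chi2_contingency) would normally be used instead.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(reference, recent):
    """Largest absolute gap between the two empirical CDFs."""
    ref_sorted = sorted(reference)
    rec_sorted = sorted(recent)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is always reached at one of the observed values.
    for x in set(ref_sorted) | set(rec_sorted):
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        rec_cdf = bisect_right(rec_sorted, x) / len(rec_sorted)
        gap = max(gap, abs(ref_cdf - rec_cdf))
    return gap


def chi_squared_statistic(reference, recent):
    """Chi-squared statistic for a 2 x k table of category counts."""
    ref_counts = Counter(reference)
    rec_counts = Counter(recent)
    n_ref, n_rec = len(reference), len(recent)
    total = n_ref + n_rec
    stat = 0.0
    for category in set(ref_counts) | set(rec_counts):
        col_total = ref_counts[category] + rec_counts[category]
        rows = ((ref_counts[category], n_ref), (rec_counts[category], n_rec))
        for observed, row_total in rows:
            # Expected count if both windows shared one distribution.
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    return stat
```

Both functions return 0 when the two windows are identical and grow as the windows diverge. Turning a statistic into a decision still requires a p-value or a threshold chosen for the data at hand.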

A PSI above roughly 0.2 is a common warning threshold worth investigating.
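
A minimal PSI sketch follows, using only the standard library. The function name, the equal-width binning over the reference range, the bin count of 10, and the small eps floor for empty buckets are all assumptions made for illustration; quantile-based buckets are another common choice.

```python
import math


def psi(reference, recent, n_bins=10, eps=1e-6):
    """Population Stability Index of one numeric feature.

    Assumes equal-width bins over the reference range. `eps` keeps
    empty buckets from producing log(0) or division by zero.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp so out-of-range recent values land in the edge buckets.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    rec_frac = bin_fractions(recent)
    # Sum of (recent - reference) * ln(recent / reference) over buckets.
    return sum((r - e) * math.log(r / e) for e, r in zip(ref_frac, rec_frac))


reference = [i / 100 for i in range(1000)]  # synthetic reference window
shifted = [v + 3 for v in reference]        # same shape, moved to the right
print(psi(reference, reference), psi(reference, shifted))
```

The result depends on the bucketing, so a threshold such as 0.2 is only meaningful when the bucket scheme is held fixed between runs.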

Univariate versus multivariate

Single features can each look stable while their joint distribution shifts. Embedding-based or model-based detectors catch correlated drift that per-feature tests miss.
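
The sketch below illustrates the problem on synthetic data: two features keep the same marginal distributions while their correlation flips sign. Comparing Pearson correlations is used here as a deliberately simple stand-in for a joint check. It is not an embedding-based or model-based detector, and every name and parameter in it is hypothetical.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def sample_window(sign, n=500):
    """Synthetic window where y follows sign * x plus a little noise."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [sign * x + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


ref_x, ref_y = sample_window(+1)  # reference: x and y move together
rec_x, rec_y = sample_window(-1)  # recent: x and y move in opposition

# Per-feature view: each marginal looks about the same in both windows.
ks_x = ks_statistic(ref_x, rec_x)
ks_y = ks_statistic(ref_y, rec_y)

# Joint view: the relationship between the features has reversed.
corr_ref = pearson(ref_x, ref_y)
corr_rec = pearson(rec_x, rec_y)
print(ks_x, ks_y, corr_ref, corr_rec)
```

A correlation check only sees linear, pairwise relationships, which is one reason detectors built on a model or an embedding are preferred for higher-dimensional inputs.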

Acting on drift

Drift is a warning, not proof of broken predictions. Confirm with quality metrics when labels exist.
Tune window size to balance sensitivity against noise.
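
To make the window trade-off concrete, here is a small sketch on a synthetic stream with no drift at all. The drift score (absolute shift of the window mean from the reference mean), the window sizes, and the function name are assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable
stream = [rng.gauss(0, 1) for _ in range(20_000)]  # synthetic, drift-free


def window_scores(values, window_size, reference_mean=0.0):
    """Absolute mean shift of each non-overlapping window."""
    return [
        abs(statistics.fmean(values[i:i + window_size]) - reference_mean)
        for i in range(0, len(values) - window_size + 1, window_size)
    ]


# Average score when nothing has changed: this is pure sampling noise.
noise_small = statistics.fmean(window_scores(stream, 50))
noise_large = statistics.fmean(window_scores(stream, 2000))
print(noise_small, noise_large)
```

Small windows produce noisier scores even without drift, so they need looser thresholds to avoid false alarms. Large windows are steadier, but a recent change is diluted until enough post-change data has accumulated in the window.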

Key idea