What data drift means
Data drift is a change in the distribution of inputs the model receives. The relationship between input and label may be unchanged, but the inputs themselves move away from the training data, so predictions land in unfamiliar territory.
Measuring the shift
We compare a recent window of production data against a reference window from training.
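- The Population Stability Index (PSI) buckets a feature and sums, over the buckets, the difference between the production and reference proportions multiplied by the log of their ratio.
- The Kolmogorov-Smirnov (KS) test compares two continuous distributions by the largest gap between their empirical cumulative distribution functions.
- The chi-squared test compares categorical frequency counts.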
A PSI above roughly 0.2 is a common rule-of-thumb warning threshold; values past it are worth investigating.
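The PSI calculation can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function name psi, the choice of ten quantile buckets taken from the reference window, the eps floor for empty buckets, and the synthetic Gaussian data are all assumptions made for the example.

```python
import math
import random
from bisect import bisect_right


def psi(reference, production, n_buckets=10, eps=1e-6):
    """Population Stability Index of `production` against `reference`.

    Bucket edges are quantiles of the reference window (an assumed choice),
    so each bucket holds roughly an equal share of the reference data.
    """
    ref_sorted = sorted(reference)
    # Interior edges at the 1/n, 2/n, ... quantiles of the reference window.
    edges = [ref_sorted[(i * len(ref_sorted)) // n_buckets]
             for i in range(1, n_buckets)]

    def proportions(values):
        counts = [0] * n_buckets
        for v in values:
            # Bucket index = number of edges that are <= v.
            counts[bisect_right(edges, v)] += 1
        # eps keeps empty buckets from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    p_ref = proportions(reference)
    p_prod = proportions(production)
    # Sum over buckets of (difference in proportions) * log(ratio).
    return sum((q - p) * math.log(q / p) for p, q in zip(p_ref, p_prod))


# Synthetic data: one window from the same distribution, one with a shifted mean.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

psi_same = psi(reference, same)
psi_shifted = psi(reference, shifted)
print(f"no drift: {psi_same:.4f}  shifted mean: {psi_shifted:.4f}")
```

In this setup the unshifted window should score close to zero and the shifted window should land well past the 0.2 rule of thumb. Bucket count and edge placement both affect the score, so PSI values are only comparable when the bucketing is held fixed.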
Univariate versus multivariate
Single features can each look stable while their joint distribution shifts. Embedding-based or model-based detectors catch correlated drift that per-feature tests miss.
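The failure mode is easy to reproduce on synthetic data. In the sketch below, both features stay standard normal in each window, so a per-feature KS statistic stays small, while the correlation between them jumps from 0 to 0.9. The correlation comparison is only a stand-in chosen to keep the example short; it is not an embedding-based or model-based detector, and the helper names (ks_statistic, sample_pairs, pearson) are hypothetical.

```python
import math
import random


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every value equal to x.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def sample_pairs(rng, n, rho):
    """Draw n pairs of standard normals with correlation rho."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs


def pearson(pairs):
    """Pearson correlation between the two coordinates of each pair."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


rng = random.Random(1)
reference = sample_pairs(rng, 4000, rho=0.0)   # features independent
production = sample_pairs(rng, 4000, rho=0.9)  # same marginals, now correlated

# Per-feature checks: each marginal is still a standard normal.
ks_x = ks_statistic([p[0] for p in reference], [p[0] for p in production])
ks_y = ks_statistic([p[1] for p in reference], [p[1] for p in production])

# Joint check: the relationship between the features has changed.
corr_shift = abs(pearson(production) - pearson(reference))

print(f"KS x: {ks_x:.3f}  KS y: {ks_y:.3f}  correlation shift: {corr_shift:.3f}")
```

A detector that looks at one feature at a time would pass both columns here, which is the gap multivariate methods are meant to close.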
Acting on drift
- Drift is a warning, not proof of broken predictions. Confirm with quality metrics when labels exist.
- Tune window size to balance sensitivity against noise.
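The window-size trade-off can be seen by measuring how much PSI a window produces when nothing has drifted. The sketch below is an illustration under assumed settings (ten quantile buckets, a 2000-point reference, 50 trials, synthetic Gaussian data); the helper mean_psi_without_drift is hypothetical, and the psi function is a compact version of the bucketed definition given earlier.

```python
import math
import random
from bisect import bisect_right


def psi(reference, production, n_buckets=10, eps=1e-6):
    """PSI with quantile buckets taken from the reference window."""
    ref_sorted = sorted(reference)
    edges = [ref_sorted[(i * len(ref_sorted)) // n_buckets]
             for i in range(1, n_buckets)]

    def proportions(values):
        counts = [0] * n_buckets
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # eps avoids log(0) when a bucket is empty.
        return [max(c / len(values), eps) for c in counts]

    p_ref, p_prod = proportions(reference), proportions(production)
    return sum((q - p) * math.log(q / p) for p, q in zip(p_ref, p_prod))


def mean_psi_without_drift(window_size, trials=50, seed=0):
    """Average PSI over windows drawn from the SAME distribution as the reference.

    Any nonzero value here is pure sampling noise, since there is no drift.
    """
    rng = random.Random(seed)
    reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    scores = []
    for _ in range(trials):
        window = [rng.gauss(0.0, 1.0) for _ in range(window_size)]
        scores.append(psi(reference, window))
    return sum(scores) / trials


noise_small = mean_psi_without_drift(window_size=50)
noise_large = mean_psi_without_drift(window_size=2000)
print(f"window=50: {noise_small:.3f}  window=2000: {noise_large:.3f}")
```

Small windows react quickly but carry a much higher noise floor, so a fixed threshold may fire on them without any real shift. Larger windows are quieter but take longer to reflect a change.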
Key idea
Data drift is a change in the input distribution, measured by comparing production windows to a reference with tests like PSI or KS. It is a signal to investigate, not proof of failure.