What data drift is
Data drift, also called covariate shift, is a change in the distribution of model inputs over time relative to the training data. The relationship between inputs and labels may stay the same, but the model now sees inputs unlike those it was trained on, which can quietly erode performance.
How to detect it
- Compare the current input distribution to a reference window, feature by feature.
- Use a distance or test statistic such as the population stability index (PSI), Kullback-Leibler divergence, or a Kolmogorov-Smirnov test.
- Alert when a feature's drift score crosses a tuned threshold.
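The three steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production monitor: the function names (`psi`, `ks_statistic`, `drifted_features`), the ten quantile bins, the feature names, and the 0.2 alert threshold are all assumptions made for the example. The threshold is a commonly quoted rule of thumb for PSI and would need tuning on real data.

```python
import bisect
import math
import random
import statistics


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population stability index of `current` measured against `reference`."""
    # Cut points are reference quantiles, so each reference bin holds ~1/n_bins.
    edges = statistics.quantiles(reference, n=n_bins)

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect.bisect_right(edges, v)] += 1
        # eps keeps the logarithm finite when a bin is empty.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for v in ref + cur:
        f_ref = bisect.bisect_right(ref, v) / len(ref)
        f_cur = bisect.bisect_right(cur, v) / len(cur)
        gap = max(gap, abs(f_ref - f_cur))
    return gap


def drifted_features(reference, current, threshold=0.2):
    """Return {feature: PSI} for every feature whose PSI exceeds the threshold."""
    scores = {name: psi(reference[name], current[name]) for name in reference}
    return {name: score for name, score in scores.items() if score > threshold}


# Synthetic data (hypothetical features): "age" is stable, "income" has shifted.
rng = random.Random(0)
reference = {
    "age": [rng.gauss(0, 1) for _ in range(5000)],
    "income": [rng.gauss(0, 1) for _ in range(5000)],
}
current = {
    "age": [rng.gauss(0, 1) for _ in range(5000)],     # same distribution
    "income": [rng.gauss(1, 1) for _ in range(5000)],  # mean moved by one standard deviation
}
flagged = drifted_features(reference, current)
print(sorted(flagged))
```

Binning on reference quantiles keeps every reference bin equally populated, which avoids near-empty bins in the tails. The hand-written `ks_statistic` returns only the distance; a library routine such as `scipy.stats.ks_2samp` also reports a p-value if a formal test is wanted.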
Why labels are not required
Data drift detection looks only at inputs, so it can run as soon as live data arrives, without waiting for ground-truth labels, which often arrive late or not at all. That usually makes it the earliest warning signal available.
Reading the signal
Drift is a warning, not a verdict. Some drift is harmless and some breaks the model. Pair drift alerts with performance monitoring to decide whether retraining is warranted, and watch for seasonal patterns that look like drift but are expected.
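One way to turn this into a triage rule is sketched below. Everything here is hypothetical: the `Snapshot` fields, the thresholds, and the returned labels are illustrative choices, and the seasonal check assumes a drift score computed against the same period of an earlier cycle is available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    drift_vs_training: float        # e.g. worst per-feature PSI against the training data
    drift_vs_same_season: float     # same score against a seasonally matched window
    accuracy_drop: Optional[float]  # None while ground-truth labels are still missing


def triage(s, drift_threshold=0.2, max_accuracy_drop=0.05):
    """Combine a drift alert with performance evidence (illustrative policy only)."""
    if s.drift_vs_training <= drift_threshold:
        return "ok"                 # no drift alert fired
    if s.accuracy_drop is not None and s.accuracy_drop > max_accuracy_drop:
        return "retrain"            # drift confirmed by a measured performance loss
    if s.drift_vs_same_season <= drift_threshold:
        return "expected_seasonal"  # inputs match what this season looked like before
    if s.accuracy_drop is None:
        return "investigate"        # drift without labels: a warning, not a verdict
    return "harmless_drift"         # inputs moved but performance held


print(triage(Snapshot(0.35, 0.05, None)))
print(triage(Snapshot(0.35, 0.30, 0.12)))
```

The performance check comes before the seasonal check so that a measured accuracy loss is never dismissed as seasonal.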
Key idea
Data drift detection compares live input distributions to a training reference using statistical distances, giving an early, label-free warning that inputs have shifted.