What leakage is
Data leakage is when information from outside the training set sneaks into the model — so it learns something it won't have at prediction time. The result: amazing offline metrics that collapse in production.
Classic sources
- Preprocessing before splitting — fitting a scaler/imputer on the whole dataset leaks test statistics.
- Target leakage — a feature that's a proxy for the label (e.g. "account_closed_date" predicting churn).
- Temporal leakage — training on future data to predict the past.
The fix
Split first, then fit every transform on the training fold only and apply it to validation/test. Use pipelines so this is automatic, and always ask: "would this feature actually be available at prediction time?"
Key idea
If a result looks too good, suspect leakage before you celebrate.