Train-Test Leakage Avoidance
Data leakage happens when information that would not be available at prediction time, such as test-set statistics or the target itself, reaches the model during training. The resulting scores look great but collapse in production, so avoiding leakage is essential to trustworthy evaluation.
Common sources of leakage
- Preprocessing on all data: a scaler or imputer is fit before splitting, so test-set statistics influence training.
- Target leakage: a feature secretly encodes the answer or is only available after the outcome is known.
- Temporal leakage: future information is used to predict the past in time series data.
- Duplicate rows: the same record appears in both the train and test splits.
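Two of these sources are easy to see on toy data. The sketch below uses only the Python standard library; the values and row tuples are made up for illustration and are not from any real dataset. It shows how a centering statistic shifts when test rows are included in the fit, and how to flag rows that sit on both sides of a split.

```python
from statistics import mean

# Hypothetical toy data: one numeric feature, split by position.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = values[:4], values[4:]

# Leaky: the centering statistic is computed on all rows, test included.
leaky_mean = mean(values)   # 22.0
# Clean: the statistic comes from the training rows only.
clean_mean = mean(train)    # 2.5

# Duplicate check: rows (as tuples) that appear on both sides of the split.
train_rows = [(1, "a"), (2, "b"), (3, "c")]
test_rows = [(3, "c"), (4, "d")]
overlap = set(train_rows) & set(test_rows)   # {(3, 'c')}
```

A non-empty overlap means the model is partly evaluated on rows it has already seen, so those rows should be deduplicated before splitting or kept on one side only.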
The disciplined workflow
- Split first, then fit every transform on the training set alone.
- Wrap preprocessing and modeling in a pipeline so the same steps are refit on each fold's training portion and then applied to its held-out portion.
- Use cross-validation that performs all fitting inside each fold.
- For time series, split by time rather than randomly.
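The workflow above can be sketched with scikit-learn, assuming that library is available. The synthetic data, the choice of imputer, scaler, and logistic regression, and the fold counts are all illustrative assumptions rather than a recommended setup.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not from the text).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Split first; the test rows stay untouched until the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Imputer, scaler, and model travel together, so each CV fold refits
# the preprocessing on that fold's training portion only.
model = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), LogisticRegression()
)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final check: fit on all training rows, score once on the held-out test set.
test_score = model.fit(X_train, y_train).score(X_test, y_test)

# For time-ordered rows, TimeSeriesSplit keeps every training index
# earlier than every test index in each fold.
time_folds = list(TimeSeriesSplit(n_splits=4).split(X))
```

Passing the whole pipeline to `cross_val_score`, rather than pre-scaled data, is what keeps the fitting inside each fold. For time series, the `TimeSeriesSplit` splitter can be passed as the `cv` argument in place of the integer.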
A telltale sign of leakage is validation performance that is suspiciously high, or that drops sharply once the model is deployed.
Key idea
Leakage lets test-set or future information reach the model and inflates scores, so split first, fit transforms on training data only, and wrap everything in a pipeline evaluated with proper cross-validation.