Train-Test Leakage Avoidance

Prevent test information from contaminating training so scores reflect real generalization.

Data leakage happens when information that will not be available at prediction time, such as test set statistics or the outcome itself, reaches the model during training. The resulting scores look great but collapse in production, so avoiding leakage is essential to trustworthy evaluation.

Common sources of leakage

  • Preprocessing on all data: a scaler or imputer is fit before splitting, so test statistics influence training.
  • Target leakage: a feature secretly encodes the answer or is only available after the outcome is known.
  • Temporal leakage: in time series, information from the future is used to predict the past.
  • Duplicate rows: the same record appears in both the training and the test set, so the model is scored on examples it has already seen.
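
Two of these sources can be checked directly in a few lines. The sketch below is illustrative only: the toy rows, the `fit_scaler` and `scale` helpers, and the positional split are assumptions made for the example, not part of any library. It compares scaler statistics fit on all rows against statistics fit on the training rows alone, then looks for rows that sit on both sides of the split.

```python
from statistics import mean, pstdev

# Hypothetical toy data: (feature, label) rows; the last two rows are the test set.
rows = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1), (100.0, 1), (2.0, 0)]
train, test = rows[:4], rows[4:]


def fit_scaler(values):
    """Return the (mean, standard deviation) a standard scaler would learn."""
    return mean(values), pstdev(values)


def scale(values, stats):
    """Standardize values with previously fitted statistics."""
    mu, sigma = stats
    return [(v - mu) / sigma for v in values]


train_x = [x for x, _ in train]
all_x = [x for x, _ in rows]

# Leaky: the scaler is fit before splitting, so test rows shape its statistics.
leaky_stats = fit_scaler(all_x)
# Clean: the scaler is fit on the training rows only.
clean_stats = fit_scaler(train_x)

leaky_train = scale(train_x, leaky_stats)
clean_train = scale(train_x, clean_stats)

# Rows present on both sides of the split.
duplicates = set(train) & set(test)

print("mean learned from all rows:  ", leaky_stats[0])
print("mean learned from train rows:", clean_stats[0])
print("rows shared by train and test:", duplicates)
```

In this toy case a single extreme test value shifts the mean the scaler learns, so the training features the model sees already depend on the test set. The set intersection only catches exact duplicates; near-duplicates would need a looser comparison.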

The disciplined workflow

  • Split first, then fit every transform on the training set alone.
  • Wrap preprocessing and modeling in a pipeline so every transform is refit on the training portion of each fold and only applied to the held-out portion.
  • Use cross-validation that performs all fitting inside each fold, never on the full dataset beforehand.
  • For time series, split by time rather than randomly.
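
The workflow above can be sketched without any library. Everything here is a hypothetical stand-in chosen to keep the example self-contained: `fit_pipeline` bundles a standard scaler with a nearest-centroid classifier, `cross_validate` assigns folds by index, and `time_split` assumes each record starts with a sortable timestamp. In practice a library pipeline and cross-validation utility would play these roles.

```python
from statistics import mean, pstdev


def fit_pipeline(xs, ys):
    """Fit scaler and model together, using only the rows passed in."""
    mu, sigma = mean(xs), pstdev(xs)
    sigma = sigma if sigma > 0 else 1.0  # guard against a constant feature
    scaled = [(x - mu) / sigma for x in xs]
    # Nearest-centroid model: one mean per class in the scaled space.
    centroids = {
        label: mean([s for s, y in zip(scaled, ys) if y == label])
        for label in set(ys)
    }
    return mu, sigma, centroids


def score_pipeline(pipeline, xs, ys):
    """Accuracy on held-out rows, reusing statistics learned during fitting."""
    mu, sigma, centroids = pipeline
    hits = 0
    for x, y in zip(xs, ys):
        s = (x - mu) / sigma
        predicted = min(centroids, key=lambda label: abs(s - centroids[label]))
        hits += predicted == y
    return hits / len(xs)


def cross_validate(xs, ys, k):
    """K-fold cross-validation where all fitting happens inside each fold."""
    scores = []
    for fold in range(k):
        held_out = [i % k == fold for i in range(len(xs))]
        train_x = [x for x, h in zip(xs, held_out) if not h]
        train_y = [y for y, h in zip(ys, held_out) if not h]
        val_x = [x for x, h in zip(xs, held_out) if h]
        val_y = [y for y, h in zip(ys, held_out) if h]
        pipeline = fit_pipeline(train_x, train_y)  # never sees val_x or val_y
        scores.append(score_pipeline(pipeline, val_x, val_y))
    return scores


def time_split(records, cutoff):
    """Split (timestamp, ...) records so training data strictly precedes test data."""
    ordered = sorted(records, key=lambda record: record[0])
    train = [r for r in ordered if r[0] < cutoff]
    test = [r for r in ordered if r[0] >= cutoff]
    return train, test


# Hypothetical, well-separated toy data.
xs = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
print(cross_validate(xs, ys, k=4))

records = [(3, 0.4, 1), (1, 0.1, 0), (4, 0.5, 1), (2, 0.2, 0)]
train, test = time_split(records, cutoff=3)
print(train, test)
```

Because the scaler lives inside `fit_pipeline`, there is no way to fit it on held-out rows by accident, which is the main reason to bundle preprocessing with the model. For time series, the cutoff-based split replaces the index-based folds so that no training record is later than any test record.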

The telltale sign of leakage is validation performance that is suspiciously high or that does not survive in deployment.

Key idea

Leakage lets test information reach the model and inflates scores, so split first, fit transforms on training data only, and wrap everything in a pipeline with proper cross validation.

Check yourself

1. What is the safest ordering to avoid preprocessing leakage?

2. What is target leakage?

3. How should time series data be split to avoid temporal leakage?