← Lessons

quiz vs the machine

Platinum1750

Machine Learning

Data leakage: the silent killer

Why your 99% accuracy is probably a bug.

6 min read · advanced · beat Platinum to climb

What leakage is

Data leakage is when information from outside the training set sneaks into the model — so it learns something it won't have at prediction time. The result: amazing offline metrics that collapse in production.

Classic sources

  • Preprocessing before splitting — fitting a scaler/imputer on the whole dataset leaks test statistics.
  • Target leakage — a feature that's a proxy for the label (e.g. "account_closed_date" predicting churn).
  • Temporal leakage — training on future data to predict the past.

The fix

Split first, then fit every transform on the training fold only and apply it to validation/test. Use pipelines so this is automatic, and always ask: "would this feature actually be available at prediction time?"

Key idea

If a result looks too good, suspect leakage before you celebrate.

Check yourself

Answer to earn rating on the learn ladder.

1. You fit your StandardScaler on the entire dataset, then split. What's wrong?

2. A churn model uses 'account_closed_date' as a feature. This is…