Data leakage: the silent killer

What leakage is

Data leakage happens when the model trains on information it won't have at prediction time, whether that information comes from the test set, from the label itself, or from the future. The result: amazing offline metrics that collapse in production.

Classic sources

Preprocessing before splitting: fitting a scaler or imputer on the whole dataset leaks test-set statistics into training.
Target leakage: a feature that is only known once the outcome has happened, so it acts as a proxy for the label (e.g. "account_closed_date" when predicting churn).
Temporal leakage: training on data from after the point being predicted, i.e. using the future to predict the past.
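
To make the temporal case concrete, here is a minimal sketch of a chronological split using only the standard library. The helper name time_split, the "date" and "churned" keys, and the sample rows are all hypothetical, chosen for illustration; the point is only that every training row predates every test row.

```python
from datetime import date

def time_split(rows, cutoff, key="date"):
    """Split rows into (train, test): train is strictly before the cutoff."""
    train = [r for r in rows if r[key] < cutoff]
    test = [r for r in rows if r[key] >= cutoff]
    return train, test

# Hypothetical records; the field names are assumptions, not a real schema.
rows = [
    {"date": date(2024, 1, 5), "churned": 0},
    {"date": date(2024, 2, 9), "churned": 1},
    {"date": date(2024, 3, 2), "churned": 0},
    {"date": date(2024, 4, 20), "churned": 1},
]

train, test = time_split(rows, cutoff=date(2024, 3, 1))

# No training row comes from the test period or later.
latest_train = max(r["date"] for r in train)
earliest_test = min(r["date"] for r in test)
```

A random shuffle of the same rows could put the April record in training and the January record in test, which is exactly the "future predicting the past" situation described above.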

The fix

Split first, then fit every transform on the training fold only and apply it to validation/test. Use pipelines so this is automatic, and always ask: "would this feature actually be available at prediction time?"
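
The split-then-fit order can be sketched with a tiny hand-rolled scaler. This is an illustration under stated assumptions, not a library API: the Standardizer class and the sample values are hypothetical, and the last two rows are simply held out as the test set. Pipeline classes such as scikit-learn's Pipeline package the same fit-on-train, apply-to-test pattern.

```python
import statistics

class Standardizer:
    """Toy scaler (hypothetical): learns mean/std in fit, applies them in transform."""

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        # Fall back to 1.0 so a constant column does not divide by zero.
        self.std_ = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

# Hypothetical feature values; the held-out rows happen to be much larger.
values = [3.0, 7.0, 5.0, 9.0, 4.0, 6.0, 8.0, 2.0, 20.0, 30.0]

# 1. Split first.
train, test = values[:8], values[8:]

# 2. Fit on the training fold only, then apply to both folds.
scaler = Standardizer().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# For contrast: the leaky version fits on everything, test rows included.
leaky_scaler = Standardizer().fit(values)
```

Here the training-only mean is 5.5 while the leaky mean is 9.4: the leaky scaler has already absorbed the two held-out rows before the model is ever evaluated on them.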

Key idea

If a result looks too good, suspect leakage before you celebrate.
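
One cheap way to act on that suspicion is to check whether any single feature tracks the label almost perfectly. The sketch below is a rough heuristic, not a complete leakage test: the function names, the 0.95 threshold, and the sample columns (including "has_account_closed_date", derived from the churn example above) are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def suspicious_features(columns, labels, threshold=0.95):
    """Names of features whose correlation with the label is suspiciously high."""
    return [
        name
        for name, vals in columns.items()
        if abs(pearson(vals, labels)) >= threshold
    ]

# Hypothetical churn data: 1 = churned.
labels = [0, 1, 0, 1, 1, 0]
columns = {
    "monthly_spend": [20.0, 35.0, 50.0, 25.0, 40.0, 30.0],
    # Only populated after the customer has already left: a label proxy.
    "has_account_closed_date": [0, 1, 0, 1, 1, 0],
}

flagged = suspicious_features(columns, labels)
```

A flagged feature is a prompt to ask the availability question, not proof of leakage; a genuinely strong predictor can trip the same check, and leakage spread across several features will not.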

Check yourself