Hunting Data Leakage

What leakage is

Data leakage occurs when training features contain information that would not be available at prediction time, or that is derived from the target. It inflates offline scores, and the model's performance then collapses in production.

A feature computed after the outcome leaks the future.
A feature derived from the label leaks the answer.
Preprocessing fitted on the full dataset leaks test-set statistics across the train/test split.
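
The first case, a feature recorded after the outcome, can be sketched as a timestamp comparison. This is a minimal illustration, not a complete audit: the feature names, the dates, and the helper find_future_features are all hypothetical, and it assumes you can attach a "known at" time to each feature.

```python
from datetime import datetime

# Hypothetical example: the moment at which the model would have to predict.
prediction_time = datetime(2024, 3, 1)

# Hypothetical metadata: when each feature value actually became known.
feature_known_at = {
    "account_age_days": datetime(2024, 2, 28),
    "last_login": datetime(2024, 2, 29),
    "refund_issued": datetime(2024, 3, 5),  # recorded after the outcome
}


def find_future_features(known_at, cutoff):
    """Return names of features whose values were not yet known at the cutoff."""
    return sorted(name for name, ts in known_at.items() if ts > cutoff)


leaky = find_future_features(feature_known_at, prediction_time)
print(leaky)
```

Any feature this returns would have to be dropped or recomputed using only data from before the prediction time.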

How to hunt it

Suspiciously high accuracy is the first clue. When you see it, trace each of the top features back to its source and to the time it was recorded.

Ask when each feature is actually known in the real timeline.
Fit scalers and encoders on training data only, then apply to test.
Watch for identifiers and timestamps that encode the target.
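
The rule about fitting on training data only can be illustrated with a hand-written standardizer. This is a sketch using only the Python standard library; fit_standardizer and apply_standardizer are hypothetical helpers, and the numbers are made up to make the effect visible.

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Learn the mean and standard deviation from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 if the variance is zero
    return mu, sigma


def apply_standardizer(values, params):
    """Scale values with previously learned parameters; nothing is refit."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


train = [1.0, 2.0, 3.0, 4.0]
test = [100.0]  # an extreme value that exists only in the test split

# Correct: the statistics come from the training rows only.
params = fit_standardizer(train)
train_scaled = apply_standardizer(train, params)
test_scaled = apply_standardizer(test, params)

# Leaky: the statistics are computed over train and test together,
# so the test row shifts the mean that the training data is scaled by.
leaky_params = fit_standardizer(train + test)

print("train-only mean:", params[0])
print("leaky mean:", leaky_params[0])
```

The gap between the two means shows how a single test row changes what the model sees during training when the scaler is fitted on everything.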

A leakage check

A model that looks too good to be true usually is.
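
One rough way to act on that suspicion is to score each feature against the label on its own and flag any that predicts it almost perfectly. The sketch below uses absolute Pearson correlation; the 0.95 threshold, the column names, and the helper flag_suspect_features are assumptions chosen for illustration, and a flagged feature is a prompt to investigate, not proof of leakage.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no signal
    return cov / sqrt(vx * vy)


def flag_suspect_features(columns, labels, threshold=0.95):
    """Return features whose absolute correlation with the label meets the threshold."""
    flagged = {}
    for name, values in columns.items():
        r = abs(pearson(values, labels))
        if r >= threshold:
            flagged[name] = r
    return flagged


# Hypothetical data: closed_flag is a copy of the label under another name.
labels = [0, 1, 0, 1, 1, 0]
columns = {
    "tenure_months": [3, 5, 8, 2, 7, 4],
    "closed_flag": [0, 1, 0, 1, 1, 0],
}

flagged = flag_suspect_features(columns, labels)
print(sorted(flagged))
```

A linear correlation will miss leaks that are nonlinear or spread across several features, so this kind of scan complements tracing each feature to its source rather than replacing it.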

Key idea

Data leakage lets future or target-derived information into features, inflating offline scores. Hunt it by tracing when each feature is really available and by fitting preprocessing only on training data.
