What leakage is
Data leakage occurs when training features contain information that would not be available at prediction time, or that derives from the target. It inflates offline scores, and performance then collapses in production.
- A feature computed after the outcome leaks the future.
- A feature derived from the label leaks the answer.
- Fitting preprocessing on the full dataset leaks test-set statistics across the split.
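The last bullet can be made concrete with a small sketch. It uses only the Python standard library; the helper names `fit_standardizer` and `transform` and the toy numbers are illustrative assumptions, not part of any particular library.

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Learn (mean, std) from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 if the variance is zero
    return mu, sigma


def transform(values, params):
    """Standardize values with previously learned (mean, std)."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Toy data (made up): the test rows hold an extreme value the training rows never saw.
train = [1.0, 2.0, 3.0, 4.0]
test = [5.0, 100.0]

leaky_params = fit_standardizer(train + test)  # wrong: test rows shape the statistics
clean_params = fit_standardizer(train)         # right: training rows only

train_leaky = transform(train, leaky_params)
train_clean = transform(train, clean_params)
test_clean = transform(test, clean_params)     # test is transformed, never fitted on
```

With the leaky parameters, the training features already carry the mean and spread of rows the model is supposed to meet for the first time at evaluation.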
How to hunt it
Suspiciously high accuracy is the first clue. Then trace each of the most important features back to its source and to the moment its value was recorded.
- Ask when each feature is actually known in the real timeline.
- Fit scalers and encoders on training data only, then apply to test.
- Watch for identifiers and timestamps that encode the target.
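The timeline question in the first bullet can be turned into a simple audit. The sketch below assumes you can record, for each feature, the time its value becomes known; the function name `features_known_too_late`, the feature names, and the dates are all hypothetical.

```python
from datetime import datetime


def features_known_too_late(feature_known_at, prediction_time):
    """Return the names of features that only become known after prediction time."""
    return sorted(
        name
        for name, known_at in feature_known_at.items()
        if known_at > prediction_time
    )


# Hypothetical record: the prediction is made at 09:00 on 1 March.
prediction_time = datetime(2024, 3, 1, 9, 0)
feature_known_at = {
    "account_age_days": datetime(2024, 3, 1, 0, 0),
    "prior_purchases": datetime(2024, 2, 28, 12, 0),
    "refund_issued": datetime(2024, 3, 5, 14, 0),  # recorded after the outcome
}

flagged = features_known_too_late(feature_known_at, prediction_time)
print(flagged)
```

Any feature this flags should be dropped or rebuilt from data available before the prediction moment.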
A leakage check
A model that looks too good to be true usually is.
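One possible check, sketched under assumptions: score each feature on its own against the label and flag any that separate the classes almost perfectly. The 0.95 threshold, the function names, and the toy columns are illustrative choices, not a standard. The AUC here is computed by plain pair counting, which is fine for small samples.

```python
def auc(scores, labels):
    """Chance that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def suspicious_features(columns, labels, threshold=0.95):
    """Flag features whose single-feature AUC is near 0 or 1 (assumed threshold)."""
    flagged = {}
    for name, values in columns.items():
        score = auc(values, labels)
        strength = max(score, 1.0 - score)  # ignore the direction of the relationship
        if strength >= threshold:
            flagged[name] = strength
    return flagged


# Toy data (made up): closure_code is a hypothetical field filled in only after churn.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "tenure_months": [3, 14, 8, 5, 12, 9],
    "closure_code": [0, 0, 0, 7, 7, 7],
}

flagged = suspicious_features(columns, labels)
print(flagged)
```

A flagged feature is not proof of leakage, since some honest features are simply strong, but it marks where to start tracing source and timing.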
Key idea
Data leakage lets future or target-derived information into features, inflating offline scores; hunt it by tracing each feature's real availability time and fitting preprocessing only on training data.