Machine Learning

Hunting Data Leakage

Find information that sneaks from the future or the target into your features.

What leakage is

Data leakage occurs when training features contain information that would not be available at prediction time, or that is derived from the target. It inflates offline scores, and the model's performance then collapses in production, where that information is missing.

  • A feature computed after the outcome leaks the future.
  • A feature derived from the label leaks the answer.
  • Fitting preprocessing on the full dataset leaks across the split.
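
To make the first bullet concrete, here is a minimal sketch of a feature computed over a whole event log versus the same feature restricted to what was known at prediction time. The event log, the event names, and the `count_events` helper are hypothetical, invented only for illustration.

```python
from datetime import datetime

# Hypothetical event log for one customer: (timestamp, event_type).
events = [
    (datetime(2024, 1, 5), "support_ticket"),
    (datetime(2024, 2, 10), "support_ticket"),
    (datetime(2024, 3, 20), "cancellation_call"),  # occurs after prediction time
]


def count_events(events, as_of=None):
    """Count events; if `as_of` is given, count only those strictly before it."""
    return sum(1 for ts, _ in events if as_of is None or ts < as_of)


prediction_time = datetime(2024, 3, 1)

leaky_count = count_events(events)                        # sees the whole log
safe_count = count_events(events, as_of=prediction_time)  # point-in-time view

print(leaky_count, safe_count)  # 3 2
```

The leaky count includes an event the model could not have seen when the prediction was made, so it quietly carries information from the future.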

How to hunt it

Suspiciously high accuracy is the first clue. Then trace each top feature to its source and timing.

  • Ask when each feature is actually known in the real timeline.
  • Fit scalers and encoders on training data only, then apply to test.
  • Watch for identifiers and timestamps that encode the target.
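
The second bullet can be sketched with a hand-rolled standardizer, so that it is clear which rows the statistics come from. The numbers and the `fit_scaler` and `transform` helpers are assumptions made for this example, not part of any library.

```python
from statistics import mean, pstdev

# Toy values; the test rows deliberately come from a different range.
train = [10.0, 12.0, 14.0, 16.0]
test = [100.0, 120.0]


def fit_scaler(values):
    """Learn the mean and population standard deviation from `values`."""
    return mean(values), pstdev(values)


def transform(values, params):
    """Standardize `values` with previously learned parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Leaky: the statistics are computed with the test rows included.
leaky_params = fit_scaler(train + test)

# Safe: the statistics come from the training rows only, then get reused.
safe_params = fit_scaler(train)
train_scaled = transform(train, safe_params)
test_scaled = transform(test, safe_params)

print("leaky mean:", leaky_params[0])
print("safe mean:", safe_params[0])
```

In the leaky version the test rows shift the mean and spread applied to the training data, so the model indirectly sees the test distribution. In scikit-learn, the equivalent habit is to call a scaler's `fit` on the training split and `transform` on both splits.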

A leakage check

A model that looks too good usually is.
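
One rough way to run such a check is to score each feature on its own against the label and flag any that separate the classes almost perfectly. This is a sketch under stated assumptions: the feature names and data are made up, the 0.95 threshold is arbitrary, and a flagged feature is only a candidate for manual tracing, not proof of leakage.

```python
def auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical churn data: label 1 means the customer churned.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "tenure_months": [3, 24, 12, 6, 30, 9],
    "refund_issued": [0, 0, 0, 1, 1, 1],  # recorded only after churn happens
}

THRESHOLD = 0.95  # arbitrary cut-off chosen for this sketch

suspects = []
for name, column in features.items():
    score = auc(column, labels)
    # A feature can predict in either direction, so compare with 1 - score too.
    if max(score, 1 - score) >= THRESHOLD:
        suspects.append(name)

print(suspects)  # ['refund_issued']
```

Each suspect should then be traced back to its source and timing before deciding whether to drop it.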

Key idea

Data leakage lets future- or target-derived information into features, inflating offline scores. Hunt it by tracing when each feature is actually available and by fitting preprocessing on training data only.

Check yourself

1. What defines data leakage?

2. Why must scalers and encoders be fit on training data only?

3. What is a common first clue of leakage?