quiz vs the machine

Machine Learning

Eval Data Contamination

When test questions leak into training data, benchmark scores stop meaning anything.

6 min read · advanced

The silent score inflater

Data contamination happens when benchmark test items, or their answers, appear in a model's training data. The model can then recall answers rather than reason its way to them, so the benchmark measures memorization instead of capability. Scores look great and mean little.

How it sneaks in

  • Public benchmarks get scraped into web crawls used for pretraining.
  • Solutions and discussions of test problems are posted online.
  • Synthetic data generated from a benchmark reintroduces it indirectly.

Because training corpora are enormous, contamination is easy to cause and hard to notice.

Detecting it

  • N-gram overlap between training text and test items.
  • Membership inference tests that check whether the model treats a test item as familiar, for example by assigning it an unusually high likelihood.
  • Perturbation gaps, where accuracy drops sharply on reworded variants, hinting at memorization.
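
The first check in the list can be sketched in a few lines. This is a minimal illustration, assuming word-level n-grams over lowercased, whitespace-split text; the function names, the sample strings, and the choice of n are hypothetical, and real decontamination pipelines typically use a proper tokenizer and an index over the whole corpus.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, training_text: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the training text."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n tokens, so there is nothing to compare
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)


# Toy data (assumed for illustration): a scraped forum post quoting a test question.
training_text = "forum post: what is the capital of france ? the answer is paris ."
leaked = "What is the capital of France ?"
fresh = "Which river flows through the city of Lyon ?"

print(overlap_fraction(leaked, training_text, n=4))  # every 4-gram is found
print(overlap_fraction(fresh, training_text, n=4))   # no 4-gram is found
```

A high fraction flags an item for review. A low fraction does not clear it, because paraphrased or translated copies share few exact n-grams.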

No single check is conclusive, so evidence from several checks is combined.
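
A perturbation gap is just the difference between two accuracies, which the sketch below computes. It assumes you already have per-item correctness for the original benchmark and for reworded variants of the same items; the function names and the toy results are hypothetical, and the size of gap that counts as "sharp" is a judgment call.

```python
def accuracy(correct_flags: list[bool]) -> float:
    """Share of items answered correctly; 0.0 for an empty list."""
    return sum(correct_flags) / len(correct_flags) if correct_flags else 0.0


def perturbation_gap(original_correct: list[bool], reworded_correct: list[bool]) -> float:
    """Accuracy on original items minus accuracy on their reworded variants."""
    return accuracy(original_correct) - accuracy(reworded_correct)


# Toy results (assumed): the same five problems, before and after rewording.
original = [True, True, True, True, False]
reworded = [True, False, False, True, False]

gap = perturbation_gap(original, reworded)
print(f"original={accuracy(original):.2f} reworded={accuracy(reworded):.2f} gap={gap:.2f}")
```

A gap near zero is what you would expect from a model that solves the problems. A large positive gap hints that the original wording was memorized, though it can also reflect rewordings that made the items harder.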

Guarding against it

Use held-out, freshly created test sets released after a model's training cutoff. Keep canonical answers out of public text, or watermark them. Report the training cutoff date and document decontamination steps. The most trustworthy result is a strong score on problems the model could not have seen.
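
The cutoff rule can be applied mechanically when each test item carries a release date. The sketch below assumes exactly that; the `BenchmarkItem` type, its fields, and the dates are hypothetical, and a release date after the cutoff does not by itself rule out leakage through later fine-tuning data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BenchmarkItem:
    question: str
    released: date  # date the item first became public (assumed to be known)


def unseen_items(items: list[BenchmarkItem], training_cutoff: date) -> list[BenchmarkItem]:
    """Keep only items released strictly after the model's training cutoff."""
    return [item for item in items if item.released > training_cutoff]


# Toy data (assumed): one old item and one released after the cutoff.
items = [
    BenchmarkItem("old problem", date(2022, 5, 1)),
    BenchmarkItem("new problem", date(2024, 3, 15)),
]
cutoff = date(2023, 9, 30)

for item in unseen_items(items, cutoff):
    print(item.question, item.released.isoformat())
```

Scores reported only on the filtered subset are the ones that match the "could not have seen" standard described above.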

Key idea

Data contamination turns benchmarks into memorization tests by leaking answers into training data, so trustworthy evaluation relies on fresh, held-out sets created after the training cutoff and on documented decontamination.

Check yourself

1. What is eval data contamination?

2. Why does a large accuracy drop on reworded variants suggest contamination?

3. What is the most reliable guard against contamination?