Eval Data Contamination

When test questions leak into training data, benchmark scores stop meaning anything.

The silent score inflater

Data contamination happens when benchmark test items, or their answers, appear in a model's training data. The model then recalls rather than reasons, and the benchmark measures memorization instead of capability. Scores look great and mean little.

How it sneaks in

Public benchmarks get scraped into web crawls used for pretraining.
Solutions and discussions of test problems are posted online.
Synthetic data generated from a benchmark reintroduces it indirectly.

Because training corpora are enormous, contamination is easy to cause and hard to notice.

Detecting it

N-gram overlap between training text and test items.
Membership tests that check whether the model treats a test item as familiar.
Perturbation gaps, where accuracy drops sharply on reworded variants, hinting at memorization.
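As a rough illustration of the first check, the sketch below measures what fraction of a test item's word n-grams also occur in a set of training documents. The function names, the whitespace tokenization, and the default n of 8 are assumptions made for this example, not a standard; real pipelines typically normalize text more carefully and index the corpus rather than scanning it.

```python
def ngrams(text, n=8):
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens, giving an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item, training_docs, n=8):
    """Fraction of the test item's n-grams that also appear in any training doc."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item too short to form a single n-gram
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(item_grams & train_grams) / len(item_grams)
```

A high fraction suggests the item may have been copied into the corpus. A low fraction does not rule out contamination, since paraphrased or translated copies share few exact n-grams.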

No single check is conclusive, so evidence from several checks should be combined.
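One of the signals listed above, the perturbation gap, reduces to simple arithmetic: accuracy on the original items minus accuracy on reworded variants of the same items. The sketch below assumes predictions and gold answers are given as parallel lists scored by exact match; the function names are hypothetical, and what counts as a "sharp" drop is a judgment call this example leaves to the reader.

```python
def accuracy(predictions, answers):
    """Exact-match accuracy over two parallel lists."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def perturbation_gap(original_preds, reworded_preds, answers):
    """Accuracy on original items minus accuracy on their reworded variants.

    Assumes rewording preserves each item's gold answer.
    """
    return accuracy(original_preds, answers) - accuracy(reworded_preds, answers)
```

A gap near zero is what one would expect from a model that solves the problems. A large positive gap hints at memorization, though it can also reflect rewordings that made the items harder.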

Guarding against it

Use held-out and freshly created test sets released after a model's training cutoff. Keep canonical answers out of public text, or watermark them. Report the training cutoff date and document decontamination steps. The most trustworthy result is a strong score on problems the model could not have seen.
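The first safeguard can be sketched as a date filter that keeps only test items released after the reported training cutoff. The item layout (a dict with a "released" date) and the function name are assumptions made for this example. A late release date is only as reliable as the reported cutoff, and it does not help if the item reuses older material.

```python
from datetime import date


def fresh_items(items, training_cutoff):
    """Keep items whose release date is strictly after the training cutoff.

    Each item is assumed to be a dict with a "released" datetime.date field.
    """
    return [item for item in items if item["released"] > training_cutoff]


# Hypothetical items and cutoff, for illustration only.
items = [
    {"id": "q1", "released": date(2023, 1, 15)},
    {"id": "q2", "released": date(2024, 6, 1)},
]
kept = fresh_items(items, training_cutoff=date(2023, 12, 31))
```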

Key idea