The silent score inflater
Data contamination happens when benchmark test items, or their answers, appear in a model's training data. The model then recalls rather than reasons, and the benchmark measures memorization instead of capability. Scores look great and mean little.
How it sneaks in
- Public benchmarks get scraped into web crawls used for pretraining.
- Solutions and discussions of test problems are posted online.
- Synthetic data generated from a benchmark reintroduces it indirectly.
Because training corpora are enormous, contamination is easy to cause and hard to notice.
Detecting it
- N-gram overlap between training text and test items.
- Membership tests that check whether the model treats a test item as familiar.
- Perturbation gaps, where accuracy drops sharply on reworded variants, hinting at memorization.
No single check is conclusive, so evaluators combine evidence from several of them.
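The first and third checks above can be sketched in a few lines of Python. This is a minimal illustration, not a standard implementation: the function names `ngrams`, `overlap_fraction`, and `perturbation_gap`, the whitespace tokenization, and the default n-gram length of 8 are all assumptions chosen for the example. A real pipeline would normalize text more carefully and index the corpus rather than rebuilding its n-gram set on every call.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in lowercased, whitespace-split text."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, training_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in any training document.

    The default n=8 is an illustrative assumption, not a recommended setting.
    """
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


def perturbation_gap(original_correct: List[bool], reworded_correct: List[bool]) -> float:
    """Accuracy on the original items minus accuracy on their reworded variants.

    A large positive gap hints at memorization; it does not prove it.
    """
    if not original_correct or not reworded_correct:
        raise ValueError("both result lists must be non-empty")
    acc_original = sum(original_correct) / len(original_correct)
    acc_reworded = sum(reworded_correct) / len(reworded_correct)
    return acc_original - acc_reworded
```

A high overlap fraction flags an item for closer inspection, and a large perturbation gap across the whole benchmark suggests that the score depends on surface form. Neither number settles the question alone, which is why the two are read together with the membership tests.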
Guarding against it
Use held-out, freshly created test sets released after a model's training cutoff. Keep canonical answers out of public text, or watermark them. Report the training cutoff date and document decontamination steps. The most trustworthy result is a strong score on problems the model could not have seen.
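The cutoff rule can be applied mechanically when each test item carries a creation date. The sketch below assumes a hypothetical `BenchmarkItem` record with `item_id` and `created` fields and a hypothetical helper `items_after_cutoff`; none of these names come from an existing library.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class BenchmarkItem:
    """Hypothetical record for one test item; the field names are illustrative."""
    item_id: str
    created: date  # date the item was first written or released


def items_after_cutoff(items: List[BenchmarkItem], training_cutoff: date) -> List[BenchmarkItem]:
    """Keep only items created strictly after the model's training cutoff."""
    return [item for item in items if item.created > training_cutoff]
```

A recent creation date is only as good as the item's novelty: a problem recycled from an older source still counts as exposed, so the date filter complements the decontamination checks and does not replace them.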
Key idea
Data contamination turns benchmarks into memorization tests by leaking answers into training data, so trustworthy evaluation relies on fresh, held-out sets created after the training cutoff and on documented decontamination.