Overview
Production Readiness
0.5
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
16
Why It Matters For Business
Benchmark data that leaks into training can make models score better on paper while performing worse on real tasks; check training-benchmark overlap and report contamination before basing product decisions on benchmark results.
Summary TLDR
The paper shows that if evaluation data (training sets, test prompts, or test examples) leaks into model training, reported benchmark scores can jump dramatically without real capability gains. In controlled experiments, repeatedly training small LLMs on leaked benchmark data raised accuracy by tens of points on many tasks and let 1–3B models beat much larger models on benchmarks. Leakage also harms unrelated tasks (summarization, code) and reduces gains from later instruction tuning. The authors recommend systematic contamination checks (e.g., 13-gram overlap), publishing overlap reports, and averaging results across multiple prompts.
Problem Statement
Public benchmarks are used to claim LLM progress, but pretraining or fine-tuning can accidentally include benchmark data. That contamination inflates scores, breaks zero/few-shot assumptions, misleads comparisons and leaderboards, and can harm real-world performance when models are later adapted.
Main Contribution
Defined three practical leakage modes: training-set leakage, test-prompt leakage, and full test-set+prompt leakage.
Empirical study: continually training 1.3B–7B models on leaked benchmark data and measuring effects across MMLU, QA, reasoning, reading comprehension, summarization, and code tasks.
Observed side effects: inflated benchmark scores, worse performance on unrelated tasks, and reduced adaptation benefits from instruction tuning.
Practical checklist of recommendations for model developers and benchmark maintainers (data decontamination, published overlap and contamination reports, evaluation with multiple prompts).
Key Findings
Leaking training & test data greatly inflates benchmark scores.
Even small models can outperform much larger ones when leaked data is used.
Leakage can reduce performance on unrelated real tasks.
Leakage weakens later adaptation gains from instruction tuning.
Leaked test prompts alone provide a big advantage.
Results
Metrics reported: Accuracy, XSum (ROUGE-L), HumanEval (pass@10).
What To Try In 7 Days
Run a 13-gram overlap check between your pretraining/finetuning data and common benchmarks.
Require contamination-overlap statistics as part of model evaluation reports.
When comparing models, evaluate on at least three diverse benchmarks including generation and code tasks.
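The overlap check and the overlap statistics suggested above can be sketched as follows. This is a minimal illustration, not the paper's tooling: it assumes whitespace tokenization and lowercasing, holds all training n-grams in memory, and the function names (`ngrams`, `contamination_report`) are hypothetical. A real pretraining corpus would need a streaming or hashed index instead.

```python
from typing import Dict, Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    # Texts shorter than n tokens produce an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_report(
    train_docs: Iterable[str],
    benchmark_examples: List[str],
    n: int = 13,
) -> Dict[str, object]:
    """Flag benchmark examples that share at least one n-gram with training data."""
    # Assumption: the training n-grams fit in memory (toy-scale only).
    train_ngrams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)

    flagged = [
        i for i, example in enumerate(benchmark_examples)
        if ngrams(example, n) & train_ngrams
    ]
    total = len(benchmark_examples)
    return {
        "n": n,
        "total_examples": total,
        "flagged_indices": flagged,
        "contaminated_fraction": len(flagged) / total if total else 0.0,
    }
```

The returned dictionary is one possible shape for the overlap statistics attached to an evaluation report; the flagged indices let a reviewer inspect or exclude the affected examples.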
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Experiments use continual training on benchmarks rather than injecting contamination into full pretraining; full-pretraining effects may differ.
- Partial leakage (varying the proportion of leaked data) and leaks without labels were not explored, so the effect of leakage proportion is untested.
- No systematic measurement of contamination degrees between mainstream pretraining corpora and benchmarks.
When Not To Use
- Do not use benchmark scores to claim general LLM improvements when the overlap between training data and the evaluated benchmarks is unknown.
- Do not use benchmark scores as the sole evidence of model capability without cross-task validation.
Failure Modes
- Undetected data contamination leads to inflated benchmark scores and misleading model selection.
- Overfitting to benchmark style reduces performance on unrelated tasks and harms adaptation.
- Prompt-sensitive evaluations can be gamed if prompts leak into training.
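One way to reduce prompt sensitivity, in line with the recommendation to average results across multiple prompts, is sketched below. It is an illustration under assumptions: `predict` stands in for any model call that maps a prompt string to an answer string, the templates use a `{question}` placeholder, and scoring is exact match. None of these names come from the paper.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple


def accuracy_across_prompts(
    predict: Callable[[str], str],
    templates: Sequence[str],
    examples: Sequence[Tuple[str, str]],
) -> Dict[str, object]:
    """Score the same (question, answer) pairs under several prompt templates."""
    per_prompt: List[float] = []
    for template in templates:
        # Exact-match scoring; a real harness may need answer normalization.
        correct = sum(
            predict(template.format(question=question)) == answer
            for question, answer in examples
        )
        per_prompt.append(correct / len(examples))
    return {
        "per_prompt": per_prompt,
        "mean": mean(per_prompt),
        # A large spread suggests the score depends on prompt wording.
        "spread": max(per_prompt) - min(per_prompt),
    }
```

Reporting the spread next to the mean makes it visible when a model does unusually well on one specific template, which is the pattern a leaked prompt would be expected to produce.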
Core Entities
Models
- GPT-Neo-1.3B
- phi-1.5 (1.3B)
- OpenLLaMA-3B
- LLaMA-2-7B
- LLaMA-13B
- LLaMA-30B
- LLaMA-65B
Metrics
- Accuracy
- ROUGE-L
- pass@10
- zero-shot
- few-shot
Datasets
- MMLU
- BoolQ
- PIQA
- HellaSwag
- WinoGrande
- ARC-Easy
- ARC-Challenge
- OpenBookQA
- CommonsenseQA
- GSM8k
- AQuA
- RACE-Middle
- RACE-High
- CoQA
- CMRC2018
- C3-Dialog
- LAMBADA
- XSum
- HumanEval
Benchmarks
- MMLU
- Big-Bench
- AGIEval
- OpenCompass
- C-Eval

