Overview
The paper reports controlled continual-training experiments across multiple models and benchmarks that show consistent score inflation and negative side effects on unrelated tasks. The evidence is strong for the settings studied but does not cover contamination introduced during full pretraining.
Citations: 16
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 40%
Why It Matters For Business
Contaminated training data can make models look better on paper but worse in real tasks; check overlap and report contamination to avoid bad product decisions.
Who Should Care
Summary TLDR
The paper shows that if evaluation data (training sets, test prompts, or test examples) leaks into model training, reported benchmark scores can jump dramatically without real capability gains. In controlled experiments, repeatedly training small LLMs on leaked benchmark data raised accuracy by tens of points on many tasks and let 1–3B models beat much larger models on benchmarks. Leakage also harms unrelated tasks (summarization, code) and reduces gains from later instruction tuning. The authors recommend systematic contamination checks (e.g., 13-gram overlap), publishing overlap reports, and averaging results across multiple prompts.
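The recommendation to average results across multiple prompts can be sketched as follows. This is a minimal illustration, not the authors' evaluation harness: `predict`, the `{question}` template placeholder, exact-match scoring, and the function names `accuracy` and `multi_prompt_score` are all assumptions introduced here.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (question, expected answer)


def accuracy(predict: Callable[[str], str], template: str,
             examples: List[Example]) -> float:
    """Exact-match accuracy of `predict` under one prompt template."""
    correct = sum(
        predict(template.format(question=question)) == answer
        for question, answer in examples
    )
    return correct / len(examples)


def multi_prompt_score(predict: Callable[[str], str], templates: List[str],
                       examples: List[Example]) -> Dict[str, object]:
    """Score the same examples under several templates and summarize.

    A large spread between the best and worst template suggests the model
    depends on one specific prompt format, which may be worth investigating
    as a possible sign of test-prompt leakage.
    """
    scores = [accuracy(predict, t, examples) for t in templates]
    return {
        "per_prompt": scores,
        "mean": mean(scores),
        "spread": max(scores) - min(scores),
    }
```

Reporting the mean together with the spread, rather than the single best prompt, keeps one memorized prompt format from dominating the headline number.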
Problem Statement
Public benchmarks are used to claim LLM progress, but pretraining or fine-tuning can accidentally include benchmark data. That contamination inflates scores, breaks zero/few-shot assumptions, misleads comparisons and leaderboards, and can harm real-world performance when models are later adapted.
Main Contribution
Defined three practical leakage modes: training-set leakage, test-prompt leakage, and full test-set+prompt leakage.
Empirical study: continually training 1.3B–7B models on leaked benchmark data and measuring effects across MMLU, QA, reasoning, reading comprehension, summarization, and code tasks.
Key Findings
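Leaking training and test data greatly inflates benchmark scores: on MMLU, phi-1.5 gains 32.18 points and LLaMA-2 (7B) gains 53.39 points when trained on all leaked train and test data (Table 1).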
Even small models can outperform much larger ones when leaked data is used.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | phi-1.5: 42.87 -> 75.05 (None -> All Train+Test P&S) | 42.87 (phi-1.5 none) | +32.18 | MMLU | Table 1 row for phi-1.5 | Table 1 |
| Accuracy | LLaMA-2 (7B): 42.95 -> 96.34 (None -> All Train+Test P&S) | 42.95 (LLaMA-2 none) | +53.39 | MMLU | Table 1 row for LLaMA-2 7B | Table 1 |
What To Try In 7 Days
Run a 13-gram overlap check between your pretraining/finetuning data and common benchmarks.
Require contamination-overlap statistics as part of model evaluation reports.
When comparing models, evaluate on at least three diverse benchmarks including generation and code tasks.
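The first two steps above (a 13-gram overlap check and an overlap report) might look roughly like the sketch below. It assumes lowercased whitespace tokenization and an in-memory corpus, neither of which is specified in the source; the names `ngrams`, `build_index`, and `contamination_report` are hypothetical. A real pretraining corpus would need a streaming or hashed index instead of a Python set.

```python
from typing import Dict, Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 13) -> Set[NGram]:
    """All n-grams of a text (assumption: lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus: Iterable[str], n: int = 13) -> Set[NGram]:
    """Collect every n-gram that occurs anywhere in the training corpus."""
    index: Set[NGram] = set()
    for document in corpus:
        index |= ngrams(document, n)
    return index


def contamination_report(benchmark: List[str], corpus: Iterable[str],
                         n: int = 13) -> Dict[str, object]:
    """Flag benchmark examples sharing at least one n-gram with the corpus."""
    index = build_index(corpus, n)
    flagged = [i for i, example in enumerate(benchmark)
               if ngrams(example, n) & index]
    total = len(benchmark)
    return {
        "n": n,
        "examples": total,
        "flagged": len(flagged),
        "flagged_fraction": len(flagged) / total if total else 0.0,
        "flagged_ids": flagged,
    }
```

Examples shorter than `n` tokens produce no n-grams and are never flagged, so a report built this way understates overlap for short items; a smaller `n` or a separate exact-match check may be needed for them.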
Risks & Boundaries
Limitations
Experiments use continual training on benchmarks rather than injecting contamination into full pretraining; full-pretraining effects may differ.
The study did not explore partial-leakage fractions or label-less leaks, so the effect of leakage proportion is untested.
When Not To Use
Do not use benchmark scores to claim general LLM improvements when the overlap between training data and the evaluated benchmarks is unknown.
Do not use benchmark scores as the sole evidence of model capability without cross-task validation.
Failure Modes
Undetected data contamination leads to inflated benchmark scores and misleading model selection.
Overfitting to benchmark style reduces performance on unrelated tasks and harms adaptation.

