Overview
Production Readiness
0.6
Novelty Score
0.3
Cost Impact Score
0.4
Citation Count
3
Why It Matters For Business
Contaminated benchmarks can inflate model metrics and lead teams to pick models that simply memorised test examples rather than truly generalising.
Summary TLDR
The authors release an open pipeline that checks whether test questions appear verbatim or near-verbatim in Common Crawl, using Bing search and METEOR matching. They report contamination rates of 1.1%–45.8% across six multiple-choice benchmarks (Winogrande, CommonsenseQA, HellaSwag, ARC, MMLU, C-Eval). Contamination has grown over time and can inflate accuracy on contaminated subsets (up to +14% on C-Eval and +7% on HellaSwag), but the effect varies by dataset and model size. Larger models tend to benefit more from contaminated test items. The pipeline and a domain-level analysis are shared to help teams audit and reduce leakage.
Problem Statement
Benchmark test examples often leak into models' training data, allowing memorisation to boost reported scores and misleading comparisons. Existing contamination checks are incomplete and opaque. The paper builds an open, reproducible method to audit contamination across models and benchmarks and to measure how leakage changes reported performance.
Main Contribution
An open-source pipeline that uses Bing search, the Common Crawl index, and METEOR matching to flag contaminated test examples.
A contamination audit across 15+ LLMs and six multi-choice QA benchmarks, reporting per-dataset contamination rates and growth over time.
Analysis of how contamination affects accuracy, showing dataset-dependent effects and that larger models often gain more.
A domain-level analysis showing that contamination concentrates in a few sites, plus a discussion of mitigation trade-offs.
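The matching step can be illustrated with a minimal sketch. This is not the authors' code: it is a simplified, exact-match-only stand-in for METEOR (no stemming or synonym matching), using the classic defaults alpha=0.9, beta=3, gamma=0.5. The sliding window over the page text and the function names are assumptions made for illustration.

```python
def tokenize(text):
    # Lowercased whitespace tokens; real METEOR also matches stems and synonyms.
    return text.lower().split()


def align(candidate, reference):
    """Greedy exact-match alignment: list of (candidate_idx, reference_idx)."""
    used = set()
    pairs = []
    for i, tok in enumerate(candidate):
        for j, ref_tok in enumerate(reference):
            if j not in used and tok == ref_tok:
                used.add(j)
                pairs.append((i, j))
                break
    return pairs


def count_chunks(pairs):
    """Count runs of matches that are adjacent and in order in both texts."""
    chunks = 0
    prev = None
    for i, j in pairs:
        if prev is None or i != prev[0] + 1 or j != prev[1] + 1:
            chunks += 1
        prev = (i, j)
    return chunks


def meteor_like(candidate_text, reference_text, alpha=0.9, beta=3.0, gamma=0.5):
    """Recall-weighted F-mean times (1 - fragmentation penalty)."""
    cand = tokenize(candidate_text)
    ref = tokenize(reference_text)
    pairs = align(cand, ref)
    m = len(pairs)
    if m == 0:
        return 0.0
    precision = m / len(cand)
    recall = m / len(ref)
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # The order penalty grows as matches scatter into more chunks.
    penalty = gamma * (count_chunks(pairs) / m) ** beta
    return f_mean * (1 - penalty)


def best_window_score(question, page_text):
    """Best score of the question against any question-length window of the page."""
    width = len(tokenize(question))
    page = tokenize(page_text)
    best = 0.0
    for start in range(0, max(1, len(page) - width + 1)):
        window = " ".join(page[start:start + width])
        best = max(best, meteor_like(window, question))
    return best
```

A test item would then be flagged when `best_window_score` exceeds a chosen threshold. The threshold is a tuning decision that trades false positives against false negatives, and no specific value is implied here.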
Key Findings
Contamination varies strongly by benchmark.
Contamination increased over recent years.
Contamination can inflate accuracy but effects are dataset-dependent.
Larger models tend to benefit more from contaminated data.
Most contamination is input-and-label (question+answer) rather than input-only.
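The distinction in the last finding, input-only versus input-and-label leakage, can be sketched as follows. As a simplifying assumption, this uses normalised substring containment in place of METEOR matching; the function names and category labels are hypothetical.

```python
import re


def normalize(text):
    # Lowercase and keep only word tokens, joined by single spaces.
    return " ".join(re.findall(r"\w+", text.lower()))


def classify_leak(question, answer, page_text):
    """Return 'clean', 'input-only', or 'input-and-label' for one page."""
    # Pad with spaces so matches respect word boundaries.
    page = f" {normalize(page_text)} "
    q = normalize(question)
    a = normalize(answer)
    if not q or f" {q} " not in page:
        return "clean"
    if a and f" {a} " in page:
        return "input-and-label"
    return "input-only"
```

For multiple-choice items, short answer strings can appear on a page by coincidence, so a real audit would likely need a stricter answer check than this one.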
Results
Contamination rate
Accuracy
Temporal increase in contamination
Who Should Care
What To Try In 7 Days
Run the open pipeline on your evaluation sets to flag input and input+label leakage.
Recompute model comparisons on clean-only subsets before product decisions.
Block or monitor a short list of high-leakage domains when assembling training data.
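Recomputing comparisons on clean-only subsets can be sketched as below. The record layout (`correct` and `contaminated` flags per test item) is an assumption for illustration, not the pipeline's output format.

```python
def accuracy(rows):
    # Fraction of correct items; NaN when the subset is empty.
    if not rows:
        return float("nan")
    return sum(1 for r in rows if r["correct"]) / len(rows)


def split_accuracy(results):
    """Accuracy on all items, on the clean subset, and on the contaminated subset."""
    clean = [r for r in results if not r["contaminated"]]
    contaminated = [r for r in results if r["contaminated"]]
    return {
        "all": accuracy(results),
        "clean": accuracy(clean),
        "contaminated": accuracy(contaminated),
    }


# Made-up records, for illustration only.
example = [
    {"correct": True, "contaminated": True},
    {"correct": True, "contaminated": True},
    {"correct": True, "contaminated": False},
    {"correct": False, "contaminated": False},
]
scores = split_accuracy(example)
```

Comparing models on the `clean` figure rather than `all` removes the part of the score that may come from memorised items.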
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- The method relies on Bing and Common Crawl, so it misses training data that is absent from Common Crawl or not indexed by search engines.
- Search API costs and query-length limits restrict scale and prevent long-passage benchmarks from being checked.
- METEOR-based matching and thresholds can still yield false positives and false negatives.
When Not To Use
- Evaluating benchmarks with very long passages that exceed search API query limits.
- Auditing models trained largely on private or proprietary corpora not in Common Crawl.
- When you need exact provenance of every training token rather than likely overlap signals.
Failure Modes
- False negatives if copies in the training data are paraphrased beyond the METEOR threshold.
- False positives if matched pages quote test content out of context (mitigated but not eliminated by windowing and the order penalty).
- Blocklists become ineffective as leaked content migrates to new domains over time.
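Because leaked content can move between domains, a blocklist is probably best rebuilt from fresh match results rather than fixed once. A minimal sketch, assuming the audit yields a list of matched page URLs (the URLs and function names below are placeholders):

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    # Host part of the URL, lowercased.
    return urlparse(url).netloc.lower()


def top_domains(matched_urls, k=3):
    """Most frequent domains among pages that matched test items."""
    return Counter(domain_of(u) for u in matched_urls).most_common(k)


def filter_blocked(urls, blocklist):
    """Drop URLs whose domain is in the blocklist."""
    blocked = {d.lower() for d in blocklist}
    return [u for u in urls if domain_of(u) not in blocked]


# Placeholder URLs, for illustration only.
matches = [
    "https://example.com/quiz/1",
    "https://example.com/quiz/2",
    "https://example.org/page",
]
ranked = top_domains(matches, k=1)
kept = filter_blocked(matches, ["example.com"])
```

Re-running `top_domains` on each new audit shows whether the concentration of leaked items has shifted to sites not yet on the list.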
Core Entities
Models
- LLaMA
- Llama-2
- Yi
- Mistral
- Baichuan
- Qwen
- Mistral-FT
- Llama-2 Chat
- Llama-2 Chat 70B
- Yi 6B
- Qwen 7B
- Baichuan2 7B
Metrics
- Accuracy
- METEOR
- perplexity
Datasets
- MMLU
- C-Eval
- Winogrande
- CommonsenseQA
- HellaSwag
- ARC
Benchmarks
- MMLU
- C-Eval
- Winogrande
- CommonsenseQA
- HellaSwag
- ARC

