Open audit and pipeline show 1%–46% test-set leakage and uneven score inflation across six popular benchmarks

Overview

Decision Snapshot: Ready For Pilot

The method is practical and reproducible using public Common Crawl data and search APIs, but it misses non-public training sources and incurs search API costs.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 30%

Authors

Yucheng Li, Frank Guerin, Chenghua Lin

Why It Matters For Business

Contaminated benchmarks can inflate model metrics and lead teams to pick models that simply memorised test examples rather than truly generalising.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist, Founder

Summary TLDR

The authors release an open pipeline that checks whether test questions appear verbatim in Common Crawl, using Bing search to find candidate pages and METEOR to score matches. They report contamination rates of 1.1%–45.8% across six multi-choice benchmarks (Winogrande, CommonsenseQA, HellaSwag, ARC, MMLU, C-Eval). Contamination has grown over time and can inflate accuracy on contaminated subsets (up to +14% on C-Eval and +7% on HellaSwag), but the effect varies by dataset and model size. Larger models tend to benefit more from contaminated test items. The pipeline and domain analysis are shared to help teams audit and reduce leakage.

Problem Statement

Benchmark test examples often leak into models' training data, allowing memorisation to boost reported scores and misleading comparisons. Existing contamination checks are incomplete and opaque. The paper builds an open, reproducible method to audit contamination across models and benchmarks and to measure how leakage changes reported performance.

Main Contribution

An open-source pipeline that uses Bing search + Common Crawl index and METEOR matching to flag contaminated test examples.

A contamination audit across 15+ LLMs and six multi-choice QA benchmarks, reporting per-dataset contamination rates and growth over time.
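The matching step can be illustrated roughly as follows. This is a minimal sketch, not the authors' pipeline: it substitutes a plain unigram F1 score for METEOR (no stemming, synonym matching, or order penalty), and the function names, the sliding-window approach, and the 0.75 threshold are all illustrative assumptions. It assumes the candidate page text has already been retrieved.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def unigram_f1(reference, candidate):
    """Unigram F1 overlap between two token lists (a crude stand-in for METEOR)."""
    ref, cand = Counter(reference), Counter(candidate)
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_window_score(question, page_text):
    """Slide a question-length window over the page and keep the best score."""
    q = tokens(question)
    page = tokens(page_text)
    if not q or not page:
        return 0.0
    width = len(q)
    last_start = max(len(page) - width, 0)
    return max(unigram_f1(q, page[i:i + width]) for i in range(last_start + 1))


def is_contaminated(question, page_text, threshold=0.75):
    """Flag the question if some window of the page matches it closely enough.

    The 0.75 threshold is a hypothetical value chosen for illustration.
    """
    return best_window_score(question, page_text) >= threshold
```

For example, `is_contaminated("What is the capital of France?", "Quiz answers: what is the capital of France? Paris.")` flags the question, because one window of the page reproduces it exactly.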

Key Findings

Contamination varies strongly by benchmark.

Numbers: C-Eval 45.8%; MMLU 29.1%; ARC 28.7%; HellaSwag 12.4%; CommonsenseQA 1.6%; Winogrande 1.1%

Practical Use: Audit each benchmark before use; don't assume academic/OCR-sourced sets are safe.

Evidence Ref: Table 1, §5

Contamination increased over recent years.

Numbers: Up to +21% increase for some academic benchmarks (2017-2020 → 2020-2023)

Practical Use: Blocklists and one-time filters become stale; schedule regular re-checks.

Evidence Ref: Figure 2, §5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Contamination rate	45.8% (C-Eval dev)	—	—	C-Eval (dev)	Table 1: 616/1346 examples flagged in Common Crawl	Table 1
Contamination rate	29.1% (MMLU test)	—	—	MMLU (test)	Table 1: 4077/13987 examples flagged in Common Crawl	Table 1
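The rates in the table follow directly from the flagged counts. A short check of that arithmetic, using only the counts quoted above (the variable and function names are illustrative):

```python
# (flagged examples, total examples) as quoted from Table 1 in the rows above
flagged_counts = {
    "C-Eval (dev)": (616, 1346),
    "MMLU (test)": (4077, 13987),
}


def contamination_rate(flagged, total):
    """Percentage of examples flagged as present in Common Crawl."""
    return 100.0 * flagged / total


rates = {
    name: round(contamination_rate(flagged, total), 1)
    for name, (flagged, total) in flagged_counts.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```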

What To Try In 7 Days

Run the open pipeline on your evaluation sets to flag input and input+label leakage.

Recompute model comparisons on clean-only subsets before product decisions.

Block or monitor a short list of high-leakage domains when assembling training data.
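Recomputing a comparison on clean-only subsets could look roughly like this. It is a sketch under assumptions: each evaluation record is taken to be a dict with hypothetical `correct` and `contaminated` boolean fields (the latter coming from whatever contamination check was run), and the toy records are invented purely to show the shape of the data, not results from the paper.

```python
def accuracy(records):
    """Fraction of records answered correctly, or None for an empty subset."""
    if not records:
        return None
    return sum(1 for r in records if r["correct"]) / len(records)


def split_accuracy(records):
    """Report accuracy on all, clean-only, and contaminated-only examples."""
    clean = [r for r in records if not r["contaminated"]]
    dirty = [r for r in records if r["contaminated"]]
    return {
        "all": accuracy(records),
        "clean": accuracy(clean),
        "contaminated": accuracy(dirty),
    }


# Invented toy records, for illustration only.
toy_records = [
    {"correct": True, "contaminated": False},
    {"correct": False, "contaminated": False},
    {"correct": True, "contaminated": True},
    {"correct": True, "contaminated": True},
]

report = split_accuracy(toy_records)
print(report)
```

A large gap between the `clean` and `contaminated` figures for one model, but not another, is a hint that the headline comparison may partly reflect memorised test items.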

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Method relies on Bing + Common Crawl; it misses training data not present in Common Crawl or not indexed by search engines.

Search API costs and query-length limits restrict scale and prevent long-passage benchmarks from being checked.

When Not To Use

Evaluating benchmarks with very long passages that exceed search API query limits.

Auditing models trained largely on private or proprietary corpora not in Common Crawl.

Failure Modes

False negatives if training data copies are paraphrased beyond METEOR threshold.

False positives if matched pages quote test content out of context (mitigated but not eliminated by windowing and order penalty).

Core Entities

Models

LLaMA, Llama-2, Yi, Mistral, Baichuan, Qwen, Mistral-FT, Llama-2 Chat, Llama-2 Chat 70B, Yi 6B, Qwen 7B, Baichuan2 7B

Metrics

Accuracy, METEOR, perplexity

Datasets

MMLU, C-Eval, Winogrande, CommonsenseQA, HellaSwag, ARC

Benchmarks

MMLU, C-Eval, Winogrande, CommonsenseQA, HellaSwag, ARC
