Open audit and pipeline show 1%–46% test-set leakage and uneven score inflation across six popular benchmarks

October 26, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is practical and reproducible using public Common Crawl data and search APIs, but it misses non-public training sources and incurs search API costs.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 30%

Authors

Yucheng Li, Frank Guerin, Chenghua Lin

Why It Matters For Business

Contaminated benchmarks can inflate model metrics and lead teams to pick models that simply memorised test examples rather than truly generalising.

Who Should Care

Summary TLDR

The authors release an open pipeline that checks whether test questions appear verbatim or near-verbatim in Common Crawl, using Bing search and METEOR matching. They report contamination rates of 1.1%–45.8% across six multi-choice benchmarks (Winogrande, CommonsenseQA, HellaSwag, ARC, MMLU, C-Eval). Contamination has grown over time and can inflate accuracy (up to +14% on C-Eval and +7% on HellaSwag on contaminated subsets), but effects vary by dataset and model size. Larger models tend to benefit more from contaminated test items. The pipeline and domain analysis are shared to help teams audit and reduce leakage.

Problem Statement

Benchmark test examples often leak into models' training data, allowing memorisation to boost reported scores and misleading comparisons. Existing contamination checks are incomplete and opaque. The paper builds an open, reproducible method to audit contamination across models and benchmarks and to measure how leakage changes reported performance.

Main Contribution

An open-source pipeline that uses Bing search, the Common Crawl index, and METEOR matching to flag contaminated test examples.

A contamination audit across 15+ LLMs and six multi-choice QA benchmarks, reporting per-dataset contamination rates and growth over time.
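The matching step can be illustrated with a minimal sketch. It is not the authors' pipeline: real METEOR also uses stemming, synonym matching, and a word-order penalty, whereas this stand-in scores plain token-overlap F1 over a sliding window of page text. The function names and the 0.75 threshold are assumptions chosen for illustration, and the example strings are made up.

```python
import re
from collections import Counter

# Hypothetical cut-off for illustration; not the value used in the paper.
THRESHOLD = 0.75


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_f1(reference, candidate):
    """Token-overlap F1 between two token lists (a crude stand-in for METEOR)."""
    if not reference or not candidate:
        return 0.0
    shared = sum((Counter(reference) & Counter(candidate)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(candidate)
    recall = shared / len(reference)
    return 2 * precision * recall / (precision + recall)


def best_window_score(test_item, page_text):
    """Slide a window the length of the test item over the page; keep the best score."""
    ref = tokenize(test_item)
    page = tokenize(page_text)
    if not ref:
        return 0.0
    width = len(ref)
    last_start = max(len(page) - width, 0)
    return max(
        overlap_f1(ref, page[start:start + width])
        for start in range(last_start + 1)
    )


def is_contaminated(test_item, page_text, threshold=THRESHOLD):
    """Flag the test item if any window of the page matches it closely enough."""
    return best_window_score(test_item, page_text) >= threshold


# Made-up example: a page that quotes the test question in full.
item = "The chef cut the onion because the knife was sharp"
page = "Quiz archive. The chef cut the onion because the knife was sharp. Answer: knife"
print(is_contaminated(item, page))
```

Scoring a window rather than the whole page keeps a long page from diluting a short quoted question. A heavily paraphrased copy would score below the threshold and go unflagged.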

Key Findings

Contamination varies strongly by benchmark.

Numbers: C-Eval 45.8%; MMLU 29.1%; HellaSwag 12.4%; ARC 28.7%; CommonsenseQA 1.6%; Winogrande 1.1%

Practical Use: Audit each benchmark before use; don't assume academic/OCR-sourced sets are safe.

Evidence Ref: Table 1, §5

Contamination increased over recent years.

Numbers: Up to +21% increase for some academic benchmarks (from 2017-2020 to 2020-2023)

Practical Use: Blocklists and one-time filters become stale; schedule regular re-checks.

Evidence Ref: Figure 2, §5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Contamination rate | 45.8% (C-Eval dev) | - | - | C-Eval (dev) | Table 1: 616/1346 examples flagged in Common Crawl | Table 1
Contamination rate | 29.1% (MMLU test) | - | - | MMLU (test) | Table 1: 4077/13987 examples flagged in Common Crawl | Table 1
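The reported rates follow directly from the flagged counts quoted in the Evidence column. A short check, using only the two count pairs given above (the variable and function names are illustrative):

```python
# Flagged / total counts as quoted from Table 1 in the rows above.
flagged_counts = {
    "C-Eval (dev)": (616, 1346),
    "MMLU (test)": (4077, 13987),
}


def contamination_rate(flagged, total):
    """Share of examples flagged as contaminated, in percent."""
    return 100.0 * flagged / total


rates = {
    name: round(contamination_rate(flagged, total), 1)
    for name, (flagged, total) in flagged_counts.items()
}
print(rates)  # {'C-Eval (dev)': 45.8, 'MMLU (test)': 29.1}
```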

What To Try In 7 Days

Run the open pipeline on your evaluation sets to flag input and input+label leakage.

Recompute model comparisons on clean-only subsets before product decisions.

Block or monitor a short list of high-leakage domains when assembling training data.
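Recomputing a comparison on clean-only items could look like the sketch below. It assumes each evaluation record already carries a `contaminated` flag from an audit and a `correct` flag from scoring; both field names and the toy records are hypothetical, not the pipeline's output format.

```python
def accuracy(records):
    """Fraction of records answered correctly, or None for an empty subset."""
    if not records:
        return None
    return sum(1 for r in records if r["correct"]) / len(records)


def split_accuracy(records):
    """Report accuracy on all items, clean items, and contaminated items."""
    clean = [r for r in records if not r["contaminated"]]
    dirty = [r for r in records if r["contaminated"]]
    acc_clean = accuracy(clean)
    acc_dirty = accuracy(dirty)
    # Gap > 0 means the model scores higher on items flagged as leaked.
    gap = None if acc_clean is None or acc_dirty is None else acc_dirty - acc_clean
    return {
        "all": accuracy(records),
        "clean": acc_clean,
        "contaminated": acc_dirty,
        "gap": gap,
    }


# Toy records, invented for illustration only.
toy_records = [
    {"contaminated": True, "correct": True},
    {"contaminated": True, "correct": True},
    {"contaminated": True, "correct": True},
    {"contaminated": True, "correct": False},
    {"contaminated": False, "correct": True},
    {"contaminated": False, "correct": True},
    {"contaminated": False, "correct": False},
    {"contaminated": False, "correct": False},
]
report = split_accuracy(toy_records)
print(report)
```

Ranking candidate models by the `clean` figure rather than `all` keeps leaked items from deciding the comparison. A large positive `gap` is a prompt to look closer, not proof of memorisation.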

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Method relies on Bing + Common Crawl; it misses training data not present in Common Crawl or not indexed by search engines.

Search API costs and query-length limits restrict scale and prevent long-passage benchmarks from being checked.

When Not To Use

Evaluating benchmarks with very long passages that exceed search API query limits.

Auditing models trained largely on private or proprietary corpora not in Common Crawl.

Failure Modes

False negatives if copies in the training data are paraphrased beyond the METEOR threshold.

False positives if matched pages quote test content out of context (mitigated but not eliminated by windowing and order penalty).

Core Entities

Models

LLaMA, Llama-2, Yi, Mistral, Baichuan, Qwen, Mistral-FT, Llama-2 Chat, Llama-2 Chat 70B, Yi 6B, Qwen 7B, Baichuan2 7B

Metrics

Accuracy, METEOR, perplexity

Datasets

MMLU, C-Eval, Winogrande, CommonsenseQA, HellaSwag, ARC

Benchmarks

MMLU, C-Eval, Winogrande, CommonsenseQA, HellaSwag, ARC