Open audit and pipeline show 1%–46% test-set leakage and uneven score inflation across six popular benchmarks

October 26, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.3

Cost Impact Score

0.4

Citation Count

3

Authors

Yucheng Li, Frank Guerin, Chenghua Lin

Why It Matters For Business

Contaminated benchmarks can inflate model metrics and lead teams to pick models that simply memorised test examples rather than truly generalising.

Summary TLDR

The authors release an open pipeline that checks whether test questions appear verbatim or near-verbatim in Common Crawl, using Bing search and METEOR matching. They report contamination rates of 1.1%–45.8% across six multiple-choice benchmarks (Winogrande, CommonsenseQA, HellaSwag, ARC, MMLU, C-Eval). Contamination has grown over time and can inflate accuracy (up to +14% on C-Eval and +7% on HellaSwag on contaminated subsets), but the effect varies by dataset and model size. Larger models tend to benefit more from contaminated test items. The pipeline and a domain-level analysis are shared to help teams audit and reduce leakage.

Problem Statement

Benchmark test examples often leak into models' training data, allowing memorisation to boost reported scores and misleading comparisons. Existing contamination checks are incomplete and opaque. The paper builds an open, reproducible method to audit contamination across models and benchmarks and to measure how leakage changes reported performance.

Main Contribution

An open-source pipeline that combines Bing search, the Common Crawl index, and METEOR matching to flag contaminated test examples.

A contamination audit across 15+ LLMs and six multiple-choice QA benchmarks, reporting per-dataset contamination rates and growth over time.

Analysis of how contamination affects accuracy, showing dataset-dependent effects and that larger models often gain more.

A domain-level analysis showing that contamination concentrates in a few sites, plus a discussion of mitigation trade-offs.
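The matching step of such a pipeline can be sketched roughly as follows. This is a simplified stand-in, not the authors' code: `meteor_like` implements only the exact-match core of METEOR (unigram F-mean plus the chunk-based order penalty, with no stemming or synonym matching), and the function names, the sliding-window width, and the 0.75 threshold are all illustrative assumptions.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens (illustrative normalisation)."""
    return re.findall(r"\w+", text.lower())


def meteor_like(reference, candidate):
    """Exact-match-only approximation of METEOR (no stemming, no synonyms)."""
    ref, cand = tokenize(reference), tokenize(candidate)
    if not ref or not cand:
        return 0.0
    # Greedy alignment: each candidate token takes the first unused
    # reference position holding the same token.
    used = [False] * len(ref)
    alignment = []  # pairs of (candidate index, reference index)
    for i, tok in enumerate(cand):
        for j, ref_tok in enumerate(ref):
            if not used[j] and ref_tok == tok:
                used[j] = True
                alignment.append((i, j))
                break
    matches = len(alignment)
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # A chunk is a run of matches that is contiguous in both strings;
    # more chunks means more reordering, hence a larger penalty.
    chunks = 1
    for (i0, j0), (i1, j1) in zip(alignment, alignment[1:]):
        if i1 != i0 + 1 or j1 != j0 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)


def best_window_score(question, page_text):
    """Best score of the question against any question-length page window."""
    q_len = len(tokenize(question))
    page = tokenize(page_text)
    if q_len == 0 or not page:
        return 0.0
    last_start = max(0, len(page) - q_len)
    return max(
        meteor_like(question, " ".join(page[s:s + q_len]))
        for s in range(last_start + 1)
    )


def is_contaminated(question, page_text, threshold=0.75):
    """Flag a test question if some page window scores above the threshold.

    The 0.75 default is an assumed value for illustration only.
    """
    return best_window_score(question, page_text) >= threshold
```

In practice the page text would come from search results that also appear in the Common Crawl index; retrieval is left out here because it depends on API access. A production audit would more likely use a full METEOR implementation than this reduced version.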

Key Findings

Contamination varies strongly by benchmark.

Numbers: C-Eval 45.8%; MMLU 29.1%; HellaSwag 12.4%; ARC 28.7%; CommonsenseQA 1.6%; Winogrande 1.1%

Contamination increased over recent years.

Numbers: Up to +21% increase for some academic benchmarks (2017-2020 → 2020-2023)

Contamination can inflate accuracy but effects are dataset-dependent.

Numbers: Up to +14% on C-Eval (Yi 6B), up to +7% on HellaSwag; little change on MMLU

Larger models tend to benefit more from contaminated data.

Numbers: Llama-2 70B gained larger boosts (e.g., +6% to +11% on some contaminated splits) than its smaller variants

Most contamination is input-and-label (question+answer) rather than input-only.

Numbers: A large share of contaminated cases are input-and-label (e.g., C-Eval: 40.6% input-and-label vs 5.1% input-only)
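The input-only versus input-and-label distinction can be illustrated with a small classifier. This is a sketch under stated assumptions, not the paper's procedure: it uses plain token overlap rather than METEOR to locate the question, then looks for the answer text within a fixed number of tokens around it. The names `Leak`, `classify_leak`, `min_recall`, and `context` are hypothetical.

```python
import re
from collections import Counter
from enum import Enum


class Leak(Enum):
    CLEAN = "clean"
    INPUT_ONLY = "input-only"
    INPUT_AND_LABEL = "input-and-label"


def tokens(text):
    return re.findall(r"\w+", text.lower())


def best_window(needle, haystack):
    """Return (recall, start) for the haystack window sharing most needle tokens."""
    n = len(needle)
    need = Counter(needle)
    best = (0.0, 0)
    for start in range(max(1, len(haystack) - n + 1)):
        window = Counter(haystack[start:start + n])
        shared = sum((need & window).values())  # multiset intersection size
        recall = shared / n
        if recall > best[0]:
            best = (recall, start)
    return best


def contains_phrase(phrase, toks):
    """True if the phrase tokens occur contiguously in toks."""
    k = len(phrase)
    return any(toks[i:i + k] == phrase for i in range(len(toks) - k + 1))


def classify_leak(question, answer, page_text, min_recall=0.8, context=30):
    """Classify one page as clean, input-only, or input-and-label leakage.

    min_recall and context are assumed values chosen for illustration.
    """
    q, a, page = tokens(question), tokens(answer), tokens(page_text)
    if not q or not page:
        return Leak.CLEAN
    recall, start = best_window(q, page)
    if recall < min_recall:
        return Leak.CLEAN
    # Look for the answer only outside the matched question span, so an
    # answer string that also occurs inside the question is not counted.
    before = page[max(0, start - context):start]
    after = page[start + len(q):start + len(q) + context]
    if a and (contains_phrase(a, before) or contains_phrase(a, after)):
        return Leak.INPUT_AND_LABEL
    return Leak.INPUT_ONLY
```

For multiple-choice items, short answers such as a single letter or a common word will match too easily, so a real audit would need a stricter check than the contiguous-phrase test used here.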

Results

Contamination rate: 45.8% (C-Eval dev)

Contamination rate: 29.1% (MMLU test)

Contamination rate: 12.4% (HellaSwag dev)

Accuracy: +14.0% (baseline: clean subset accuracy)

Accuracy: +7.0% (baseline: clean subset accuracy)

Temporal increase in contamination: up to +21.0% on some academic benchmarks (baseline: 2017-2020 window)
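The accuracy figures above compare a model's score on contaminated items against its score on clean items. A minimal sketch of that comparison follows, assuming each evaluation record carries a `correct` flag and a `contaminated` flag; both field names and the function name are hypothetical.

```python
def accuracy_by_contamination(records):
    """Return (clean_accuracy, contaminated_accuracy, gap).

    Each record is a dict with boolean 'correct' and 'contaminated' keys
    (assumed field names). A subset with no items yields None, and the gap
    is None unless both subsets are non-empty.
    """
    def accuracy(subset):
        if not subset:
            return None
        return sum(1 for r in subset if r["correct"]) / len(subset)

    clean = [r for r in records if not r["contaminated"]]
    dirty = [r for r in records if r["contaminated"]]
    clean_acc, dirty_acc = accuracy(clean), accuracy(dirty)
    gap = None
    if clean_acc is not None and dirty_acc is not None:
        gap = dirty_acc - clean_acc  # positive means inflation on leaked items
    return clean_acc, dirty_acc, gap


# Made-up records for illustration only; these are not results from the paper.
example_records = [
    {"correct": True, "contaminated": False},
    {"correct": False, "contaminated": False},
    {"correct": True, "contaminated": False},
    {"correct": False, "contaminated": False},
    {"correct": True, "contaminated": True},
    {"correct": True, "contaminated": True},
    {"correct": True, "contaminated": True},
    {"correct": False, "contaminated": True},
]
clean_acc, dirty_acc, gap = accuracy_by_contamination(example_records)
```

A gap computed this way mixes memorisation with any difficulty difference between the two subsets, so it is best read as a signal rather than a precise measurement.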

Who Should Care

What To Try In 7 Days

Run the open pipeline on your evaluation sets to flag input and input+label leakage.

Recompute model comparisons on clean-only subsets before product decisions.

Block or monitor a short list of high-leakage domains when assembling training data.
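The last step, finding high-leakage domains and filtering them out of training data, might look like the sketch below. It assumes you already have the URLs of pages flagged as containing test items and that each training document is a dict with a `url` key; the function names and that document shape are illustrative.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url):
    """Extract the host name, dropping a leading 'www.' for grouping."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def top_leak_domains(flagged_urls, k=3):
    """Count flagged pages per domain and return the k most frequent."""
    return Counter(domain_of(u) for u in flagged_urls).most_common(k)


def filter_training_docs(docs, blocklist):
    """Drop documents whose URL belongs to a blocklisted domain."""
    blocked = set(blocklist)
    return [d for d in docs if domain_of(d["url"]) not in blocked]


# Placeholder URLs for illustration only.
flagged = [
    "https://www.example.com/quiz/1",
    "https://example.com/quiz/2",
    "https://example.com/quiz/3",
    "https://example.org/answers",
    "https://example.org/more-answers",
    "https://example.net/page",
]
ranking = top_leak_domains(flagged, k=2)
blocklist = [domain for domain, _ in ranking]
docs = [
    {"url": "https://example.com/quiz/9", "text": "..."},
    {"url": "https://example.net/blog", "text": "..."},
]
kept = filter_training_docs(docs, blocklist)
```

Because leaked content can move to new domains over time, a list built this way would need to be regenerated periodically rather than treated as fixed.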

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Method relies on Bing + Common Crawl; it misses training data not present in Common Crawl or not indexed by search engines.
  • Search API costs and query-length limits restrict scale and prevent long-passage benchmarks from being checked.
  • METEOR-based matching and thresholds can still yield false positives and false negatives.

When Not To Use

  • Evaluating benchmarks with very long passages that exceed search API query limits.
  • Auditing models trained largely on private or proprietary corpora not in Common Crawl.
  • When you need exact provenance of every training token rather than likely overlap signals.

Failure Modes

  • False negatives if copies in the training data are paraphrased beyond the METEOR threshold.
  • False positives if matched pages quote test content out of context (mitigated but not eliminated by windowing and order penalty).
  • Blocklists become ineffective as leaked content migrates to new domains over time.

Core Entities

Models

  • LLaMA
  • Llama-2
  • Yi
  • Mistral
  • Baichuan
  • Qwen
  • Mistral-FT
  • Llama-2 Chat
  • Llama-2 Chat 70B
  • Yi 6B
  • Qwen 7B
  • Baichuan2 7B

Metrics

  • Accuracy
  • METEOR
  • perplexity

Datasets

  • MMLU
  • C-Eval
  • Winogrande
  • CommonsenseQA
  • HellaSwag
  • ARC
