Survey of how benchmark leaks (data contamination) distort LLM evaluations, with practical fixes

June 6, 2024 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

6

Authors

Cheng Xu, Shuhao Guan, Derek Greene, M-Tahar Kechadi

Why It Matters For Business

Contaminated benchmarks can make models look better than they are, misleading product decisions and inflating R&D ROI claims.

Summary TLDR

This survey defines Benchmark Data Contamination (BDC), which occurs when evaluation data or related information appears in LLM training data, and reviews detection and mitigation work. It groups detection into matching-based (string/overlap, membership inference, generation) and comparison-based (distribution, perplexity, time-based) approaches. Mitigations fall into three families: curate new private/dynamic benchmarks, refactor existing data (regenerate/augment/filter), or move to benchmark-free evaluation (LLM-as-judge or human-in-the-loop). The paper also collects empirical findings from prior studies, including reported contamination rates of 1%–45% and targeted countermeasures that reduce or reveal inflated scores.

Problem Statement

When LLMs have seen test examples or benchmark signals during training, reported scores can be inflated and misleading. This survey maps how contamination happens, how to detect it, and how to reduce its impact across common tasks and benchmarks.

Main Contribution

A clear definition and four-level taxonomy of benchmark data contamination (semantic, information, data, label)

A structured review of detection techniques: matching-based and comparison-based methods

A structured review of mitigation strategies: data curation, data refactoring, and benchmark-free evaluation

A synthesis of practical challenges and future directions, plus concrete examples from code, QA, and text benchmarks

Key Findings

BDC has four severity levels: semantic, information, data, and label exposure.

Reported contamination prevalence varies widely across studies.

Numbers: 1%–45% contamination reported

Simple string-matching detectors miss many contamination cases.

Numbers: Paraphrasing or evasive fine-tuning can raise scores without triggering matching signals; EAL (evasive augmentation) raised benchmark scores by up to 15%
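To illustrate why string matching misses paraphrased leaks, here is a minimal sketch of a word n-gram overlap check. It is not the method of any specific paper in the survey; the function names, the 8-gram window, and the example strings are all assumptions made up for illustration.

```python
import re

def ngrams(text, n=8):
    """Lowercase the text, drop punctuation, and return its set of word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(test_item, corpus_doc, n=8):
    """Fraction of the test item's n-grams that also occur in the corpus document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to match
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)

# Hypothetical benchmark item and two hypothetical training documents.
test_item = "Write a function that returns the sum of all even numbers in a list of integers."
verbatim_doc = "Forum post: " + test_item + " Any hints?"
paraphrased_doc = "Given an integer array, implement a routine that adds up every even element."

verbatim_score = overlap_ratio(test_item, verbatim_doc)
paraphrase_score = overlap_ratio(test_item, paraphrased_doc)
print(verbatim_score, paraphrase_score)
```

The verbatim copy shares every 8-gram with the test item, while the paraphrase shares none even though it describes the same task. That gap is the blind spot this finding refers to.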

Refactoring benchmarks with LLM-generated variants can reveal overfitting.

Numbers: EvoEval reduced HumanEval pass rates by 39.4% on average

Some detection combinations achieve high instance-level accuracy.

Numbers: 92%–100% contamination detection accuracy reported

Results

Contamination prevalence

Value: 1%–45%

HumanEval pass rate change under EvoEval

Value: -39.4% (average reduction)

Baseline: original HumanEval pass rates

Undetected performance gain via evasive augmentation

Value: up to +15% accuracy

Baseline: original benchmark scores before EAL

Contamination detection accuracy

Value: 92%–100%

What To Try In 7 Days

Run simple contamination checks: n-gram overlap and perplexity on your test set

Compare model performance on recent or dynamically collected examples to spot temporal leakage

Use variant prompts or paraphrases of key test items to assess robustness quickly
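The time-based comparison in the second step can be sketched as below. This is a rough illustration under stated assumptions: each evaluation record carries a publication date and a correctness flag, the cutoff date stands in for the model's training cutoff, and all names and records are hypothetical.

```python
from datetime import date

def accuracy(results):
    """Share of correct items, or None if the list is empty."""
    if not results:
        return None
    return sum(r["correct"] for r in results) / len(results)

def temporal_gap(results, cutoff):
    """Accuracy on items published on or before the cutoff minus accuracy on later items."""
    old = [r for r in results if r["published"] <= cutoff]
    new = [r for r in results if r["published"] > cutoff]
    acc_old, acc_new = accuracy(old), accuracy(new)
    if acc_old is None or acc_new is None:
        return None  # one side of the split is empty, so no comparison is possible
    return acc_old - acc_new

# Made-up records for illustration only; replace with your own evaluation log.
results = [
    {"published": date(2023, 1, 10), "correct": True},
    {"published": date(2023, 3, 2), "correct": True},
    {"published": date(2023, 5, 20), "correct": True},
    {"published": date(2023, 8, 1), "correct": False},
    {"published": date(2024, 2, 14), "correct": True},
    {"published": date(2024, 3, 30), "correct": False},
    {"published": date(2024, 4, 18), "correct": False},
    {"published": date(2024, 5, 5), "correct": True},
]

gap = temporal_gap(results, cutoff=date(2023, 12, 31))
print(f"accuracy gap (old - new): {gap:.2f}")
```

A large positive gap is a signal worth investigating, not proof of leakage: newer items may simply be harder, so it helps to pair this check with the overlap and paraphrase checks from the other steps.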

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • No single mitigation fully prevents semantic- or information-level contamination
  • Private or dynamic benchmarks require resources and reduce reproducibility
  • LLM-as-judge and LLM-based decontamination inherit bias if judges were trained on contaminated corpora
  • Matching methods (n-gram overlap) are brittle to paraphrase and evasive fine-tuning

When Not To Use

  • Do not rely on public static benchmarks alone for final model claims
  • Do not trust single-method contamination detectors without cross-checks
  • Do not use LLM-as-judge when judge models may share training data with the models being evaluated

Failure Modes

  • Paraphrase or evasive fine-tuning hides contamination from string-matching
  • AI-generated content (AIGC) creates second-order contamination: regenerated benchmarks leak back into future training corpora
  • False negatives from opaque proprietary models when training data is inaccessible
  • Human evaluators bring subjective biases and can be contaminated by prior exposure

Core Entities

Models

  • GPT-3
  • GPT-3.5
  • GPT-4
  • ChatGPT
  • Claude-3
  • Gemini
  • LLaMA
  • PaLM

Metrics

  • Accuracy
  • perplexity
  • output distribution divergence (CDD)
  • contamination percent
  • pass rate
  • expected calibration error

Datasets

  • HumanEval
  • Spider
  • Termite
  • DetCon
  • ComiEval
  • Codeforces
  • Project Euler
  • AG News
  • WNLI
  • XSum
  • C4 (Colossal Clean Crawled Corpus)

Benchmarks

  • HumanEval
  • Dynaboard
  • LiveCodeBench
  • LatestEval
  • EvoEval
  • DyVal / DyVal2
  • TreeEval
  • FreeEval
  • Chatbot Arena
  • AlpacaEval

Context Entities

Models

  • Code Llama
  • Alpaca

Metrics

  • attack success rate
  • performance drop rate

Datasets

  • Spider
  • HumanEval