Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
6
Why It Matters For Business
Contaminated benchmarks can make models look better than they are, misleading product decisions and inflating R&D ROI claims.
Summary TLDR
This survey defines Benchmark Data Contamination (BDC), the appearance of evaluation data or related information in LLM training data, and reviews detection and mitigation work. It groups detection into matching-based (string/overlap, membership inference, generation) and comparison-based (distribution, perplexity, time-based) approaches. Mitigations fall into three families: curate new private/dynamic benchmarks, refactor existing data (regenerate/augment/filter), or move to benchmark-free evaluation (LLM-as-judge or human-in-the-loop). The paper also collects empirical findings from prior studies, including reported contamination rates ranging from 1% to 45% and targeted countermeasures that reduce or reveal inflated scores.
Problem Statement
When LLMs have seen test examples or benchmark signals during training, reported scores can be inflated and misleading. This survey maps how contamination happens, how to detect it, and how to reduce its impact across common tasks and benchmarks.
Main Contribution
A clear definition and four-level taxonomy of benchmark data contamination (semantic, information, data, label)
A structured review of detection techniques: matching-based and comparison-based methods
A structured review of mitigation strategies: data curation, data refactoring, and benchmark-free evaluation
A synthesis of practical challenges and future directions, plus concrete examples from code, QA, and text benchmarks
Key Findings
BDC has four severity levels: semantic, information, data, and label exposure.
Reported contamination prevalence varies widely across studies, from 1% to 45%.
Simple string-matching detectors miss many contamination cases.
Refactoring benchmarks with LLM-generated variants can reveal overfitting.
Some detection combinations achieve high instance-level accuracy.
Results
contamination prevalence
HumanEval pass rate change under EvoEval
undetected performance gain via evasive augmentation
Accuracy
Who Should Care
What To Try In 7 Days
Run simple contamination checks: n-gram overlap and perplexity on your test set
Compare model performance on recent or dynamically collected examples to spot temporal leakage
Use variant prompts or paraphrases of key test items to assess robustness quickly
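The first two checks above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method taken from the survey: the helper names (`ngrams`, `contamination_percent`, `temporal_gap`), the whitespace tokenization, and the default n-gram size are all hypothetical choices, and it assumes you can access at least a sample of the training corpus.

```python
from datetime import date


def ngrams(text, n):
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_percent(test_items, corpus_docs, n=5):
    """Percent of test items that share at least one n-gram with any corpus document.

    Assumption: one shared n-gram is enough to flag an item; real pipelines
    usually pick n and the threshold empirically.
    """
    if not test_items:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(1 for item in test_items if ngrams(item, n) & corpus_grams)
    return 100.0 * flagged / len(test_items)


def temporal_gap(results, cutoff):
    """Accuracy on items dated before `cutoff` minus accuracy on items dated on or after it.

    `results` is a list of (item_date, is_correct) pairs. A large positive gap
    is a hint (not proof) that older items leaked into training data.
    """
    def accuracy(flags):
        return sum(flags) / len(flags) if flags else float("nan")

    older = [ok for d, ok in results if d < cutoff]
    newer = [ok for d, ok in results if d >= cutoff]
    return accuracy(older) - accuracy(newer)


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog near the river"]
    tests = [
        "the quick brown fox jumps over the fence",
        "an entirely different sentence about sorting integers quickly",
    ]
    print(contamination_percent(tests, corpus, n=5))

    scored = [
        (date(2023, 1, 1), True),
        (date(2023, 2, 1), True),
        (date(2024, 6, 1), True),
        (date(2024, 7, 1), False),
    ]
    print(temporal_gap(scored, cutoff=date(2024, 1, 1)))
```

Treat both numbers as screening signals only: an overlap hit may be a common phrase rather than a leak, and a temporal gap can also reflect newer items simply being harder.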
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- No single mitigation fully prevents semantic- or information-level contamination
- Private or dynamic benchmarks require resources and reduce reproducibility
- LLM-as-judge and LLM-based decontamination inherit bias if judges were trained on contaminated corpora
- Matching methods (n-gram overlap) are brittle to paraphrase and evasive fine-tuning
When Not To Use
- Do not rely on public static benchmarks alone for final model claims
- Do not trust single-method contamination detectors without cross-checks
- Do not use LLM-as-judge when judge models may share training data with the models being evaluated
Failure Modes
- Paraphrase or evasive fine-tuning hides contamination from string-matching
- AI-generated content (AIGC) creates second-order contamination: regenerated benchmarks leak back into future training corpora
- False negatives from opaque proprietary models when training data is inaccessible
- Human evaluators bring subjective biases and can be contaminated by prior exposure
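The first failure mode in this list, paraphrase hiding contamination from string matching, can be illustrated with a toy example. The sketch below is an assumption-laden demonstration rather than a detector from the survey: `word_ngrams` and `overlap_ratio` are hypothetical helpers, and the benchmark item and its two leaked variants are made up for illustration.

```python
def word_ngrams(text, n=3):
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(candidate, reference, n=3):
    """Fraction of the candidate's n-grams that also occur in the reference."""
    cand = word_ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & word_ngrams(reference, n)) / len(cand)


# Hypothetical benchmark item and two ways it could appear in training data.
benchmark_item = "write a function that returns the sum of two integers"
verbatim_leak = "write a function that returns the sum of two integers"
paraphrased_leak = "implement a routine which adds a pair of whole numbers together"

if __name__ == "__main__":
    # The verbatim copy is fully covered by the benchmark's trigrams.
    print(overlap_ratio(verbatim_leak, benchmark_item))
    # The paraphrase carries the same task but shares no trigram with it.
    print(overlap_ratio(paraphrased_leak, benchmark_item))
```

The paraphrased variant poses the same task yet scores zero overlap, which is why string-matching checks are better paired with a second signal, such as the comparison-based approaches mentioned in the summary.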
Core Entities
Models
- GPT-3
- GPT-3.5
- GPT-4
- ChatGPT
- Claude-3
- Gemini
- LLaMA
- PaLM
Metrics
- Accuracy
- perplexity
- output distribution divergence (CDD)
- contamination percent
- pass rate
- expected calibration error
Datasets
- HumanEval
- Spider
- Termite
- DetCon
- ComiEval
- Codeforces
- Project Euler
- AG News
- WNLI
- XSum
- C4 (Colossal Clean Crawled Corpus)
Benchmarks
- HumanEval
- Dynaboard
- LiveCodeBench
- LatestEval
- EvoEval
- DyVal / DyVal2
- TreeEval
- FreeEval
- Chatbot Arena
- AlpacaEval
Context Entities
Models
- Code Llama
- Alpaca
Metrics
- attack success rate
- performance drop rate
Datasets
- Spider
- HumanEval

