Overview
The paper is a survey that compiles multi-source evidence; recommendations are practical but many mitigations need extra compute or private data to scale.
Citations: 6
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/4
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Contaminated benchmarks can make models look better than they are, misleading product decisions and inflating R&D ROI claims.
Who Should Care
Summary TLDR
This survey defines Benchmark Data Contamination (BDC), which occurs when evaluation data or related information appears in LLM training data, and reviews detection and mitigation work. It groups detection into matching-based (string/overlap, membership inference, generation) and comparison-based (distribution, perplexity, time-based) approaches. Mitigations fall into three families: curate new private/dynamic benchmarks, refactor existing data (regenerate/augment/filter), or move to benchmark-free evaluation (LLM-as-judge or human-in-the-loop). The paper collects empirical findings from prior studies, including reported contamination rates of 1%–45% and targeted countermeasures that reduce or reveal inflated scores.
Problem Statement
When LLMs have seen test examples or benchmark signals during training, reported scores can be inflated and misleading. This survey maps how contamination happens, how to detect it, and how to reduce its impact across common tasks and benchmarks.
Main Contribution
A clear definition and four-level taxonomy of benchmark data contamination (semantic, information, data, label)
A structured review of detection techniques: matching-based and comparison-based methods
Key Findings
BDC has four severity levels: semantic, information, data, and label exposure.
Reported contamination prevalence varies widely, from 1% to 45% across models and corpora (Li et al. [91]).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| contamination prevalence | 1%–45% | — | — | multiple MCQA benchmarks (Li et al. [91]) | Open-source contamination report found 1%–45% contamination across models and corpora | Li et al. [91] |
| HumanEval pass rate change under EvoEval | -39.4% (average reduction) | original HumanEval pass rates | -39.4% | HumanEval, 51 LLMs (Xia et al. [158]) | EvoEval prompts lowered pass rates by 39.4% on average | Xia et al. [158] |
What To Try In 7 Days
Run simple contamination checks: n-gram overlap and perplexity on your test set
Compare model performance on recent or dynamically collected examples to spot temporal leakage
Use variant prompts or paraphrases of key test items to assess robustness quickly
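The first two checks above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the survey: the function names, the word-level tokenization, the n-gram size `n=8`, and the `threshold=0.5` flagging cutoff are all assumptions chosen for the example, and it presumes you have access to a sample of the training corpus and per-item collection dates.

```python
from datetime import date
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of word-level n-grams (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_item: str, corpus_ngrams: Set[NGram], n: int) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus."""
    item_ngrams = ngrams(test_item, n)
    if not item_ngrams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_ngrams & corpus_ngrams) / len(item_ngrams)


def flag_contaminated(
    test_items: Iterable[str],
    corpus_docs: Iterable[str],
    n: int = 8,              # assumed n-gram size, tune for your data
    threshold: float = 0.5,  # assumed flagging cutoff, not from the survey
) -> List[Tuple[str, float]]:
    """Return (item, ratio) pairs whose overlap ratio meets the threshold."""
    corpus_ngrams: Set[NGram] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    flagged = []
    for item in test_items:
        ratio = overlap_ratio(item, corpus_ngrams, n)
        if ratio >= threshold:
            flagged.append((item, ratio))
    return flagged


def temporal_accuracy_gap(
    records: Iterable[Tuple[date, bool]], cutoff: date
) -> float:
    """Accuracy on items dated before `cutoff` minus accuracy on later items.

    A large positive gap is a hint (not proof) of temporal leakage.
    """
    before = [ok for d, ok in records if d < cutoff]
    after = [ok for d, ok in records if d >= cutoff]
    if not before or not after:
        raise ValueError("need at least one record on each side of the cutoff")
    return sum(before) / len(before) - sum(after) / len(after)
```

Exact n-gram matching only catches verbatim or near-verbatim overlap, so a clean result here does not rule out paraphrased contamination; treat it as one signal to cross-check against perplexity or paraphrase-based tests.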
Risks & Boundaries
Limitations
No single mitigation fully prevents semantic- or information-level contamination
Private or dynamic benchmarks require resources and reduce reproducibility
When Not To Use
Do not rely on public static benchmarks alone for final model claims
Do not trust single-method contamination detectors without cross-checks
Failure Modes
Paraphrase or evasive fine-tuning hides contamination from string-matching
AIGC creates second-order contamination: regenerated benchmarks leak back into future training corpora

