Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
6
Why It Matters For Business
If a model's evaluation is contaminated, product decisions and vendor comparisons built on it can be wrong; check whether the model was exposed to a benchmark's test data before basing choices on its published scores.
Summary TLDR
This position paper argues that benchmark data contamination, where a model has seen test data during training, threatens NLP evaluation. The authors define three contamination types (guideline, raw text, annotation), show that contamination can occur at the pretraining, fine-tuning, and post-deployment stages, and propose practical detection measures: overlap search for open models and memorization/extractability tests for closed models. They call for a community registry, tooling, and review-time checks to flag compromised results.
Problem Statement
When a model has been trained on a benchmark's test data, reported performance is inflated and scientific claims can be wrong. Data exposure can come from many sources and is hard to detect, especially for closed models, so routine evaluations may be unreliable.
Main Contribution
Clarifies three contamination types: guideline, raw text, annotation.
Maps where contamination can occur: pretraining, supervised fine-tuning, post-deployment.
Proposes measurable signals: benchmark-data overlap for open models and extractability/memorization tests for closed models.
Calls for a public registry of contamination cases and changes in peer review and reporting.
Key Findings
Contamination inflates evaluated model performance and can lead to wrong scientific conclusions.
There are three distinct contamination types: guideline, raw text, and annotation.
For closed models, contamination can be measured by extractability (memorization): the fraction of benchmark examples a model reproduces when prompted.
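The extractability measure above can be sketched as a prefix-completion test: show the model the first part of a benchmark example and check whether its continuation closely matches the held-back remainder. This is a minimal illustration, not the authors' procedure. The `generate` callable, the 50% prefix split, and the 0.9 similarity threshold are all assumptions chosen for the example; `generate` stands in for whatever API call returns a closed model's completion.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable


def is_extracted(example: str, generate: Callable[[str], str],
                 prefix_ratio: float = 0.5, threshold: float = 0.9) -> bool:
    """Return True if the model's continuation closely matches the held-back suffix.

    `generate` is a hypothetical wrapper around the model under test.
    `prefix_ratio` and `threshold` are illustrative defaults, not values from the paper.
    """
    words = example.split()
    cut = max(1, int(len(words) * prefix_ratio))
    prefix, suffix = " ".join(words[:cut]), " ".join(words[cut:])
    if not suffix:
        return False  # a one-word example leaves nothing to reproduce
    # Compare only as many generated words as the suffix contains.
    completion = " ".join(generate(prefix).split()[:len(suffix.split())])
    return SequenceMatcher(None, suffix, completion).ratio() >= threshold


def extractability(examples: Iterable[str], generate: Callable[[str], str]) -> float:
    """Fraction of examples whose suffix the model reproduces from the prefix."""
    examples = list(examples)
    if not examples:
        return 0.0
    hits = sum(is_extracted(ex, generate) for ex in examples)
    return hits / len(examples)
```

A high score suggests memorization of the benchmark, but a low score does not rule out exposure, since a model can be trained on data without reproducing it.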
Results
verbatim regeneration of benchmark examples
recommended contamination measures
Who Should Care
What To Try In 7 Days
Run quick memorization prompts on closed models for key benchmarks (extractability test).
Search open training corpora for benchmark examples, using the ROOTS search tool or Data Portraits where available.
Add a contamination check to model evaluation steps and document results in reports or PRDs.
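When the training corpus is available locally rather than through a hosted search tool, the overlap check can be approximated with n-gram matching. The sketch below flags a benchmark example if it shares at least one word n-gram with any corpus document and reports the flagged percentage. The function names, the tokenization, and the default n-gram length of 8 are assumptions for illustration; they are not taken from the paper or from the tools named above.

```python
import re
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercased word n-grams of a text (punctuation is dropped)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_percentage(benchmark: Iterable[str], corpus: Iterable[str],
                       n: int = 8) -> float:
    """Percentage of benchmark examples sharing at least one n-gram with the corpus."""
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus:
        corpus_ngrams |= ngrams(doc, n)
    benchmark = list(benchmark)
    if not benchmark:
        return 0.0
    # An example is flagged when any of its n-grams also appears in the corpus.
    flagged = sum(1 for ex in benchmark if ngrams(ex, n) & corpus_ngrams)
    return 100.0 * flagged / len(benchmark)
```

Holding every corpus n-gram in memory only works for small samples; a real corpus would need an index or a streaming approach. Whatever method is used, the n-gram length and matching rule should be recorded alongside the result so the check can be reproduced.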
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Position paper: proposes ideas but provides limited systematic measurement results.
- Detecting contamination in closed models remains manual and is currently hard to scale.
- Registry and tooling require community coordination and sustained effort.
When Not To Use
- Do not rely solely on negative memorization results to prove non-contamination.
- Avoid treating overlap/extractability measures as definitive without reporting methodology details.
Failure Modes
- False negatives: model was trained on data but does not memorize or reproduce it.
- False positives: model reproduces text from mirrors or unrelated web copies without original benchmark exposure.
- Incomplete evidence: partial overlap may not indicate full contamination of test splits.
Core Entities
Models
- GPT-3
- GPT-4
- ChatGPT
- LLaMA
- LLaMA 2
- WizardCoder
- BLOOM
- Codex
- GitHub Copilot
Metrics
- extractability (memorization fraction)
- benchmark data overlap (percentage overlap)
Datasets
- CoNLL2003
- GSM8K
- MATH
- BIG-bench
- GLUE
- SuperGLUE
- XNLI
- MultiCoNER2
- IMDB
- CNN/DailyMail
- C4
Benchmarks
- CoNLL2003
- GSM8K
- MATH
- BIG-bench
- GLUE
- SuperGLUE
- XNLI

