Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
8
Why It Matters For Business
If test examples leak into model training, reported model gains may be inflated; flagging and removing leaked examples preserves honest evaluation and prevents bad product decisions.
Summary TLDR
The paper presents two practical ways to detect benchmark data leaking into LLM training data. First, an information-retrieval (IR) pipeline (Pyserini + BM25) searches open pretraining corpora (The Pile, C4) for overlap with benchmark items. Second, TS-Guessing masks a keyword in a question or a wrong multiple-choice option and asks the model to fill it in. Strong closed-source models often reproduce masked benchmark content (e.g., ChatGPT/GPT-4 reach ~52–57% exact match on masked MMLU items), and fine-tuning on a test set drives that to ~100%, a clear sign of contamination risk.
Problem Statement
Benchmark examples that appear in model training data can inflate reported accuracy. Existing n‑gram overlap checks need full training corpora and still miss many leaks. We need lightweight, model‑agnostic probes and retrieval checks to flag probable contamination for both open and closed models.
Main Contribution
- A retrieval pipeline (Pyserini + BM25) that searches open pretraining corpora (The Pile, C4) for benchmark overlap.
- TS-Guessing: a masking protocol that asks an LLM to fill in a masked keyword or a masked wrong multiple-choice option, revealing memorized test items.
- An empirical evaluation showing that closed-source LLMs often reproduce masked benchmark content (notably high exact match on MMLU) and that fine-tuning on a test set raises exact match (EM) to nearly 100%.
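The wrong-option variant of TS-Guessing can be sketched as below. This is a minimal illustration, not the paper's implementation: the `MultipleChoiceItem` type, the `build_option_mask_prompt` helper, the `[MASK]` token, and the instruction wording are all assumptions made for this example.

```python
from dataclasses import dataclass

MASK = "[MASK]"  # hypothetical placeholder token


@dataclass
class MultipleChoiceItem:
    question: str
    options: list[str]
    answer_index: int  # position of the correct option


def build_option_mask_prompt(item: MultipleChoiceItem, wrong_index: int):
    """Hide one WRONG option and return (prompt, hidden_text)."""
    if wrong_index == item.answer_index:
        raise ValueError("mask a wrong option, not the correct answer")
    letters = "ABCDEFGH"
    lines = [f"Question: {item.question}"]
    for i, option in enumerate(item.options):
        shown = MASK if i == wrong_index else option
        lines.append(f"{letters[i]}. {shown}")
    # Illustrative instruction; the paper's exact wording may differ.
    lines.append(
        f"Fill in the {MASK} in option {letters[wrong_index]}. "
        "Reply with the option text only."
    )
    return "\n".join(lines), item.options[wrong_index]


item = MultipleChoiceItem(
    question="Which planet is closest to the Sun?",
    options=["Venus", "Mercury", "Mars", "Jupiter"],
    answer_index=1,
)
prompt, target = build_option_mask_prompt(item, wrong_index=2)
```

A wrong option is hard to guess from world knowledge alone, so a model that reproduces `target` verbatim is more likely to have seen the item during training.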
Key Findings
- Closed-source LLMs often reproduce masked wrong options in MMLU.
- Fine-tuning on the test set makes masked exact match near-perfect.
- Manual review found notable overlap that automatic n‑gram checks missed.
- TS-Guessing produces nontrivial hits on TruthfulQA.
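Exact match over masked items can be computed roughly as follows. The normalization here (lowercasing, stripping punctuation, collapsing whitespace) is an assumption for illustration; the paper may normalize differently, and stricter or looser rules will shift the reported rate.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()


def exact_match_rate(predictions: list[str], targets: list[str]) -> float:
    """Fraction of predictions equal to their target after normalization."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(normalize(p) == normalize(t) for p, t in zip(predictions, targets))
    return hits / len(targets)


# Two of the three model guesses match the hidden text.
rate = exact_match_rate(["Mars.", "venus", "Saturn"], ["mars", "Venus", "Jupiter"])
```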
Results
Exact Match (EM) — TS-Guessing on MMLU
Exact Match (EM) after deliberate contamination
Human-labeled contamination rate in retrieval spot-check
Retrieval effectiveness (query type)
Who Should Care
What To Try In 7 Days
- Run TS-Guessing on your model: mask keywords and wrong options and track EM.
- Index public corpora (C4/The Pile) with Pyserini and run question+label queries for overlap.
- Add a small human check (50–100 examples) where automated metrics disagree with semantic judgment.
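To see what a question+label query does before setting up a full index, the sketch below scores a toy corpus with a plain-Python BM25. It is a stand-in for the Pyserini pipeline and does not use Pyserini's API; the tokenizer, the `bm25_rank` helper, and the `k1`/`b` values are assumptions for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, documents: list[str], k1: float = 0.9, b: float = 0.4):
    """Return (doc_index, score) pairs sorted by descending BM25 score."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(doc_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(t) for t in doc_tokens) / n_docs
    doc_freq = Counter()
    for tokens in doc_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    query_terms = set(tokenize(query))
    scores = []
    for idx, tokens in enumerate(doc_tokens):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append((idx, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


corpus = [
    "The capital of France is Paris, a city on the Seine.",
    "Bananas are rich in potassium.",
    "Question: What is the capital of France? Answer: Paris",
]
# Query = benchmark question concatenated with its label.
ranked = bm25_rank("What is the capital of France? Paris", corpus)
```

The top hits are only candidates: a high BM25 score signals lexical overlap, so a human still needs to confirm whether the passage really contains the benchmark item.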
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Retrieval uses only BM25 and may miss semantic matches (§7).
- Indexing and retrieval are slow and resource-heavy (≈2–3 minutes per data point).
- TS-Guessing relies on models following instructions and may need few‑shot prompts.
- Automatic text-similarity scores can disagree with human judgment; GPTScore is costlier.
When Not To Use
- When full training-data access is available; in that case, prefer corpus-level n‑gram checks.
- On very short or indexical questions where masking cannot isolate a keyword.
- As the sole proof of contamination: it is a strong signal but not definitive.
Failure Modes
- False positives: the model correctly guesses the masked content by chance or from world knowledge.
- False negatives: contamination exists but the model fails to reproduce the masked text.
- Metric mismatch: high text-similarity score without semantic match, or vice versa.
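One way to limit metric mismatch is to route borderline cases to a human reviewer. The sketch below computes a token-level ROUGE-L F1 from the longest common subsequence and flags pairs that fall in an ambiguous band. The whitespace tokenization, the `needs_human_review` helper, and the `low`/`high` thresholds are hypothetical choices for illustration, not values from the paper.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Token-level ROUGE-L F1 (equal weight on precision and recall)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def needs_human_review(prediction: str, reference: str,
                       low: float = 0.3, high: float = 0.9) -> bool:
    """Flag pairs whose lexical similarity is neither clearly low nor clearly high."""
    return low <= rouge_l_f1(prediction, reference) < high


score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
flagged = needs_human_review("the cat sat on the mat", "the cat lay on the mat")
```

Here the two sentences share five of six tokens in order, so the score is high even though the meaning differs, which is the kind of case a reviewer should see.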
Core Entities
Models
- ChatGPT (GPT-3.5-turbo)
- GPT-4
- Claude-instant-1-100k
- Claude-2
- LLaMa 2-13B
- Mistral-7B
Metrics
- Exact Match (EM)
- Rouge-L F1
- BM25
- SacreBLEU
- BLEURT
- GPTScore
Datasets
- MMLU
- TruthfulQA
- HellaSwag
- WinoGrande
- GSM8K
- OpenbookQA
- PIQA
- The Pile
- C4
Benchmarks
- MMLU
- TruthfulQA
- HellaSwag
- WinoGrande
- GSM8K
- OpenbookQA
- PIQA
Context Entities
Datasets
- The Pile
- C4

