Overview
The paper gives practical, low‑barrier checks (retrieval and masking) with concrete numeric signals, but some results depend on manual review and closed‑model probing.
Citations: 8
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
If test examples leak into model training, reported model gains may be inflated; flagging and removing leaked examples preserves honest evaluation and prevents bad product decisions.
Summary TLDR
The paper presents two practical ways to detect benchmark data leaking into LLM training. First, an IR pipeline (Pyserini + BM25) searches open pretraining corpora (The Pile, C4) for overlap with benchmark items. Second, TS-Guessing masks a keyword or a wrong multiple-choice option and asks the model to fill it in. Strong closed-source models often reproduce masked benchmark content: ChatGPT and GPT-4 reach 52% and 57% exact match, respectively, on masked MMLU items. Fine-tuning ChatGPT on the test set drives exact match to about 100%, a clear sign of contamination risk.
Problem Statement
Benchmark examples that appear in model training data can inflate reported accuracy. Existing n‑gram overlap checks need full training corpora and still miss many leaks. We need lightweight, model‑agnostic probes and retrieval checks to flag probable contamination for both open and closed models.
Main Contribution
A retrieval pipeline (Pyserini + BM25) to search open pretraining corpora (The Pile, C4) for benchmark overlap.
TS-Guessing: a masking protocol that asks LLMs to fill a masked keyword or a masked wrong multiple‑choice option to reveal memorized test items.
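The wrong-option variant of TS-Guessing can be sketched as a small prompt builder. This is a minimal illustration, not the paper's implementation: the function name `build_option_mask_prompt`, the `[MASK]` token, and the prompt wording are all assumptions made here.

```python
import random

MASK = "[MASK]"  # hypothetical mask token; the paper's exact token may differ


def build_option_mask_prompt(question, options, answer_idx, rng=None):
    """Hide one wrong option of a multiple-choice item and build a fill-in prompt.

    Returns (prompt, masked_text) so the model's reply can later be
    compared against the hidden option text.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the probe reproducible
    wrong = [i for i in range(len(options)) if i != answer_idx]
    masked_idx = rng.choice(wrong)  # only wrong options are ever masked
    letters = "ABCDEFGH"
    lines = []
    for i, opt in enumerate(options):
        shown = MASK if i == masked_idx else opt
        lines.append(f"{letters[i]}. {shown}")
    # Prompt wording is an assumption for illustration only.
    prompt = (
        f"Fill in the {MASK} in the options below. "
        "Reply with the missing option text only.\n"
        f"Question: {question}\n"
        + "\n".join(lines)
        + f"\nCorrect answer: {letters[answer_idx]}"
    )
    return prompt, options[masked_idx]
```

Masking a wrong option rather than the correct one matters because a wrong option cannot easily be derived from world knowledge, so a verbatim reproduction points toward memorization of the item.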
Key Findings
Closed-source LLMs often reproduce masked wrong options in MMLU: ChatGPT reaches 52% EM and GPT-4 reaches 57% (Table 3).
Fine-tuning ChatGPT on the test set raises masked-option EM from 52% to about 100% (Figure 4).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Exact Match (EM) — TS-Guessing on MMLU | ChatGPT 52%, GPT-4 57% | — | — | MMLU (filtered test items) | Table 3 reports EM for masked wrong option task | Table 3; Section 4.2.2 |
| Exact Match (EM) after deliberate contamination | ≈100% EM | ChatGPT pre-finetune EM 52% | +~48 pp | MMLU (contamination probe) | Fine-tuned ChatGPT reproduces masked items nearly perfectly | Figure 4; Section 4.3 |
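The EM values and the percentage-point delta in the table can be computed along these lines. This is a sketch under stated assumptions: the normalization (lowercasing, stripping punctuation, collapsing whitespace) is a common convention chosen here, not necessarily the one the paper uses, and the function names are hypothetical.

```python
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace (assumed convention)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match_rate(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def delta_pp(after, before):
    """Difference between two rates, expressed in percentage points."""
    return (after - before) * 100


# Using the table's figures: about 100% EM after fine-tuning vs. 52% before.
print(round(delta_pp(1.00, 0.52)))  # 48
```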
What To Try In 7 Days
Run TS-Guessing on your model: mask keywords and wrong options and track EM.
Index public corpora (C4/The Pile) with Pyserini and run question+label queries for overlap.
Add a small human check (50–100 examples) where automated metrics disagree with semantic judgment.
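The retrieval step above might look like the following sketch. It assumes Pyserini is installed and that a Lucene index of the corpus has already been built at a path you supply. The helper names (`build_query`, `search_corpus`, `ngram_overlap`) are hypothetical, and the n-gram overlap score is a simple stand-in for whatever overlap metric you prefer, not the paper's scoring.

```python
def build_query(question, label):
    """Concatenate the question and its gold label into one retrieval query."""
    return f"{question} {label}".strip()


def search_corpus(index_dir, query, k=10):
    """Return (docid, score) pairs for the top-k hits from a prebuilt Lucene index."""
    # Imported lazily so the pure-Python helpers work without Pyserini installed.
    from pyserini.search.lucene import LuceneSearcher

    searcher = LuceneSearcher(index_dir)  # uses BM25 ranking unless configured otherwise
    return [(hit.docid, hit.score) for hit in searcher.search(query, k=k)]


def ngram_overlap(query, passage, n=3):
    """Fraction of the query's word n-grams that also occur in the passage."""

    def ngrams(text):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    query_grams = ngrams(query)
    if not query_grams:
        return 0.0
    return len(query_grams & ngrams(passage)) / len(query_grams)
```

A typical flow would be to call `search_corpus` with `build_query(question, label)`, fetch the text of each hit, and send passages with a high `ngram_overlap` to the human check.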
Risks & Boundaries
Limitations
Retrieval uses only BM25 and may miss semantic matches (§7).
Indexing and retrieval are slow and resource-heavy (≈2–3 minutes per data point).
When Not To Use
When full training-data access is available (prefer corpus-level n‑gram checks).
On very short or indexical questions where masking cannot isolate a keyword.
Failure Modes
False positives: model correctly guesses masked content by chance or world knowledge.
False negatives: contamination exists but model fails to reproduce masked text.

