Mask-and-retrieve tests show many benchmarks can leak into LLM training

Overview

Decision Snapshot: Needs Validation

The paper gives practical, low‑barrier checks (retrieval and masking) with concrete numeric signals, but some results depend on manual review and closed‑model probing.

Citations: 8

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, Arman Cohan

Links

Abstract / PDF

Why It Matters For Business

If test examples leak into model training, reported model gains may be inflated; flagging and removing leaked examples preserves honest evaluation and prevents bad product decisions.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

The paper presents two practical ways to detect benchmark data leaking into LLM training. First, an IR pipeline (Pyserini + BM25) searches open pretraining corpora (The Pile, C4) for overlaps with benchmark items. Second, TS-Guessing masks a keyword or a wrong multiple-choice option and asks the model to fill it in. Strong closed-source models often reproduce masked benchmark content (e.g., ChatGPT and GPT-4 reach ~52–57% exact match on masked MMLU items), and fine-tuning on the test set drives that to ~100%, a clear sign of contamination risk.

Problem Statement

Benchmark examples that appear in model training data can inflate reported accuracy. Existing n‑gram overlap checks need full training corpora and still miss many leaks. We need lightweight, model‑agnostic probes and retrieval checks to flag probable contamination for both open and closed models.

Main Contribution

A retrieval pipeline (Pyserini + BM25) to search open pretraining corpora (The Pile, C4) for benchmark overlap.

TS-Guessing: a masking protocol that asks LLMs to fill a masked keyword or a masked wrong multiple‑choice option to reveal memorized test items.
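The masked-wrong-option variant of TS-Guessing described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `build_masked_option_prompt`, the prompt wording, the `[MASK]` token, and the example question are all assumptions made for this sketch.

```python
import random

MASK = "[MASK]"  # placeholder token; the exact token is an assumption


def build_masked_option_prompt(question, options, answer_idx, rng=None):
    """Hide one wrong option and return (prompt, hidden_text).

    The prompt wording is hypothetical; only the idea of masking a wrong
    multiple-choice option comes from the summary above.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    wrong = [i for i in range(len(options)) if i != answer_idx]
    if not wrong:
        raise ValueError("need at least one wrong option to mask")
    masked_idx = rng.choice(wrong)
    letters = "ABCDEFGHIJ"
    lines = [
        f"Fill in the {MASK} with the missing option text.",
        f"Question: {question}",
    ]
    for i, opt in enumerate(options):
        shown = MASK if i == masked_idx else opt
        lines.append(f"{letters[i]}. {shown}")
    lines.append(f"Correct answer: {letters[answer_idx]}")
    lines.append(f"Reply with only the text of option {letters[masked_idx]}.")
    return "\n".join(lines), options[masked_idx]


# Illustrative item (not taken from MMLU).
options = ["Venus", "Mercury", "Mars", "Jupiter"]
prompt, hidden = build_masked_option_prompt(
    "Which planet is closest to the Sun?", options, answer_idx=1
)
print(prompt)
```

The intuition is that a wrong option is hard to guess from world knowledge alone, so reproducing it verbatim is more suggestive of memorization than reproducing the correct answer would be.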

Key Findings

Closed-source LLMs often reproduce masked wrong options in MMLU.

Numbers: ChatGPT EM 52%, GPT-4 EM 57% on MMLU (Table 3)

Practical Use: If you see >50% EM on masked MMLU items, suspect training‑data leakage; do not treat reported accuracy as fully generalizable.

Evidence Ref: Table 3; Section 4.2.2
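One way to turn the >50% EM rule of thumb into a check is sketched below. The normalization (lowercasing, stripping punctuation and extra whitespace) is an assumption of this sketch, since the summary does not specify how EM is computed; the helper names and sample strings are hypothetical.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation, collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()


def exact_match_rate(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def flag_contamination(em, threshold=0.5):
    """Apply the >50% EM heuristic; the threshold is a signal, not proof."""
    return em > threshold


# Illustrative model guesses vs. the hidden option text.
preds = ["Venus.", "mars", "Saturn"]
refs = ["Venus", "Mars", "Jupiter"]
em = exact_match_rate(preds, refs)
print(f"EM = {em:.2f}, suspect leakage: {flag_contamination(em)}")
```

A flagged result is best read as a prompt for further checks (retrieval, human review) rather than a verdict.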

Fine-tuning on the test set makes masked exact-match near perfect.

Numbers: Contaminated ChatGPT EM ≈100% after fine‑tuning (Figure 4, §4.3)

Practical Use: Perfect or near‑perfect reproduction of masked test content strongly indicates the model saw the exact examples in training.

Evidence Ref: Figure 4; Section 4.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Exact Match (EM) — TS-Guessing on MMLU	ChatGPT 52%, GPT-4 57%	—	—	MMLU (filtered test items)	Table 3 reports EM for masked wrong option task	Table 3; Section 4.2.2
Exact Match (EM) after deliberate contamination	≈100% EM	ChatGPT pre-finetune EM 52%	+~48 pp	MMLU (contamination probe)	Fine-tuned ChatGPT reproduces masked items nearly perfectly	Figure 4; Section 4.3

What To Try In 7 Days

Run TS-Guessing on your model: mask keywords and wrong options and track EM.

Index public corpora (C4/The Pile) with Pyserini and run question+label queries for overlap.

Add a small human check (50–100 examples) where automated metrics disagree with semantic judgment.
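For the retrieval step in the list above, the scoring idea can be tried on a toy corpus before committing to a full Pyserini index. The sketch below is a pure-Python stand-in for BM25 ranking, not Pyserini itself: the `BM25Index` class, the tokenizer, the `k1`/`b` values, and the three-document corpus are all illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Very simple tokenizer (assumption): lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    """Tiny in-memory Okapi BM25 scorer for illustration only."""

    def __init__(self, docs, k1=0.9, b=0.4):  # k1 and b are illustrative values
        self.k1, self.b = k1, b
        self.doc_tf = [Counter(tokenize(d)) for d in docs]
        self.doc_len = [sum(tf.values()) for tf in self.doc_tf]
        self.avgdl = sum(self.doc_len) / max(len(docs), 1)
        df = Counter()
        for tf in self.doc_tf:
            df.update(tf.keys())  # document frequency per term
        n = len(docs)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, i):
        tf, dl = self.doc_tf[i], self.doc_len[i]
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def search(self, query, k=3):
        """Return up to k (doc_index, score) pairs with a positive score."""
        scored = [(self.score(query, i), i) for i in range(len(self.doc_tf))]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(i, s) for s, i in scored[:k] if s > 0]


# Toy corpus standing in for pretraining text; the query is question + label.
corpus = [
    "The Pile is a large collection of text for language modeling.",
    "Quiz: which planet is closest to the Sun? Answer: Mercury.",
    "Mars is often called the red planet.",
]
index = BM25Index(corpus)
hits = index.search("Which planet is closest to the Sun? Mercury", k=2)
print(hits)
```

Because the scoring is purely lexical, a paraphrased copy of a benchmark item would rank low here, which is why top hits still benefit from manual inspection.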

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Retrieval uses only BM25 and may miss semantic matches (§7).

Indexing and retrieval are slow and resource-heavy (≈2–3 minutes per data point).

When Not To Use

When full training-data access is available (prefer corpus-level n‑gram checks instead).

On very short or indexical questions where masking cannot isolate a keyword.

Failure Modes

False positives: model correctly guesses masked content by chance or world knowledge.

False negatives: contamination exists but model fails to reproduce masked text.
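Both failure modes suggest that exact match alone is a blunt signal. One possible triage step, sketched here under assumptions, is to compute a softer overlap score such as ROUGE-L F1 (listed among the metrics below) and route near-misses to a human reviewer. The whitespace tokenization, the `needs_human_review` helper, and the 0.6 threshold are hypothetical choices for this sketch, not values from the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(pred, ref):
    """ROUGE-L F1 over lowercase whitespace tokens (simplified tokenization)."""
    p, r = pred.lower().split(), ref.lower().split()
    if not p or not r:
        return 0.0
    lcs = lcs_length(p, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def needs_human_review(pred, ref, soft_threshold=0.6):
    """Flag answers that miss exact match but overlap heavily with the reference."""
    exact = pred.strip().lower() == ref.strip().lower()
    return (not exact) and rouge_l_f1(pred, ref) >= soft_threshold


# Illustrative strings: a near-miss and an unrelated guess.
print(needs_human_review("the large red planet", "the red planet"))
print(needs_human_review("Jupiter", "the red planet"))
```

This does not resolve chance or world-knowledge guesses; it only narrows down which items are worth a manual look.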

Core Entities

Models

ChatGPT (GPT-3.5-turbo), GPT-4, Claude-instant-1-100k, Claude-2, LLaMa 2-13B, Mistral-7B

Metrics

Exact Match (EM), Rouge-L F1, BM25, SacreBLEU, BLEURT, GPTScore

Datasets

MMLU, TruthfulQA, HellaSwag, WinoGrande, GSM8K, OpenbookQA, PIQA, The Pile, C4

Benchmarks

MMLU, TruthfulQA, HellaSwag, WinoGrande, GSM8K, OpenbookQA, PIQA

Context Entities

Datasets

The Pile, C4
