Mask-and-retrieve tests show many benchmarks can leak into LLM training

November 16, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper gives practical, low‑barrier checks (retrieval and masking) with concrete numeric signals, but some results depend on manual review and closed‑model probing.

Citations: 8

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, Arman Cohan

Why It Matters For Business

If test examples leak into model training, reported model gains may be inflated; flagging and removing leaked examples preserves honest evaluation and prevents bad product decisions.

Summary TLDR

The paper presents two practical ways to detect benchmark data leaking into LLM training. First, an IR pipeline (Pyserini + BM25) searches open pretraining corpora (The Pile, C4) for overlaps with benchmark items. Second, TS-Guessing masks a keyword in the question or a wrong multiple-choice option and asks the model to fill it in. Strong closed-source models often reproduce the masked benchmark content: ChatGPT and GPT-4 reach 52% and 57% exact match (EM) on masked MMLU options. Fine-tuning ChatGPT on the test set drives EM to ≈100%, a clear sign of contamination risk.

Problem Statement

Benchmark examples that appear in model training data can inflate reported accuracy. Existing n‑gram overlap checks need full training corpora and still miss many leaks. We need lightweight, model‑agnostic probes and retrieval checks to flag probable contamination for both open and closed models.

Main Contribution

A retrieval pipeline (Pyserini + BM25) to search open pretraining corpora (The Pile, C4) for benchmark overlap.

TS-Guessing: a masking protocol that asks LLMs to fill a masked keyword or a masked wrong multiple‑choice option to reveal memorized test items.
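The two TS-Guessing variants can be sketched as simple prompt builders. This is a minimal illustration, not the paper's template: the prompt wording, the `[MASK]` token, and the names `MultipleChoiceItem`, `mask_wrong_option`, and `mask_keyword` are assumptions made here for the example.

```python
import random
from dataclasses import dataclass
from typing import Dict, Tuple

MASK = "[MASK]"  # hypothetical mask token; the paper's exact token may differ


@dataclass
class MultipleChoiceItem:
    question: str
    options: Dict[str, str]  # option letter -> option text
    answer: str              # letter of the correct option


def mask_wrong_option(item: MultipleChoiceItem, rng: random.Random) -> Tuple[str, str]:
    """Hide one randomly chosen incorrect option; return (prompt, hidden_text)."""
    wrong_letters = sorted(k for k in item.options if k != item.answer)
    hidden = rng.choice(wrong_letters)
    lines = [
        f"Fill in the {MASK} in the options below. Reply with the option text only.",
        f"Question: {item.question}",
    ]
    for letter in sorted(item.options):
        text = MASK if letter == hidden else item.options[letter]
        lines.append(f"{letter}. {text}")
    lines.append(f"Correct answer: {item.answer}")
    return "\n".join(lines), item.options[hidden]


def mask_keyword(question: str, keyword: str) -> Tuple[str, str]:
    """Replace the first occurrence of a chosen keyword with the mask token."""
    if keyword not in question:
        raise ValueError("keyword not found in question")
    return question.replace(keyword, MASK, 1), keyword


# Toy item (invented for illustration, not taken from any benchmark).
item = MultipleChoiceItem(
    question="Which planet is closest to the Sun?",
    options={"A": "Venus", "B": "Mercury", "C": "Earth", "D": "Mars"},
    answer="B",
)
prompt, hidden_text = mask_wrong_option(item, random.Random(0))
masked_question, hidden_word = mask_keyword(item.question, "planet")
```

The model's reply to `prompt` is then compared with `hidden_text`. A wrong option is hard to guess from world knowledge alone, which is why reproducing it verbatim hints at memorization.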

Key Findings

Closed-source LLMs often reproduce masked wrong options in MMLU.

Numbers: ChatGPT EM 52%, GPT-4 EM 57% on MMLU (Table 3)

Practical Use: If you see >50% EM on masked MMLU items, suspect training‑data leakage; do not treat reported accuracy as fully generalizable.

Evidence Ref: Table 3; Section 4.2.2
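A rough sketch of how the EM signal could be computed and checked against the 50% rule of thumb above. The normalization (lowercasing, stripping punctuation and extra whitespace) is an assumption of this example, and the function names are hypothetical; the paper may score matches differently.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match_rate(predictions, references) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Toy data (invented): model guesses vs. the masked option texts.
guesses = ["Venus.", " earth", "Jupiter", "mars"]
masked = ["venus", "Earth", "Saturn", "Mars"]

em = exact_match_rate(guesses, masked)
suspect_leak = em > 0.5  # threshold taken from the practical-use note above
```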

Fine-tuning on the test set makes masked exact-match near perfect.

Numbers: Contaminated ChatGPT EM ≈100% after fine‑tuning (Figure 4, §4.3)

Practical Use: Perfect or near‑perfect reproduction of masked test content strongly indicates the model saw the exact examples in training.

Evidence Ref: Figure 4; Section 4.3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Exact Match (EM), TS-Guessing on MMLU | ChatGPT 52%, GPT-4 57% | n/a | n/a | MMLU (filtered test items) | Table 3 reports EM for the masked wrong-option task | Table 3; Section 4.2.2
Exact Match (EM) after deliberate contamination | ≈100% EM | ChatGPT pre-fine-tune EM 52% | ≈ +48 pp | MMLU (contamination probe) | Fine-tuned ChatGPT reproduces masked items nearly perfectly | Figure 4; Section 4.3

What To Try In 7 Days

Run TS-Guessing on your model: mask keywords and wrong options and track EM.

Index public corpora (C4/The Pile) with Pyserini and run question+label queries for overlap.

Add a small human check (50–100 examples) where automated metrics disagree with semantic judgment.
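To make the retrieval step concrete, here is a small pure-Python BM25 scorer run with a question+label query. It is a stand-in for illustration only: it does not use Pyserini, the class name `TinyBM25` and the `k1`/`b` values are assumptions, and the three-document corpus is invented. A real run would query a Lucene index built over C4 or The Pile.

```python
import math
from collections import Counter


def tokenize(text: str):
    """Lowercase and split on any non-alphanumeric character."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


class TinyBM25:
    """Toy in-memory Okapi BM25 index (illustrative stand-in, not Pyserini)."""

    def __init__(self, docs, k1: float = 0.9, b: float = 0.4):
        self.k1, self.b = k1, b
        self.n = len(docs)
        self.doc_tf = [Counter(tokenize(d)) for d in docs]
        self.doc_len = [sum(tf.values()) for tf in self.doc_tf]
        total = sum(self.doc_len)
        self.avg_len = total / self.n if self.n and total else 1.0
        self.df = Counter()
        for tf in self.doc_tf:
            self.df.update(tf.keys())  # document frequency per term

    def idf(self, term: str) -> float:
        df = self.df.get(term, 0)
        return math.log(1 + (self.n - df + 0.5) / (df + 0.5))

    def score(self, query: str, i: int) -> float:
        tf = self.doc_tf[i]
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avg_len)
        return sum(
            self.idf(t) * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in set(tokenize(query))
            if t in tf
        )

    def search(self, query: str, k: int = 3):
        """Return up to k (score, doc_index) pairs with non-zero score, best first."""
        scored = [(self.score(query, i), i) for i in range(self.n)]
        scored = [s for s in scored if s[0] > 0]
        return sorted(scored, key=lambda s: (-s[0], s[1]))[:k]


# Invented mini-corpus standing in for a pretraining dump.
corpus = [
    "The capital of France is Paris, a city on the Seine.",
    "Question: Which planet is closest to the Sun? Answer: Mercury",
    "Bananas are rich in potassium.",
]
index = TinyBM25(corpus)

# Query = benchmark question concatenated with its label.
hits = index.search("Which planet is closest to the Sun? Mercury")
```

Top-ranked documents are candidates only. Lexical overlap still needs a manual or metric-based check before an item is flagged as leaked.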

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Retrieval uses only BM25 and may miss semantic matches (§7).

Indexing and retrieval are slow and resource-heavy (≈2–3 minutes per data point).

When Not To Use

When full training-data access is available (prefer corpus-level n‑gram checks instead).

On very short or indexical questions where masking cannot isolate a keyword.

Failure Modes

False positives: model correctly guesses masked content by chance or world knowledge.

False negatives: contamination exists but model fails to reproduce masked text.

Core Entities

Models

ChatGPT (GPT-3.5-turbo), GPT-4, Claude-instant-1-100k, Claude-2, LLaMa 2-13B, Mistral-7B

Metrics

Exact Match (EM), Rouge-L F1, BM25, SacreBLEU, BLEURT, GPTScore

Datasets

MMLU, TruthfulQA, HellaSwag, WinoGrande, GSM8K, OpenbookQA, PIQA, The Pile, C4

Benchmarks

MMLU, TruthfulQA, HellaSwag, WinoGrande, GSM8K, OpenbookQA, PIQA

Context Entities

Datasets

The Pile, C4