Mask-and-retrieve tests show many benchmarks can leak into LLM training

November 16, 2023 | 7 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

8

Authors

Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, Arman Cohan

Links

Abstract / PDF

Why It Matters For Business

If test examples leak into model training, reported model gains may be inflated; flagging and removing leaked examples preserves honest evaluation and prevents bad product decisions.

Summary TLDR

The paper presents two practical ways to detect benchmark data leaking into LLM training. First, an IR pipeline (Pyserini + BM25) searches open pretraining corpora (The Pile, C4) for overlaps. Second, TS-Guessing masks a keyword or a wrong multiple-choice option and asks the model to fill it in. Strong closed-source models often reproduce masked benchmark content (e.g., ChatGPT and GPT-4 reach 52% and 57% exact match on masked MMLU items), and fine-tuning on the test set drives that to ~100%, a clear sign of contamination risk.

Problem Statement

Benchmark examples that appear in model training data can inflate reported accuracy. Existing n‑gram overlap checks need full training corpora and still miss many leaks. We need lightweight, model‑agnostic probes and retrieval checks to flag probable contamination for both open and closed models.
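
To make the limitation of n-gram checks concrete, here is a minimal sketch of a word-level n-gram overlap score. The function names and the window size `n=8` are illustrative assumptions, not the paper's settings. It shows a verbatim copy scoring 1.0 while a paraphrase of the same question scores 0.0, which is the kind of leak such checks miss.

```python
import re

def ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, punctuation dropped)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(benchmark_text, corpus_text, n=8):
    """Fraction of the benchmark example's n-grams that also occur in corpus_text.

    n=8 is an illustrative choice, not a value taken from the paper.
    """
    bench = ngrams(benchmark_text, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(corpus_text, n)) / len(bench)

question = "What is the capital of France and on which river does it lie"
verbatim = "Quiz 12: What is the capital of France and on which river does it lie? Answer below."
paraphrase = "Which river runs through France's capital city, and what is that city called?"

print(ngram_overlap(question, verbatim))    # verbatim copy in the corpus
print(ngram_overlap(question, paraphrase))  # same content, reworded
```

A check like this also needs the training text itself as `corpus_text`, which is unavailable for closed models.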

Main Contribution

A retrieval pipeline (Pyserini + BM25) to search open pretraining corpora (The Pile, C4) for benchmark overlap.

TS-Guessing: a masking protocol that asks LLMs to fill a masked keyword or a masked wrong multiple‑choice option to reveal memorized test items.

Empirical evaluation showing closed-source LLMs often reproduce masked benchmark content (notably high exact-match on MMLU) and that fine-tuning on a test set makes EM nearly 100%.
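
The wrong-option variant of TS-Guessing can be sketched roughly as follows. The prompt wording, the `[MASK]` token, and the function name `mask_wrong_option` are assumptions for illustration; the paper's exact template may differ. The idea is to hide one incorrect option, reveal the correct answer, and ask the model to restore the hidden text.

```python
import random

MASK = "[MASK]"  # placeholder token; the exact token used in the paper is an assumption

def mask_wrong_option(question, options, answer_idx, rng=None):
    """Build a prompt that hides one wrong option and return (prompt, hidden_text)."""
    rng = rng or random.Random(0)
    wrong = [i for i in range(len(options)) if i != answer_idx]
    hidden_idx = rng.choice(wrong)  # never mask the correct answer
    letters = "ABCDEFGH"
    lines = [
        f"Fill in the {MASK} in the options below with the original wrong option.",
        f"Question: {question}",
        "Options:",
    ]
    for i, opt in enumerate(options):
        shown = MASK if i == hidden_idx else opt
        lines.append(f"{letters[i]}: {shown}")
    lines.append(f"Correct answer: {letters[answer_idx]}")
    lines.append("Reply with the text of the masked option only.")
    return "\n".join(lines), options[hidden_idx]

prompt, hidden = mask_wrong_option(
    "Which city is the capital of France?",
    ["Paris", "Lyon", "Nice", "Lille"],
    answer_idx=0,
    rng=random.Random(1),
)
print(prompt)
```

Because a wrong option is not determined by the question, reproducing it exactly is hard to explain by reasoning alone, which is why a match is treated as a contamination signal.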

Key Findings

Closed-source LLMs often reproduce masked wrong options in MMLU.

Numbers: ChatGPT EM 52%, GPT-4 EM 57% on MMLU (Table 3)

Fine-tuning on the test set makes masked exact-match near perfect.

Numbers: Contaminated ChatGPT EM ≈100% after fine‑tuning (Figure 4, §4.3)

Manual review found notable overlap missed by automatic n‑gram checks.

Numbers: Human judges labeled 23/100 examples as contaminated; Krippendorff's α = 0.8673 (§4.1.2)

TS-Guessing produces nontrivial hits on TruthfulQA.

Numbers: TruthfulQA question-based success ≈16.24% (Table 2 discussion)
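
The EM figures above are exact-match rates between the model's fill-in and the hidden text. A minimal sketch of such a score is below; the normalization (lowercasing, stripping punctuation and extra whitespace) is an assumption, since this summary does not state how the paper normalizes answers.

```python
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match_rate(predictions, references):
    """Share of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Toy model outputs versus the hidden options they were asked to restore.
rate = exact_match_rate(["Lyon.", " nice", "Marseille"], ["lyon", "Nice", "Lille"])
print(f"EM = {rate:.2%}")
```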

Results

Exact Match (EM) — TS-Guessing on MMLU

Value: ChatGPT 52%, GPT-4 57%

Exact Match (EM) after deliberate contamination

Value: ≈100% EM

Baseline: ChatGPT pre-finetune EM 52%

Human-labeled contamination rate in retrieval spot-check

Value: 23%

Retrieval effectiveness (query type)

Value: BM25 = 25.12, Avg F1 = 0.31

Baseline: question-only BM25 = 20.23, Avg F1 = 0.24

Who Should Care

What To Try In 7 Days

Run TS-Guessing on your model: mask keywords and wrong options and track EM.

Index public corpora (C4/The Pile) with Pyserini and run question+label queries for overlap.

Add a small human check (50–100 examples) where automated metrics disagree with semantic judgment.
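
For the retrieval step in the list above, the following is a small pure-Python stand-in for BM25 ranking, useful for seeing what a question+label query does before setting up Pyserini on C4 or The Pile. It is not Pyserini's API; `bm25_scores`, the toy documents, and the `k1`/`b` values are illustrative assumptions, and repeated query terms are counted once for simplicity.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score every document against the query with Okapi BM25.

    k1 and b are illustrative values, not settings taken from the paper.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "The mitochondria is the powerhouse of the cell, as every biology quiz says.",
    "Stock markets fell sharply on Tuesday.",
    "Which organelle is the powerhouse of the cell? Answer: mitochondria.",
]
# Question + label query: the question text followed by the gold answer.
query = "Which organelle is the powerhouse of the cell? mitochondria"
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Top-ranked documents are candidates only; they still need a similarity metric or a human check to decide whether they count as contamination.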

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Retrieval uses only BM25 and may miss semantic matches (§7).
  • Indexing and retrieval are slow and resource heavy (≈2–3 minutes per data point).
  • TS-Guessing relies on models following instructions and may need few‑shot prompts.
  • Automatic text-similarity scores can disagree with human judgment; GPTScore is costlier.
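
Where a model does not follow the fill-in instruction zero-shot, a few-shot version of the keyword-masking prompt might look like the sketch below. The template, the `[MASK]` token, and the helper names are hypothetical, and the keyword is supplied by the caller because this summary does not describe how the paper selects it. Demonstrations should come from text outside the benchmark under test.

```python
MASK = "[MASK]"  # placeholder token; an assumption for illustration

def mask_keyword(question, keyword):
    """Replace the first occurrence of keyword in question with the mask token."""
    if keyword not in question:
        raise ValueError("keyword does not occur in the question")
    return question.replace(keyword, MASK, 1)

def build_few_shot_prompt(demos, question, keyword):
    """Assemble solved demonstrations followed by the masked target question.

    demos is a list of (question, keyword) pairs used as worked examples.
    """
    parts = [f"Fill in the {MASK} with the single missing word or phrase."]
    for demo_question, demo_keyword in demos:
        masked = mask_keyword(demo_question, demo_keyword)
        parts.append(f"Question: {masked}\nAnswer: {demo_keyword}")
    parts.append(f"Question: {mask_keyword(question, keyword)}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    demos=[("The sky is blue on a clear day.", "blue")],
    question="Water boils at 100 degrees Celsius at sea level.",
    keyword="Celsius",
)
print(prompt)
```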

When Not To Use

  • When full training-data access is available — prefer corpus-level n‑gram checks.
  • On very short or indexical questions where masking cannot isolate a keyword.
  • As the sole proof of contamination — it is a strong signal but not definitive.

Failure Modes

  • False positives: model correctly guesses masked content by chance or world knowledge.
  • False negatives: contamination exists but model fails to reproduce masked text.
  • Metric mismatch: high text-similarity score without semantic match, or vice versa.

Core Entities

Models

  • ChatGPT (GPT-3.5-turbo)
  • GPT-4
  • Claude-instant-1-100k
  • Claude-2
  • LLaMA 2-13B
  • Mistral-7B

Metrics

  • Exact Match (EM)
  • ROUGE-L F1
  • BM25
  • SacreBLEU
  • BLEURT
  • GPTScore

Datasets

  • MMLU
  • TruthfulQA
  • HellaSwag
  • WinoGrande
  • GSM8K
  • OpenBookQA
  • PIQA
  • The Pile
  • C4

Benchmarks

  • MMLU
  • TruthfulQA
  • HellaSwag
  • WinoGrande
  • GSM8K
  • OpenBookQA
  • PIQA

Context Entities

Datasets

  • The Pile
  • C4