A low-cost, practical method for detecting whether LLMs have memorized evaluation datasets

August 16, 2023 · 8 min

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

22

Authors

Shahriar Golchin, Mihai Surdeanu

Why It Matters For Business

If an LLM already saw your test data, reported performance is not a real measure of capability; this cheap detection method helps teams vet benchmarks and avoid overclaiming model quality.

Summary TLDR

The paper introduces a simple, low-cost procedure to detect whether a large language model (LLM) has ingested a given dataset split. It prompts the model with a short initial fragment plus the dataset and split name ('guided instruction') and checks if the model completes the fragment exactly or nearly so. Two partition-level rules are proposed: (1) guided completions have significantly higher BLEURT/ROUGE-L overlap than unguided completions, or (2) a GPT-4 few-shot classifier flags at least one exact or two near-exact matches in a 10-instance sample. Tested on GPT-4 and GPT-3.5 across seven datasets, the best method (guided instruction + GPT-4 few-shot classifier) matched human labels in 14

Problem Statement

Benchmark scores for LLMs can be artificially high if test data or dataset instances leaked into model pretraining. Most prior leakage checks need access to pretraining data or heavy compute. The paper asks: can we detect dataset contamination automatically, cheaply, and without access to the model's pretraining corpus?

Main Contribution

A simple, instance-level contamination protocol: feed a random initial fragment + dataset name + split to an LLM (guided instruction) and ask it to finish the instance.

Two partition-level rules to label a split as contaminated: a bootstrap-tested improvement in overlap score (guided vs. general, i.e. unguided, instruction) and a GPT-4 few-shot exact/near-exact classifier applied to the generated completions.

An evaluation on GPT-4 and GPT-3.5 across seven datasets showing the guided+GPT-4 classifier matches human judgment (92%–100%).

A controlled contamination experiment where the authors intentionally fine-tune GPT-3.5 to validate the detection rules.
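The instance-level protocol above can be sketched as follows. This is a minimal illustration, not the authors' code: the prompt wording, the word-boundary split, and the function names `split_instance`, `guided_prompt`, and `general_prompt` are assumptions made for this example.

```python
import random


def split_instance(text: str, rng: random.Random) -> tuple[str, str]:
    """Cut an instance at a random word boundary into (fragment, remainder)."""
    words = text.split()
    if len(words) < 2:
        raise ValueError("instance too short to split")
    cut = rng.randint(1, len(words) - 1)  # keep at least one word on each side
    return " ".join(words[:cut]), " ".join(words[cut:])


def guided_prompt(dataset: str, split: str, fragment: str) -> str:
    """Guided instruction: names the dataset and split (illustrative wording)."""
    return (
        f"You are given the first piece of an instance from the {split} split "
        f"of the {dataset} dataset. Finish the second piece exactly as it "
        f"appears in the dataset.\n\nFirst piece: {fragment}\nSecond piece:"
    )


def general_prompt(fragment: str) -> str:
    """General (unguided) instruction: same task, no dataset or split name."""
    return (
        "Finish the second piece based on the first piece so that the two "
        f"form a single instance.\n\nFirst piece: {fragment}\nSecond piece:"
    )
```

The remainder returned by `split_instance` serves as the reference against which the model's completion is compared, either by overlap score or by the few-shot classifier.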

Key Findings

Guided instruction + GPT-4 few-shot classifier (Algorithm 2) matches human labels nearly perfectly.

Numbers: GPT-4 14/14 (100%); GPT-3.5 13/14 (92.86%) on 14 partitions

GPT-4 shows evidence of having ingested multiple dataset splits used in NLP benchmarks.

Numbers: AG News train+test and WNLI train+test flagged; XSum test flagged

Overlap metrics can detect contamination on strong models but are inconsistent across models.

Numbers: ROUGE-L: GPT-4 13/14 (92.86%) vs GPT-3.5 7/14 (50%)

Prior heuristic (ChatGPT-Cheat?) fails on GPT-4 because safety filters produce 'suspicious' outputs.

Numbers: Strict evaluation: ChatGPT-Cheat? 0/14 correct on GPT-4

Results

Algorithm 2 (Guided + GPT-4 few-shot) success

Value: GPT-4: 14/14 (100%); GPT-3.5: 13/14 (92.86%)

Baseline: Human annotations on same 14 partitions

Algorithm 1 (ROUGE-L overlap) success

Value: GPT-4: 13/14 (92.86%); GPT-3.5: 7/14 (50.00%)

Baseline: Human annotations

ChatGPT-Cheat? method

Value: GPT-4 strict: 0/14 (0%); GPT-4 lenient: 9/14 (64.29%); GPT-3.5 strict: 11/14 (78.57%); GPT-3.5 lenient: 13/14 (92.86%)

Baseline: Human annotations

Controlled contamination validation

Value: Intentional fine-tuning reproduces at least one exact match per contaminated setting

Baseline: Pre-contamination base model produced no exact matches

What To Try In 7 Days

Sample 10 examples from any benchmark split you plan to use.

Run guided instruction completions (include dataset+split+fragment) with your target LLM.

Evaluate the completions with a GPT-4 few-shot prompt. If GPT-4 is unavailable, compute ROUGE-L/BLEURT and compare against unguided completions via bootstrap resampling.
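Once each sampled completion has been labeled, the partition-level rule described in the summary (at least one exact or two near-exact matches in a 10-instance sample) is a few lines of code. The sketch below assumes the labels have already been produced by a classifier or a human annotator; the label strings and the names `sample_instances` and `partition_contaminated` are hypothetical.

```python
import random
from collections import Counter

# Label strings are an assumption for this sketch.
EXACT, NEAR_EXACT, NO_MATCH = "exact", "near-exact", "no match"


def sample_instances(instances: list[str], k: int = 10, seed: int = 0) -> list[str]:
    """Draw up to k instances without replacement, reproducibly."""
    return random.Random(seed).sample(instances, min(k, len(instances)))


def partition_contaminated(labels: list[str]) -> bool:
    """Flag a split if the sample has >= 1 exact or >= 2 near-exact matches."""
    counts = Counter(labels)
    return counts[EXACT] >= 1 or counts[NEAR_EXACT] >= 2


# Usage: one near-exact match alone is not enough to flag the split.
labels = [NEAR_EXACT] + [NO_MATCH] * 9
print(partition_contaminated(labels))
```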

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Does not distinguish contamination source or type (direct instance vs duplicated web text).
  • Relies on a small random sample (10 instances), which can miss sparse contamination.
  • Best-performing decision rule depends on access to GPT-4 for few-shot classification.
  • Overlap metrics can give inconsistent results across model families.
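To get a feel for the small-sample limitation, the following back-of-the-envelope sketch estimates how often a 10-instance sample contains no contaminated instance at all. It rests on simplifying assumptions that are not from the paper: instances are drawn independently from a large split, and a fixed fraction of them is memorized. The function name `miss_probability` is hypothetical.

```python
def miss_probability(contaminated_fraction: float, sample_size: int = 10) -> float:
    """Chance that none of the sampled instances is contaminated.

    Assumes independent draws from a large split (an approximation).
    """
    if not 0.0 <= contaminated_fraction <= 1.0:
        raise ValueError("contaminated_fraction must be in [0, 1]")
    return (1.0 - contaminated_fraction) ** sample_size


# Usage: with 10% of instances memorized, about a third of samples miss them all.
print(round(miss_probability(0.10), 4))  # 0.3487
```

Under these assumptions, sparse contamination is easy to miss, so a negative result on one sample is weak evidence of a clean split.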

When Not To Use

  • When you need to locate the exact origin or source of leaked data.
  • When you lack access to a reliable GPT-4-style classifier for decision-making.
  • For very small datasets where coincidental overlap is common.

Failure Modes

  • False positives when a completion coincidentally matches common web text or templates.
  • False negatives when the model paraphrases memorized content beyond near-exact matching.
  • Safety filters or content redaction can produce 'suspicious' outputs that mask contamination.
  • Small-sample sampling error can miss contaminated instances.

Core Entities

Models

  • GPT-4
  • gpt-4-0613
  • GPT-3.5
  • gpt-3.5-turbo-0613
  • GPT-3.5 base (fine-tuned checkpoints)

Metrics

  • ROUGE-L
  • BLEURT-20
  • bootstrap resampling (10k samples)
  • human annotation (exact/near-exact)
  • few-shot in-context classification (GPT-4)
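The overlap-based rule compares per-instance scores under guided and general instructions using bootstrap resampling. The sketch below is one plausible way to run such a comparison, not the authors' implementation: it assumes the ROUGE-L or BLEURT scores are already computed and paired by instance, and it uses a one-sided paired bootstrap on the mean difference. The name `bootstrap_p_value` and the 0.05 threshold in the usage lines are assumptions.

```python
import random


def bootstrap_p_value(
    guided: list[float],
    general: list[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """One-sided paired bootstrap: share of resamples where guided is not better."""
    if len(guided) != len(general) or not guided:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [g - u for g, u in zip(guided, general)]
    rng = random.Random(seed)
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample the paired differences with replacement.
        mean_diff = sum(rng.choice(diffs) for _ in range(n)) / n
        if mean_diff <= 0:
            not_better += 1
    return not_better / n_resamples


# Usage with made-up scores for a 10-instance sample.
guided_scores = [0.92, 0.88, 0.95, 0.40, 0.91, 0.87, 0.93, 0.35, 0.90, 0.89]
general_scores = [0.31, 0.28, 0.35, 0.38, 0.30, 0.33, 0.29, 0.36, 0.32, 0.27]
p = bootstrap_p_value(guided_scores, general_scores)
print("flag split" if p <= 0.05 else "no evidence from overlap")
```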

Datasets

  • AG News
  • IMDB
  • Yelp Full Reviews
  • SAMSum
  • XSum
  • WNLI
  • RTE
  • GSM8k (used in validation)