A low-cost, practical method that detects whether LLMs have memorized evaluation datasets

August 16, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The method is simple and low-cost and validated across 28 settings with human labels; reliance on GPT-4 for classification is the main constraint.

Citations: 22

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 60%

Authors

Shahriar Golchin, Mihai Surdeanu

Links

Abstract / PDF / Code

Why It Matters For Business

If an LLM already saw your test data, reported performance is not a real measure of capability; this cheap detection method helps teams vet benchmarks and avoid overclaiming model quality.

Who Should Care

Summary TLDR

The paper introduces a simple, low-cost procedure to detect whether a large language model (LLM) has ingested a given dataset split. It prompts the model with a short initial fragment of an instance plus the dataset and split name ('guided instruction') and checks whether the model completes the fragment exactly or nearly so. Two partition-level rules are proposed: a split is flagged as contaminated if (1) guided completions have significantly higher BLEURT/ROUGE-L overlap with the reference than completions from a general (unguided) instruction, or (2) a GPT-4 few-shot classifier flags at least one exact or two near-exact matches in a 10-instance sample. Tested on GPT-4 and GPT-3.5 across seven datasets, the best method (guided instruction + GPT-4 few-shot classifier) matched human labels in 14 of 14 partitions for GPT-4 and 13 of 14 for GPT-3.5.
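To make the difference between the two instruction types concrete, here is a minimal Python sketch that builds a guided prompt (naming the dataset and split) and a general prompt (naming neither) for the same fragment. The prompt wording and the function name `build_prompts` are illustrative assumptions, not the paper's exact text.

```python
def build_prompts(fragment: str, dataset: str, split: str) -> dict:
    """Build a guided and a general (unguided) prompt for one instance.

    The wording below is a hypothetical paraphrase; only the guided
    prompt mentions the dataset name and the split.
    """
    guided = (
        f"You are given the first piece of an instance from the {split} split "
        f"of the {dataset} dataset. Finish the second piece exactly as it "
        "appears in the dataset.\n\n"
        f"First piece: {fragment}\nSecond piece:"
    )
    general = (
        "Finish the second piece based on the first piece so that the two "
        "pieces form a single instance.\n\n"
        f"First piece: {fragment}\nSecond piece:"
    )
    return {"guided": guided, "general": general}


prompts = build_prompts("The council voted on Tuesday to", "XSum", "test")
print(prompts["guided"])
```

Both prompts would be sent to the same model, so that any gain in overlap can be attributed to the dataset and split cue rather than to the fragment alone.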

Problem Statement

Benchmark scores for LLMs can be artificially high if test data or dataset instances leaked into model pretraining. Most prior leakage checks need access to pretraining data or heavy compute. The paper asks: can we detect dataset contamination automatically, cheaply, and without access to the model's pretraining corpus?

Main Contribution

A simple, instance-level contamination protocol: feed a random initial fragment + dataset name + split to an LLM (guided instruction) and ask it to finish the instance.

Two partition-level rules to label a split as contaminated: a statistically significant overlap improvement of guided over general (unguided) completions, tested by bootstrap resampling, and a GPT-4 few-shot exact/near-exact classifier applied to the generated completions.
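The bootstrap-based rule can be sketched roughly as follows. This is a simplified one-sided paired bootstrap over per-instance overlap scores, using only the standard library. The function name, the pairing of scores by instance, and the way the p-value is computed are assumptions for illustration, and the paper's exact test may differ.

```python
import random


def bootstrap_p_value(guided, general, n_resamples=10_000, seed=0):
    """Estimate how often guided completions fail to beat general ones.

    guided / general: per-instance overlap scores (e.g. ROUGE-L) for the
    same instances, in the same order. Returns the fraction of bootstrap
    resamples whose mean score difference is <= 0.
    """
    if len(guided) != len(general) or not guided:
        raise ValueError("score lists must be non-empty and equally long")
    rng = random.Random(seed)
    diffs = [g - u for g, u in zip(guided, general)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample the paired differences with replacement.
        sample_mean = sum(rng.choice(diffs) for _ in range(n)) / n
        if sample_mean <= 0:
            not_better += 1
    return not_better / n_resamples
```

A small returned value suggests the guided instruction yields reliably higher overlap. Comparing it against a significance level such as 0.05 is a choice made here for illustration, not a figure taken from this summary.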

Key Findings

Guided instruction + GPT-4 few-shot classifier (Algorithm 2) matches human labels nearly perfectly.

Numbers: GPT-4 14/14 (100%); GPT-3.5 13/14 (92.86%) on 14 partitions

Practical Use: To detect leakage quickly, run guided completions on a 10-instance sample and evaluate them with a GPT-4 few-shot classifier; this approach agrees closely with human judgment.

Evidence Ref: Table 3; Section 5

GPT-4 shows evidence of having ingested multiple dataset splits used in NLP benchmarks.

Numbers: AG News train+test and WNLI train+test flagged; XSum test flagged

Practical Use: Do not trust evaluations that use those contaminated splits on GPT-4; validate benchmark splits before reporting results.

Evidence Ref: Human evaluation and Table 4 (Section 5)
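The partition-level decision used with the GPT-4 few-shot classifier (at least one exact match or at least two near-exact matches in the sample) reduces to a simple count. A minimal sketch, assuming the classifier's outputs have already been mapped to the hypothetical label strings "exact", "near-exact", and "none":

```python
from collections import Counter


def partition_is_contaminated(labels):
    """Apply the 'one exact or two near-exact' rule to per-instance labels.

    labels: iterable of strings, one per sampled instance. The label
    names are illustrative assumptions, not the classifier's raw output.
    """
    counts = Counter(labels)
    return counts["exact"] >= 1 or counts["near-exact"] >= 2


sample_labels = ["none"] * 8 + ["near-exact", "near-exact"]
print(partition_is_contaminated(sample_labels))  # True: two near-exact matches
```

A single near-exact match is not enough under this rule, which makes the decision less sensitive to one coincidental overlap.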

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Algorithm 2 (Guided + GPT-4 few-shot) success | GPT-4: 14/14 (100%); GPT-3.5: 13/14 (92.86%) | Human annotations on same 14 partitions | | 14 partitions across 7 datasets (train & test/valid) | Table 3; Section 5 | Table 3 |
| Algorithm 1 (ROUGE-L overlap) success | GPT-4: 13/14 (92.86%); GPT-3.5: 7/14 (50.00%) | Human annotations | | 14 partitions | Table 3; Section 5 | Table 3 |

What To Try In 7 Days

Sample 10 examples from any benchmark split you plan to use.

Run guided instruction completions (include dataset+split+fragment) with your target LLM.

Evaluate the completions with a GPT-4 few-shot prompt; if GPT-4 is unavailable, compute ROUGE-L/BLEURT and compare against unguided completions via bootstrap resampling.
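The sampling and scoring steps above could be prototyped as follows. This sketch samples instances, splits each at a random word boundary into a fragment and a reference continuation, and scores a completion with a simplified ROUGE-L F1 (longest common subsequence over lowercased whitespace tokens). All function names are hypothetical, and the simplified score is an assumption that will not exactly match the tokenization or stemming of established ROUGE packages. The model call itself is left out.

```python
import random


def sample_and_split(instances, k=10, seed=0):
    """Sample up to k instances and split each into (fragment, reference)."""
    rng = random.Random(seed)
    picked = rng.sample(instances, min(k, len(instances)))
    pairs = []
    for text in picked:
        words = text.split()
        if len(words) < 2:
            continue  # too short to split into two non-empty pieces
        cut = rng.randint(1, len(words) - 1)
        pairs.append((" ".join(words[:cut]), " ".join(words[cut:])))
    return pairs


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1 between a reference and a model completion."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Scoring the guided and the unguided completion of each fragment against the same reference yields the two paired score lists needed for a bootstrap comparison.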

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Does not distinguish contamination source or type (direct instance vs duplicated web text).

Relies on a small random sample (10 instances), which can miss sparse contamination.
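The sample-size limitation can be quantified with a back-of-the-envelope calculation. The sketch below assumes that instances are drawn independently (a reasonable approximation when the split is much larger than the sample) and that every contaminated instance in the sample would be detected; both are simplifying assumptions, and the function name is hypothetical.

```python
def miss_probability(contaminated_fraction, sample_size=10):
    """Probability that a random sample contains no contaminated instance."""
    if not 0.0 <= contaminated_fraction <= 1.0:
        raise ValueError("contaminated_fraction must be between 0 and 1")
    return (1.0 - contaminated_fraction) ** sample_size


for fraction in (0.05, 0.20, 0.50):
    print(f"{fraction:.0%} contaminated -> miss chance {miss_probability(fraction):.3f}")
```

Under these assumptions, a split in which 5% of instances are memorized would be missed entirely by a 10-instance sample about 60% of the time, while at 20% the miss chance drops to roughly 11%.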

When Not To Use

When you need to locate the exact origin or source of leaked data.

When you lack access to a reliable GPT-4-style classifier for decision-making.

Failure Modes

False positives when a completion coincidentally matches common web text or templates.

False negatives when the model paraphrases memorized content beyond near-exact matching.

Core Entities

Models

GPT-4, gpt-4-0613, GPT-3.5, gpt-3.5-turbo-0613, GPT-3.5 base (fine-tuned checkpoints)

Metrics

ROUGE-L, BLEURT-20, bootstrap resampling (10k samples), human annotation (exact/near-exact), few-shot in-context classification (GPT-4)

Datasets

AG News, IMDB, Yelp Full Reviews, SAMSum, XSum, WNLI, RTE, GSM8k (used in validation)