A low-cost, practical method for detecting whether LLMs have memorized evaluation datasets

Overview

Decision Snapshot: Needs Validation

The method is simple and low-cost and validated across 28 settings with human labels; reliance on GPT-4 for classification is the main constraint.

Citations: 22

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 60%

Authors

Shahriar Golchin, Mihai Surdeanu

Links

Abstract / PDF / Code

Why It Matters For Business

If an LLM already saw your test data, reported performance is not a real measure of capability; this cheap detection method helps teams vet benchmarks and avoid overclaiming model quality.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

The paper introduces a simple, low-cost procedure to detect whether a large language model (LLM) has ingested a given dataset split. It prompts the model with a short initial fragment plus the dataset and split name ('guided instruction') and checks if the model completes the fragment exactly or nearly so. Two partition-level rules are proposed: (1) guided completions have significantly higher BLEURT/ROUGE-L overlap than unguided completions, or (2) a GPT-4 few-shot classifier flags at least one exact or two near-exact matches in a 10-instance sample. Tested on GPT-4 and GPT-3.5 across seven datasets, the best method (guided instruction + GPT-4 few-shot classifier) matched human labels in 14

Problem Statement

Benchmark scores for LLMs can be artificially high if test data or dataset instances leaked into model pretraining. Most prior leakage checks need access to pretraining data or heavy compute. The paper asks: can we detect dataset contamination automatically, cheaply, and without access to the model's pretraining corpus?

Main Contribution

A simple, instance-level contamination protocol: feed a random initial fragment + dataset name + split to an LLM (guided instruction) and ask it to finish the instance.

Two partition-level rules to label a split as contaminated: a bootstrap-tested overlap improvement of guided over general (unguided) instruction, and a GPT-4 few-shot exact/near-exact classifier applied to the generated completions.
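The instance-level protocol can be sketched as a pair of prompt builders. This is a minimal illustration, not the paper's exact prompt text: the function names, the split fractions, and the instruction wording are all assumptions made for the example. The only point carried over from the description above is that the guided prompt names the dataset and split, while the general prompt does not.

```python
from __future__ import annotations

import random


def split_instance(text: str, rng: random.Random,
                   min_frac: float = 0.3, max_frac: float = 0.7) -> tuple[str, str]:
    """Cut an instance at a random word boundary.

    Returns (first_piece, reference_continuation). The 30%-70% window is an
    assumption for illustration, not a value taken from the paper.
    """
    words = text.split()
    if len(words) < 2:
        raise ValueError("instance is too short to split")
    lo = max(1, int(len(words) * min_frac))
    hi = min(len(words) - 1, max(lo, int(len(words) * max_frac)))
    cut = rng.randint(lo, hi)  # inclusive on both ends
    return " ".join(words[:cut]), " ".join(words[cut:])


def build_guided_prompt(first_piece: str, dataset: str, split: str) -> str:
    """Guided instruction: names the dataset and split (hypothetical wording)."""
    return (
        f"You are given the first piece of an instance from the {split} split "
        f"of the {dataset} dataset. Finish the second piece exactly as it "
        f"appears in the dataset.\n\n"
        f"First piece: {first_piece}\nSecond piece:"
    )


def build_general_prompt(first_piece: str) -> str:
    """General (unguided) instruction: no dataset or split name (hypothetical wording)."""
    return (
        "Finish the second piece based on the first piece so that the two "
        "pieces form a single instance.\n\n"
        f"First piece: {first_piece}\nSecond piece:"
    )
```

The reference continuation returned by `split_instance` is what the model's completion is later compared against, whether by an overlap metric or by a classifier.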

Key Findings

Guided instruction + GPT-4 few-shot classifier (Algorithm 2) matches human labels nearly perfectly.

Numbers: GPT-4 14/14 (100%); GPT-3.5 13/14 (92.86%) on 14 partitions

Practical Use: To detect leakage quickly, run guided completions on a 10-instance sample and evaluate with a GPT-4 few-shot classifier; high agreement with human judgment.

Evidence Ref: Table 3; Section 5

GPT-4 shows evidence of having ingested multiple dataset splits used in NLP benchmarks.

Numbers: AG News train+test and WNLI train+test flagged; XSum test flagged

Practical Use: Do not trust evaluations that use those contaminated splits on GPT-4; validate benchmark splits before reporting results.

Evidence Ref: Human evaluation and Table 4 (Section 5)
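The partition-level decision behind the classifier-based rule is simple enough to write down directly. The sketch below assumes the per-instance labels have already been produced (by GPT-4 or by a human) and only applies the threshold stated in the summary: at least one exact match or at least two near-exact matches in the sample. The label strings `"exact"`, `"near-exact"`, and `"none"` are hypothetical names chosen for this example.

```python
from collections import Counter
from typing import Iterable


def partition_contaminated(labels: Iterable[str],
                           min_exact: int = 1,
                           min_near_exact: int = 2) -> bool:
    """Flag a split if the sample has >= 1 exact or >= 2 near-exact matches.

    `labels` holds one label per sampled instance; the label vocabulary is an
    assumption of this sketch.
    """
    counts = Counter(labels)
    return counts["exact"] >= min_exact or counts["near-exact"] >= min_near_exact


# Example: a 10-instance sample with a single near-exact match is not flagged.
sample_labels = ["none"] * 9 + ["near-exact"]
flagged = partition_contaminated(sample_labels)
```

Keeping the thresholds as parameters makes it easy to see how sensitive a verdict is to the cut-off, although any other values would no longer correspond to the rule evaluated in the paper.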

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Algorithm 2 (Guided + GPT-4 few-shot) success	GPT-4: 14/14 (100%); GPT-3.5: 13/14 (92.86%)	Human annotations on same 14 partitions	—	14 partitions across 7 datasets (train & test/valid)	Table 3; Section 5	Table 3
Algorithm 1 (ROUGE-L overlap) success	GPT-4: 13/14 (92.86%); GPT-3.5: 7/14 (50.00%)	Human annotations	—	14 partitions	Table 3; Section 5	Table 3
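Algorithm 1 in the table above scores completions by ROUGE-L overlap with the reference continuation. For readers unfamiliar with the metric, here is a simplified sketch of ROUGE-L as an F1 score over the longest common subsequence of tokens. It assumes lowercase whitespace tokenization and no stemming, so its values will not exactly match those of an official ROUGE implementation; it is meant to show what is being measured, not to reproduce the paper's numbers.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L F1 on lowercase whitespace tokens (an assumption)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A verbatim completion scores 1.0, while a completion that shares only scattered words with the reference scores much lower, which is why a large gap between guided and general completions is treated as a contamination signal.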

What To Try In 7 Days

Sample 10 examples from any benchmark split you plan to use.

Run guided instruction completions (include dataset+split+fragment) with your target LLM.

Evaluate completions with a GPT-4 few-shot prompt. If GPT-4 is unavailable, compute ROUGE-L or BLEURT for both guided and unguided completions and compare the two with bootstrap resampling.
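For the fallback in the last step, one way to compare guided and unguided scores is a one-sided paired bootstrap over the sampled instances. The sketch below is an assumption about how such a test could be set up, not the paper's exact procedure: it resamples per-instance score differences with replacement and reports the fraction of resamples in which guided completions are not better on average. The function name and the fixed seed are illustrative; the 10,000-resample default mirrors the figure listed under Metrics.

```python
import random
from typing import Sequence


def bootstrap_p_value(guided: Sequence[float],
                      general: Sequence[float],
                      n_resamples: int = 10_000,
                      seed: int = 0) -> float:
    """One-sided paired bootstrap: small values suggest guided > general.

    `guided[i]` and `general[i]` are overlap scores (e.g. ROUGE-L) for the
    same instance under the two instructions.
    """
    if len(guided) != len(general) or not guided:
        raise ValueError("need two equally sized, non-empty score lists")
    rng = random.Random(seed)
    diffs = [g - u for g, u in zip(guided, general)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample instances with replacement and average the differences.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total / n <= 0:
            not_better += 1
    return not_better / n_resamples
```

With only 10 instances the test has limited power, so a non-significant result is weak evidence of a clean split rather than proof of one.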

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/shahriargolchin/time-travel-in-llms

Risks & Boundaries

Limitations

Does not distinguish contamination source or type (direct instance vs duplicated web text).

Relies on a small random sample (10 instances), which can miss sparse contamination.
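The second limitation can be made concrete with a back-of-the-envelope calculation. The sketch below assumes each sampled instance is independently contaminated with the same probability and asks how often a sample contains no contaminated instance at all. Both the independence assumption and the function name are illustrative, and the result is optimistic: even a sampled contaminated instance may not be reproduced closely enough to be flagged.

```python
def miss_probability(contaminated_fraction: float, sample_size: int = 10) -> float:
    """Probability that a random sample contains no contaminated instance.

    Assumes independent draws with a fixed contamination rate, which is a
    reasonable approximation when the split is much larger than the sample.
    """
    if not 0.0 <= contaminated_fraction <= 1.0:
        raise ValueError("contaminated_fraction must be between 0 and 1")
    return (1.0 - contaminated_fraction) ** sample_size


# If 5% of a split leaked, a 10-instance sample misses it about 60% of the time.
p_miss_sparse = miss_probability(0.05)
```

Under these assumptions, contamination affecting half of a split is almost never missed, while contamination affecting a few percent is missed more often than it is caught, so a clean verdict on 10 instances says little about sparse leakage.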

When Not To Use

When you need to locate the exact origin or source of leaked data.

When you lack access to a reliable GPT-4-style classifier for decision-making.

Failure Modes

False positives when a completion coincidentally matches common web text or templates.

False negatives when the model paraphrases memorized content beyond near-exact matching.

Core Entities

Models

GPT-4, gpt-4-0613, GPT-3.5, gpt-3.5-turbo-0613, GPT-3.5 base (fine-tuned checkpoints)

Metrics

ROUGE-L, BLEURT-20, bootstrap resampling (10k samples), human annotation (exact/near-exact), few-shot in-context classification (GPT-4)

Datasets

AG News, IMDB, Yelp Full Reviews, SAMSum, XSum, WNLI, RTE, GSM8k (used in validation)
