Overview
Production Readiness
0.8
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
22
Why It Matters For Business
If an LLM saw your test data during training, its reported scores do not measure real capability. This cheap detection method helps teams vet benchmarks and avoid overclaiming model quality.
Summary TLDR
The paper introduces a simple, low-cost procedure to detect whether a large language model (LLM) has ingested a given dataset split. It prompts the model with a short initial fragment plus the dataset and split name ('guided instruction') and checks if the model completes the fragment exactly or nearly so. Two partition-level rules are proposed: (1) guided completions have significantly higher BLEURT/ROUGE-L overlap than unguided completions, or (2) a GPT-4 few-shot classifier flags at least one exact or two near-exact matches in a 10-instance sample. Tested on GPT-4 and GPT-3.5 across seven datasets, the best method (guided instruction + GPT-4 few-shot classifier) matched human labels in 14
Problem Statement
Benchmark scores for LLMs can be artificially high if test data or dataset instances leaked into model pretraining. Most prior leakage checks need access to pretraining data or heavy compute. The paper asks: can we detect dataset contamination automatically, cheaply, and without access to the model's pretraining corpus?
Main Contribution
A simple, instance-level contamination protocol: feed a random initial fragment + dataset name + split to an LLM (guided instruction) and ask it to finish the instance.
Two partition-level rules to label a split as contaminated: a bootstrap-tested overlap improvement (guided instruction vs. general, i.e. unguided, instruction) and a GPT-4 few-shot exact/near-exact classifier applied to the generated completions.
An evaluation on GPT-4 and GPT-3.5 across seven datasets showing the guided+GPT-4 classifier matches human judgment (92%–100%).
A controlled contamination experiment where the authors intentionally fine-tune GPT-3.5 to validate the detection rules.
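The difference between the guided and general instructions can be sketched as two prompt builders. This is a minimal illustration only: the prompt wording, the function names `split_instance`, `build_guided_prompt`, and `build_general_prompt`, and the fixed `ratio` cut point are assumptions made for this example, not the paper's exact prompts or splitting procedure.

```python
from typing import Tuple


def split_instance(text: str, ratio: float = 0.5) -> Tuple[str, str]:
    """Split an instance into an initial fragment and the held-back remainder.

    The cut point (a word-level ratio) is an assumption for illustration.
    """
    words = text.split()
    cut = max(1, int(len(words) * ratio))
    return " ".join(words[:cut]), " ".join(words[cut:])


def build_guided_prompt(dataset: str, split: str, fragment: str) -> str:
    """Guided instruction: names the dataset and split (hypothetical wording)."""
    return (
        f"You are given the first piece of an instance from the {split} split "
        f"of the {dataset} dataset. Finish the second piece exactly as it "
        f"appears in the dataset.\n\n"
        f"First piece: {fragment}\nSecond piece:"
    )


def build_general_prompt(fragment: str) -> str:
    """General (unguided) instruction: no dataset or split name (hypothetical wording)."""
    return (
        "Finish the second piece based on the first piece so that the two "
        "pieces form a single instance.\n\n"
        f"First piece: {fragment}\nSecond piece:"
    )
```

The held-back remainder returned by `split_instance` serves as the reference against which each completion is compared.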
Key Findings
Guided instruction + GPT-4 few-shot classifier (Algorithm 2) matches human labels in 92%–100% of cases.
GPT-4 shows evidence of having ingested multiple dataset splits used in NLP benchmarks.
Overlap metrics can detect contamination on strong models but are inconsistent across models.
The prior heuristic (ChatGPT-Cheat?) fails on GPT-4 because safety filters lead it to return 'suspicious' outputs instead of a clear verdict.
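The partition-level rule behind Algorithm 2 (at least one exact or two near-exact matches in the sample) reduces to a simple count over per-instance labels. A minimal sketch, assuming the classifier's labels are available as the strings "exact", "near-exact", and "none"; the label strings and the function name are hypothetical:

```python
from collections import Counter
from typing import Iterable


def partition_is_contaminated(labels: Iterable[str]) -> bool:
    """Flag a split if the sample has >= 1 exact or >= 2 near-exact matches.

    Label strings are an assumption made for this illustration.
    """
    counts = Counter(labels)
    return counts["exact"] >= 1 or counts["near-exact"] >= 2
```

A single near-exact match in a 10-instance sample is therefore not enough to flag the split on its own.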
Results
Algorithm 2 (Guided + GPT-4 few-shot) success
Algorithm 1 (ROUGE-L overlap) success
ChatGPT-Cheat? method
Controlled contamination validation
Who Should Care
What To Try In 7 Days
Sample 10 examples from any benchmark split you plan to use.
Run guided instruction completions (include dataset+split+fragment) with your target LLM.
Evaluate completions with a GPT-4 few-shot prompt. If GPT-4 is unavailable, compute ROUGE-L/BLEURT for both guided and unguided completions and compare the two via bootstrap resampling.
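The bootstrap comparison in the last step can be sketched as a paired, one-sided resampling test over per-instance overlap scores. This is one common form of the test and an assumption here; the summary does not specify the exact statistic the authors use. The function name and the `seed` parameter are hypothetical.

```python
import random
from statistics import mean
from typing import Sequence


def bootstrap_p_value(
    guided: Sequence[float],
    general: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Estimate how often guided scores fail to beat general scores.

    Resamples paired per-instance differences with replacement and returns
    the fraction of resamples whose mean difference is <= 0.
    """
    if len(guided) != len(general) or not guided:
        raise ValueError("score lists must be non-empty and the same length")
    rng = random.Random(seed)
    diffs = [g - u for g, u in zip(guided, general)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if mean(sample) <= 0:
            not_better += 1
    return not_better / n_resamples
```

A small returned value (for example, below a chosen significance threshold such as 0.05) suggests that naming the dataset and split reliably raises overlap, which is the signal the overlap-based rule looks for.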
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Does not distinguish contamination source or type (direct instance vs duplicated web text).
- Relies on a small random sample (10 instances), which can miss sparse contamination.
- Best-performing decision rule depends on access to GPT-4 for few-shot classification.
- Overlap metrics can give inconsistent results across model families.
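The small-sample limitation can be made concrete with a back-of-the-envelope calculation. The sketch below assumes instances are drawn independently from a large split and that every contaminated instance in the sample would be detected; both are simplifying assumptions, and the function name is hypothetical.

```python
def miss_probability(contaminated_fraction: float, sample_size: int = 10) -> float:
    """Probability that a random sample contains no contaminated instance.

    Assumes independent draws, i.e. (1 - p) ** n.
    """
    if not 0.0 <= contaminated_fraction <= 1.0:
        raise ValueError("contaminated_fraction must be in [0, 1]")
    return (1.0 - contaminated_fraction) ** sample_size


# If 10% of a split leaked, a 10-instance sample misses it about 35% of the time.
print(round(miss_probability(0.10), 2))  # 0.35
```

Under these assumptions, sparse contamination is easy to miss, which is why a clean result on 10 instances should not be read as proof that a split is clean.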
When Not To Use
- When you need to locate the exact origin or source of leaked data.
- When you lack access to a reliable GPT-4-style classifier for decision-making.
- For very small datasets where coincidental overlap is common.
Failure Modes
- False positives when a completion coincidentally matches common web text or templates.
- False negatives when the model paraphrases memorized content beyond near-exact matching.
- Safety filters or content redaction can produce 'suspicious' outputs that mask contamination.
- Sampling error from the small sample can miss contaminated instances.
Core Entities
Models
- GPT-4
- gpt-4-0613
- GPT-3.5
- gpt-3.5-turbo-0613
- GPT-3.5 base (fine-tuned checkpoints)
Metrics
- ROUGE-L
- BLEURT-20
- bootstrap resampling (10k samples)
- human annotation (exact/near-exact)
- few-shot in-context classification (GPT-4)
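For readers unfamiliar with ROUGE-L, the following is a simplified sketch of its F-measure based on the longest common subsequence (LCS) of tokens. It assumes lowercased whitespace tokenization and no stemming, so scores will differ from established ROUGE implementations; the function names are hypothetical.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L F1 between a reference and a model completion."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

In the detection setting, `reference` would be the held-back portion of the dataset instance and `candidate` the model's completion.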
Datasets
- AG News
- IMDB
- Yelp Full Reviews
- SAMSum
- XSum
- WNLI
- RTE
- GSM8k (used in validation)

