Overview
Production Readiness
0.4
Novelty Score
0.3
Cost Impact Score
0.6
Citation Count
16
Why It Matters For Business
Benchmark contamination can make closed-source LLMs appear artificially better. Buyers and product teams should not trust out-of-the-box leaderboard claims for closed models without checking data provenance and evaluation parity.
Summary TLDR
The authors systematically reviewed 255 papers that used OpenAI's GPT-3.5/GPT-4 and found that 90 of the 212 relevant papers (≈42%) used the web interface in ways that could expose data to OpenAI. Across those papers they document ~4.7M leaked benchmark samples from 263 datasets. They also find common evaluation problems: missing or unfair baselines, small-sample comparisons, incomplete release of prompts and code, and model versions that are rarely reported. The paper lists practical recommendations to reduce contamination and improve fairness and reproducibility.
Problem Statement
Data that users submit to closed-source LLMs such as GPT-3.5 and GPT-4 through the web interface may, per OpenAI policy, be retained and used by the vendor. Researchers often evaluate these models through the web interface and feed benchmark examples to them in the process. This can cause indirect data leakage (user-supplied test data becoming training data), which undermines the fairness of comparisons, reproducibility, and the trustworthiness of published claims.
Main Contribution
Systematic review of 255 papers evaluating GPT-3.5/GPT-4, with 212 deemed relevant after screening.
Quantification of indirect data leakage: ~4.7M benchmark samples from 263 datasets exposed via browser-based use.
Analysis of evaluation malpractices: unfair comparisons, missing baselines, incomplete reproducibility artifacts, and poor reporting of model versions.
Checklist of suggested practices for evaluating closed-source LLMs, plus a public collaborative leak registry at https://leak-llm.github.io/.
Key Findings
Many published evaluations leaked data to OpenAI via the web interface.
A large volume of benchmark data was exposed through research experiments.
Most leaked datasets were exposed almost entirely.
Reproducibility artifacts are unevenly shared.
Comparisons are often missing or unfair, and many evaluations are underpowered.
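To illustrate why small-sample comparisons are underpowered, the sketch below computes a confidence interval for an accuracy score at several sample sizes. It uses the Wilson score interval, which is a standard statistical tool chosen here for illustration and is not a method taken from the paper; the function name and the 80% accuracy figure are hypothetical.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Same observed accuracy (80%, an illustrative value) at three sample sizes.
for n in (50, 500, 5000):
    low, high = wilson_interval(round(0.8 * n), n)
    print(f"n={n:5d}  interval=({low:.3f}, {high:.3f})  width={high - low:.3f}")
```

The interval narrows as the sample grows, so a gap of a few points between two models measured on a few dozen samples can sit entirely inside the uncertainty of either score.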
Results
Papers reviewed: 255 (212 relevant after screening)
Papers that leaked data via browser access: 90 (≈42% of relevant papers)
Leaked samples documented: ~4.7M from 263 datasets
Leak severity distribution (datasets)
Prompts published by papers
Code/data repository provided
Peer-reviewed vs pre-prints
Who Should Care
What To Try In 7 Days
Check any closed-model evaluation you rely on: did it use the web interface or the API? Prefer API or business plans that opt out of training.
Re-run a small key benchmark sample on both the closed model and open alternatives using the same sampled inputs.
Publish and archive prompts, exact sample IDs, model version, and evaluation dates for your internal evaluations.
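The last two steps above can be sketched as a fixed-seed sample plus an archived manifest. This is a minimal sketch under stated assumptions: the function names, the manifest fields, the ID range, the prompt text, and the model version string are all hypothetical placeholders, not artifacts from the paper.

```python
import hashlib
import json
import random
from datetime import date


def sample_ids(all_ids, k, seed=0):
    """Return a reproducible sample of k IDs, sorted for stable archiving."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(all_ids), k))


def build_manifest(dataset, split, ids, prompt, model_version, run_date):
    """Bundle what someone would need to re-run the same evaluation later."""
    return {
        "dataset": dataset,
        "split": split,
        "sample_ids": list(ids),
        "prompt_template": prompt,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "evaluation_date": run_date.isoformat(),
    }


# Hypothetical values: 1000 example IDs, a placeholder prompt and model version.
ids = sample_ids(range(1000), k=50, seed=13)
manifest = build_manifest(
    dataset="GSM8K",
    split="test",
    ids=ids,
    prompt="Solve step by step: {question}",
    model_version="hypothetical-model-version",
    run_date=date(2023, 10, 1),
)
print(json.dumps(manifest, indent=2))
```

Feeding the same `sample_ids` output to the closed model and to each open alternative keeps the comparison on identical inputs, and the archived manifest records which samples were exposed and when.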
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Survey relies on publicly reported experiments and may miss unpublished leaks.
- In some papers the split or sample details were unclear; the authors made best-effort assumptions where these could not be clarified.
- Focus limited to OpenAI's ChatGPT/GPT-3.5/GPT-4 and to the review period ending Oct 2023.
When Not To Use
- Do not use this paper as evidence about contamination of individual closed-model training corpora prior to Nov 2022.
- Do not use it to infer the exact performance impact of leaks on model scores without follow-up experiments.
Failure Modes
- Under-counting: unreported experiments could hide more leaked data.
- Assumption errors: when the split or sample was unspecified, the whole split was assumed to have been used, which may overestimate the leaked volume.
- Time sensitivity: vendor policies and model update schedules change, so the registry can go out of date.
Core Entities
Models
- GPT-3.5
- GPT-4
- ChatGPT (combined reference to web interface models)
Metrics
- counts of leaked samples
- leak severity (% of split exposed)
- presence/absence of reproducibility artifacts
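The first two metrics reduce to simple arithmetic, sketched below. The record layout, dataset names, and counts are made-up illustrative values, not figures from the paper or the registry.

```python
def leak_severity(exposed_samples, split_size):
    """Percentage of a dataset split that was exposed to the model."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    if not 0 <= exposed_samples <= split_size:
        raise ValueError("exposed_samples must be between 0 and split_size")
    return 100.0 * exposed_samples / split_size


# Hypothetical records; names and numbers are for illustration only.
records = [
    {"dataset": "example-qa", "split": "test", "exposed": 500, "split_size": 2000},
    {"dataset": "example-dialogue", "split": "test", "exposed": 1000, "split_size": 1000},
]

total_leaked = sum(r["exposed"] for r in records)
for r in records:
    pct = leak_severity(r["exposed"], r["split_size"])
    print(f'{r["dataset"]}/{r["split"]}: {pct:.1f}% of split exposed')
print("total leaked samples:", total_leaked)
```

When a paper does not report which subset it used, setting `exposed` equal to `split_size` gives an upper bound on severity, which is the overestimation risk noted under Failure Modes.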
Datasets
- SAMSum
- MultiWOZ 2.4
- SemEval-2016 Task 6
- GSM8K
- SQuALITY
- many others (263 datasets total documented)
Benchmarks
- question answering suites
- natural language inference suites
- NLG benchmarks
- dialogue benchmarks

