Overview
This paper is a careful, systematic literature survey with concrete counts and a public registry. It is useful for changing policy and evaluation practice, but it does not offer new modeling methods.
Citations: 16
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/7
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 30%
Why It Matters For Business
Benchmark contamination can make closed-source LLMs appear artificially better. Buyers and product teams should not trust out-of-the-box leaderboard claims for closed models without checking data provenance and evaluation parity.
Summary TLDR
The authors systematically reviewed 255 papers that used OpenAI's GPT-3.5/GPT-4 and found that 90 of the 212 relevant papers (≈42%) used the web interface in ways that could expose data to OpenAI. Across those papers they document ~4.7M leaked benchmark samples from 263 datasets. They also find common evaluation problems: missing or unfair baselines, small-sample comparisons, incomplete release of prompts and code, and model versions that are rarely reported. The paper lists practical recommendations to reduce contamination and improve fairness and reproducibility.
Problem Statement
Closed-source LLMs like GPT-3.5 and GPT-4 receive user data (per OpenAI policy). Researchers often evaluate these models through the web interface and may feed benchmark examples to them. This can cause indirect data leakage (user-supplied test data becoming training data), undermining fairness of comparisons, reproducibility, and the trustworthiness of published claims.
Main Contribution
Systematic review of 255 papers evaluating GPT-3.5/GPT-4, with 212 deemed relevant after screening.
Quantification of indirect data leakage: ~4.7M benchmark samples from 263 datasets exposed via browser-based use.
Key Findings
Many published evaluations leaked data to OpenAI via the web interface.
A large volume of benchmark data was exposed through research experiments.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Papers reviewed | 255 total identified; 212 relevant after screening | — | — | — | Sections 4 and 5 | Section 5 |
| Papers that leaked data via browser access | 90 papers (~42% of relevant) | — | — | — | Section 5.1 | Section 5.1 |
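
The ~42% figure in the table can be checked directly from the reported counts. The short sketch below redoes that arithmetic, using only the three counts quoted in this summary; the variable names are illustrative.

```python
# Counts as reported in this summary.
papers_identified = 255  # total papers found by the review
papers_relevant = 212    # papers kept after screening
papers_leaking = 90      # papers that leaked data via browser access

# Share relative to the relevant papers (the denominator the summary uses).
share_of_relevant = papers_leaking / papers_relevant

# Share relative to everything identified, shown only for comparison.
share_of_identified = papers_leaking / papers_identified

print(f"{share_of_relevant:.1%} of relevant papers")
print(f"{share_of_identified:.1%} of all identified papers")
```

The choice of denominator matters: the 42% holds against the 212 relevant papers, and the share is lower if measured against all 255 identified papers.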
What To Try In 7 Days
Check any closed-model evaluation you rely on: did it use the web interface or the API? Prefer API or business plans that opt out of training.
Re-run a small key benchmark sample on both the closed model and open alternatives using the same sampled inputs.
Publish and archive prompts, exact sample IDs, model version, and evaluation dates for your internal evaluations.
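
One way to act on the last two items is to draw the benchmark sample with a fixed seed and store it with the prompt, model version, and date in a single manifest. The sketch below is a minimal illustration under stated assumptions: the function names `sample_ids` and `build_manifest`, the manifest fields, and all example values are hypothetical and not taken from the paper.

```python
import hashlib
import json
import random
from datetime import date


def sample_ids(all_ids, k, seed):
    """Draw a reproducible sample of example IDs.

    The IDs are sorted first so the result does not depend on input order.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(all_ids), k))


def build_manifest(dataset, ids, prompt, model_version, eval_date):
    """Bundle what is needed to repeat the same run on another model."""
    return {
        "dataset": dataset,
        "sample_ids": ids,
        "prompt": prompt,
        # The hash lets readers verify the archived prompt was not altered.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "eval_date": eval_date.isoformat(),
    }


# Hypothetical benchmark with 1,000 examples identified by integer IDs.
ids = sample_ids(range(1000), k=50, seed=13)
manifest = build_manifest(
    dataset="example-benchmark/test",           # placeholder dataset name
    ids=ids,
    prompt="Answer the question: {question}",   # placeholder prompt
    model_version="model-version-placeholder",  # record the exact version used
    eval_date=date(2024, 1, 1),                 # placeholder date
)
print(json.dumps(manifest, indent=2)[:300])
```

Feeding the same `sample_ids` to the closed model and to each open alternative keeps the comparison on identical inputs, and the archived manifest records which inputs those were.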
Risks & Boundaries
Limitations
Survey relies on publicly reported experiments and may miss unpublished leaks.
In some papers, split or sample details were unclear; the authors made best-effort assumptions where these were not clarified.
When Not To Use
Do not use this paper as evidence about contamination of individual closed-model training corpora prior to Nov 2022.
Do not use it to infer the exact performance impact of leaks on model scores without follow-up experiments.
Failure Modes
Under-counting: unreported experiments could hide more leaked data.
Assumption errors: when the split or sample was unspecified, the authors assumed the whole split was used, which may overestimate the leaked volume.
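
The effect of the whole-split assumption can be bracketed by computing two totals: one that counts only explicitly reported sample sizes, and one that substitutes the full split when the size is missing. The sketch below illustrates that idea only. The `Experiment` record, the `leaked_bounds` helper, and every number in it are made up for illustration and are not the paper's data or method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Experiment:
    split_size: int                  # size of the split the paper evaluated on
    reported_samples: Optional[int]  # None when the paper does not say


def leaked_bounds(experiments):
    """Return (reported_only, whole_split_assumed) sample totals."""
    # Count only what papers state explicitly.
    reported_only = sum(
        e.reported_samples for e in experiments if e.reported_samples is not None
    )
    # Fall back to the whole split when the sample size is unspecified.
    whole_split_assumed = sum(
        e.reported_samples if e.reported_samples is not None else e.split_size
        for e in experiments
    )
    return reported_only, whole_split_assumed


# Made-up numbers for illustration only.
experiments = [
    Experiment(split_size=1000, reported_samples=200),
    Experiment(split_size=5000, reported_samples=None),
    Experiment(split_size=300, reported_samples=300),
]
print(leaked_bounds(experiments))  # (500, 5500)
```

The gap between the two totals shows how much of an estimate rests on the assumption rather than on reported numbers. Neither total accounts for unreported experiments, which is the separate under-counting risk noted above.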

