Systematic review shows GPT-3.5/GPT-4 were exposed to ~4.7M benchmark examples and many evaluations are unfair or unreproducible

Overview

Decision Snapshot: Needs Validation

This paper is a careful, systematic literature survey with concrete counts and a public registry. It is useful for changing policy and evaluation practice, but it does not propose new modeling methods.

Citations: 16

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 30%

Authors

Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, Ondřej Dušek


Why It Matters For Business

Benchmark contamination can make closed-source LLMs appear artificially better. Buyers and product teams should not trust out-of-the-box leaderboard claims for closed models without checking data provenance and evaluation parity.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The authors systematically reviewed 255 papers that used OpenAI's GPT-3.5/GPT-4, 212 of which were relevant after screening, and found that 90 papers (≈42% of the relevant work) accessed the models through the web interface in ways that could expose data to OpenAI. Across those papers they document ~4.7M leaked benchmark samples from 263 datasets. They also find common evaluation problems: missing or unfair baselines, small-sample comparisons, incomplete release of prompts and code, and infrequent reporting of the model version. The paper recommends practices to reduce contamination and improve fairness and reproducibility.

Problem Statement

Closed-source LLMs like GPT-3.5 and GPT-4 receive user data (per OpenAI policy). Researchers often evaluate these models through the web interface and may feed benchmark examples to them. This can cause indirect data leakage (user-supplied test data becoming training data), undermining fairness of comparisons, reproducibility, and the trustworthiness of published claims.

Main Contribution

Systematic review of 255 papers evaluating GPT-3.5/GPT-4, with 212 deemed relevant after screening.

Quantification of indirect data leakage: ~4.7M benchmark samples from 263 datasets exposed via browser-based use.

Key Findings

Many published evaluations leaked data to OpenAI via the web interface.

Numbers: 90 papers (≈42% of relevant papers) used browser access, under which submitted data could be used to improve the models

Practical Use: Assume published GPT-3.5/GPT-4 results may be contaminated; use API/business access or opt out, and report the access method when evaluating closed models.

Evidence Ref: Abstract; Section 5.1

A large volume of benchmark data was exposed through research experiments.

Numbers: 4,714,753 samples from 263 unique datasets documented as exposed

Practical Use: Do not assume a benchmark is unseen by a closed LLM; re-run baselines on the same sampled data and avoid using the web interface for test data.

Evidence Ref: Abstract; Section 5.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Papers reviewed	255 total identified; 212 relevant after screening	—	—	—	Section 4 and 5	Section 5
Papers that leaked data via browser access	90 papers (~42% of relevant)	—	—	—	Section 5.1	Section 5.1
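
The headline figures can be re-derived from the counts reported above. The short Python check below is only illustrative arithmetic: the variable names are ours, and the per-dataset mean is a derived figure, not one reported in the paper.

```python
# Counts as reported in this summary (Sections 5 and 5.1 of the paper).
papers_relevant = 212
papers_browser_access = 90
leaked_samples = 4_714_753
leaked_datasets = 263

# Share of relevant papers that used browser access.
share_browser = papers_browser_access / papers_relevant

# Mean exposed samples per dataset (derived here, not a figure from the paper).
mean_per_dataset = leaked_samples / leaked_datasets

print(f"Browser-access share: {share_browser:.1%}")
print(f"Leaked samples (millions): {leaked_samples / 1e6:.1f}")
print(f"Mean samples per dataset: {mean_per_dataset:,.0f}")
```

The share comes out at about 42.5%, consistent with the ≈42% quoted, and the sample total rounds to the ~4.7M used in the headline.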

What To Try In 7 Days

Check any closed-model evaluation you rely on: did it use the web interface or the API? Prefer API/business plans that opt out of training.

Re-run a small key benchmark sample on both the closed model and open alternatives using the same sampled inputs.

Publish and archive prompts, exact sample IDs, model version, and evaluation dates for your internal evaluations.
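
The steps above can be sketched as a small script: draw one seeded sample of IDs, reuse it for every model, and write a manifest recording what a reader would need to repeat the run. This is a minimal sketch, not a method from the paper; the function names, the manifest fields, and all example values (dataset name, model version string, date) are hypothetical.

```python
import hashlib
import json
import random
from datetime import date


def sample_ids(all_ids, k, seed=0):
    """Draw a reproducible sample so every model sees the same inputs."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(all_ids), k))


def build_manifest(dataset, ids, prompt, model_version, access_method, run_date):
    """Collect the details worth archiving alongside evaluation results."""
    return {
        "dataset": dataset,
        "sample_ids": ids,
        # The hash lets readers verify the exact prompt text that was used.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        "model_version": model_version,
        "access_method": access_method,  # e.g. "api" or "web"
        "run_date": run_date.isoformat(),
    }


# Hypothetical example values, for illustration only.
ids = sample_ids(range(1000), k=5, seed=13)
manifest = build_manifest(
    dataset="example-benchmark/test",
    ids=ids,
    prompt="Summarize the dialogue:\n{dialogue}",
    model_version="example-model-2024-01-01",
    access_method="api",
    run_date=date(2024, 1, 15),
)
print(json.dumps(manifest, indent=2))
```

Passing the same `seed` and ID pool reproduces the same sample, so the closed model and any open alternative can be scored on identical inputs.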

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://leak-llm.github.io/

Data URLs

https://leak-llm.github.io/

Risks & Boundaries

Limitations

The survey relies on publicly reported experiments and may miss unpublished leaks.

In some papers, split or sample details were unclear; the authors made best-effort assumptions where these were not clarified.

When Not To Use

Do not use this paper as evidence about contamination of individual closed-model training corpora prior to Nov 2022.

Do not use it to infer the exact performance impact of leaks on model scores without follow-up experiments.

Failure Modes

Under-counting: unreported experiments could hide more leaked data.

Assumption errors: when the split or sample size was unspecified, the whole split was assumed to have been used, which may overestimate the leaked volume.
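
To make the whole-split assumption concrete, leak severity (the percentage of a split exposed) can be computed so that unreported sample sizes are flagged as upper bounds rather than measurements. This is a minimal sketch under that reading of the metric; the function name and the example numbers are hypothetical and not taken from the authors' code.

```python
from typing import Optional, Tuple


def leak_severity(split_size: int, exposed: Optional[int] = None) -> Tuple[float, bool]:
    """Return (percent of split exposed, whether the value is an upper bound).

    If the number of exposed samples is unreported, assume the whole split
    was used; the result is then an upper bound, not a measurement.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    if exposed is None:
        return 100.0, True
    if not 0 <= exposed <= split_size:
        raise ValueError("exposed must be between 0 and split_size")
    return 100.0 * exposed / split_size, False


# Hypothetical numbers, for illustration only.
print(leak_severity(split_size=1000, exposed=250))  # reported sample size
print(leak_severity(split_size=1000))               # unreported: whole split assumed
```

Keeping the upper-bound flag next to the percentage makes it easy to report measured and assumed exposure separately.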

Core Entities

Models

GPT-3.5, GPT-4, ChatGPT (combined reference to web interface models)

Metrics

counts of leaked samples, leak severity (% of split exposed), presence/absence of reproducibility artifacts

Datasets

SAMSum, MultiWOZ 2.4, SemEval-2016 Task 6, GSM8K, SQuALITY, many others (263 datasets total documented)

Benchmarks

question answering suites, natural language inference suites, NLG benchmarks, dialogue benchmarks

