Systematic review shows GPT-3.5/GPT-4 were exposed to ~4.7M benchmark examples and many evaluations are unfair or unreproducible

February 6, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.3

Cost Impact Score

0.6

Citation Count

16

Authors

Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, Ondřej Dušek

Links

Abstract / PDF

Why It Matters For Business

Benchmark contamination can make closed-source LLMs appear artificially better. Buyers and product teams should not trust out-of-the-box leaderboard claims for closed models without checking data provenance and evaluation parity.

Summary TLDR

The authors systematically reviewed 255 papers that used OpenAI's GPT-3.5/GPT-4 and found that 90 papers (≈42% of the 212 relevant papers) used the web interface in ways that could expose data to OpenAI. Across those papers they document ~4.7M leaked benchmark samples from 263 datasets. They also find common evaluation problems: missing or unfair baselines, small-sample comparisons, incomplete release of prompts and code, and infrequent reporting of the model version. The paper lists suggested practices to reduce contamination and improve fairness and reproducibility.

Problem Statement

Per OpenAI policy, data that users submit to closed-source LLMs like GPT-3.5 and GPT-4 through the web interface can be used to improve the models. Researchers often evaluate these models through that interface and may feed benchmark examples to them. This can cause indirect data leakage (user-supplied test data becoming training data), which undermines the fairness of comparisons, reproducibility, and the trustworthiness of published claims.

Main Contribution

Systematic review of 255 papers evaluating GPT-3.5/GPT-4, with 212 deemed relevant after screening.

Quantification of indirect data leakage: ~4.7M benchmark samples from 263 datasets exposed via browser-based use.

Analysis of evaluation malpractices: unfair comparisons, missing baselines, incomplete reproducibility artifacts, and poor reporting of model versions.

Practical checklist of suggested practices for evaluating closed-source LLMs and a public collaborative leak registry at https://leak-llm.github.io/.

Key Findings

Many published evaluations leaked data to OpenAI via the web interface.

Numbers: 90 papers (≈42% of relevant papers) used browser access that could be used to improve models

A large volume of benchmark data was exposed through research experiments.

Numbers: 4,714,753 samples from 263 unique datasets documented as exposed

Most leaked datasets were exposed almost entirely.

Numbers: 142 of 263 datasets (~54%) classified as high leak (>95% of split exposed)

Reproducibility artifacts are unevenly shared.

Numbers: 192/212 (~91%) papers reported prompts; 113/212 (~53%) provided working code/data repos; model version reported in 29/70

Comparisons are often missing or unfair and many evaluations are underpowered.

Numbers: ~50% of pre-prints (71/142) and ~43% of peer-reviewed papers (30/70) lacked open-model comparisons; many ChatGPT tests:

Results

Papers reviewed

Value: 255 total identified; 212 relevant after screening

Papers that leaked data via browser access

Value: 90 papers (~42% of relevant)

Leaked samples documented

Value: 4,714,753 samples

Leak severity distribution (datasets)

Value: Low: 66; Moderate-low: 47; Moderate-high: 10; High: 142

Prompts published by papers

Value: 192/212 (~91%) reported prompts

Code/data repository provided

Value: 113/212 (~53%) provided usable repos

Peer-reviewed vs pre-prints

Value: 70 peer-reviewed; 142 pre-prints among relevant papers
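
The percentages in these results can be rechecked directly from the reported counts. The short Python sketch below does that for the paper-level figures; the counts are copied from the summary, while the `share` helper and the dictionary labels are illustrative names that do not come from the paper.

```python
# Counts as reported in this summary; all shares are relative to relevant papers.
RELEVANT_PAPERS = 212
PEER_REVIEWED = 70
PREPRINTS = 142

counts = {
    "leaked data via browser access": 90,
    "reported prompts": 192,
    "provided usable repos": 113,
}


def share(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Peer-reviewed papers and pre-prints should add up to the relevant set.
assert PEER_REVIEWED + PREPRINTS == RELEVANT_PAPERS

for label, n in counts.items():
    print(f"{label}: {n}/{RELEVANT_PAPERS} = ~{share(n, RELEVANT_PAPERS)}%")
```

Running this kind of check on any summary of an evaluation is a cheap way to catch denominators that silently changed between sections.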

Who Should Care

What To Try In 7 Days

Check any closed-model evaluation you rely on: did the authors use the web interface or the API? Prefer API or business plans that opt out of training.

Re-run a small key benchmark sample on both the closed model and open alternatives using the same sampled inputs.

Publish and archive prompts, exact sample IDs, model version, and evaluation dates for your internal evaluations.
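
The last two steps can be combined in a small script: draw one fixed, seeded sample of benchmark IDs to reuse for every model, then record what was run. The following is a minimal Python sketch under stated assumptions, not a prescribed format: the function names, manifest fields, model name, version string, and sample IDs are all hypothetical placeholders.

```python
import hashlib
import json
import random


def sample_ids(all_ids, k, seed):
    """Draw k IDs reproducibly; sorting first makes the result independent of input order."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(all_ids), k))


def build_manifest(model_name, model_version, access_mode,
                   prompt_template, ids, seed, eval_date):
    """Collect the details needed to rerun or audit an evaluation later."""
    return {
        "model_name": model_name,
        "model_version": model_version,
        "access_mode": access_mode,  # e.g. "api" or "web"
        "prompt_template": prompt_template,
        "prompt_sha256": hashlib.sha256(prompt_template.encode("utf-8")).hexdigest(),
        "sample_ids": list(ids),
        "sampling_seed": seed,
        "eval_date": eval_date,
    }


# Hypothetical benchmark IDs and model details, for illustration only.
benchmark_ids = [f"test-{i:04d}" for i in range(1000)]
subset = sample_ids(benchmark_ids, k=50, seed=13)

manifest = build_manifest(
    model_name="closed-model-x",
    model_version="example-snapshot",
    access_mode="api",
    prompt_template="Summarize the dialogue:\n{dialogue}",
    ids=subset,
    seed=13,
    eval_date="2024-02-06",
)
print(json.dumps(manifest, indent=2)[:200])
```

Passing the same `subset` to the closed model and to each open alternative keeps the comparison on identical inputs, and archiving the manifest alongside the outputs preserves the prompt, sample IDs, model version, and date.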

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Survey relies on publicly reported experiments and may miss unpublished leaks.
  • In some papers split/sample details were unclear; authors made best-effort assumptions when not clarified.
  • Focus limited to OpenAI's ChatGPT/GPT-3.5/GPT-4 and to the review period ending Oct 2023.

When Not To Use

  • Do not use this paper as evidence about contamination of individual closed-model training corpora prior to Nov 2022.
  • Do not use it to infer the exact performance impact of leaks on model scores without follow-up experiments.

Failure Modes

  • Under-counting: unreported experiments could hide more leaked data.
  • Assumption errors: when the split or sample was unspecified, the authors assumed the whole split was used, which may overestimate the leaked volume.
  • Time sensitivity: vendor policies and model update schedules change, so registry can go out of date.

Core Entities

Models

  • GPT-3.5
  • GPT-4
  • ChatGPT (combined reference to web interface models)

Metrics

  • counts of leaked samples
  • leak severity (% of split exposed)
  • presence/absence of reproducibility artifacts
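
The leak severity metric can be sketched as a simple bucketing of the exposed fraction of a split. In the Python example below, only the "high" threshold (more than 95% of the split exposed) is taken from this summary; the 5% and 50% cut-offs for the other three levels are assumptions made for illustration and may not match the paper's exact bins. The function name is also hypothetical.

```python
def leak_severity(exposed: int, split_size: int) -> str:
    """Bucket a dataset split by the fraction of its samples that were exposed."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    if not 0 <= exposed <= split_size:
        raise ValueError("exposed must be between 0 and split_size")

    fraction = exposed / split_size
    if fraction > 0.95:   # threshold stated in the summary
        return "high"
    if fraction > 0.50:   # assumed cut-off
        return "moderate-high"
    if fraction > 0.05:   # assumed cut-off
        return "moderate-low"
    return "low"


# Hypothetical splits: (exposed samples, split size)
for exposed, size in [(1000, 1000), (600, 1000), (100, 1000), (10, 1000)]:
    print(exposed, size, leak_severity(exposed, size))
```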

Datasets

  • SAMSum
  • MultiWOZ 2.4
  • SemEval-2016 Task 6
  • GSM8K
  • SQuALITY
  • many others (263 datasets total documented)

Benchmarks

  • question answering suites
  • natural language inference suites
  • NLG benchmarks
  • dialogue benchmarks