Systematic review shows GPT-3.5/GPT-4 were exposed to ~4.7M benchmark examples and many evaluations are unfair or unreproducible

February 6, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

This paper is a careful, systematic literature survey with concrete counts and a public registry. It is useful for changing policy and evaluation practice, but it does not offer new modeling methods.

Citations: 16

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 30%

Authors

Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, Ondřej Dušek

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Benchmark contamination can make closed-source LLMs appear artificially better. Buyers and product teams should not trust out-of-the-box leaderboard claims for closed models without checking data provenance and evaluation parity.

Who Should Care

Summary TLDR

The authors systematically reviewed 255 papers that used OpenAI's GPT-3.5/GPT-4 and found that 90 of them (≈42% of the 212 relevant papers) used the web interface in ways that could expose data to OpenAI. Across those papers they document ~4.7M leaked benchmark samples from 263 datasets. They also find recurring evaluation problems: missing or unfair baselines, small-sample comparisons, incomplete release of prompts and code, and model versions that are rarely reported. The paper lists practical steps to reduce contamination and improve fairness and reproducibility.

Problem Statement

Under OpenAI's data policy, user data submitted to closed-source LLMs such as GPT-3.5 and GPT-4 through the web interface can be used to improve the models. Researchers often evaluate these models through that interface and may feed benchmark examples to them. This can cause indirect data leakage (user-supplied test data becoming training data), which undermines the fairness of comparisons, reproducibility, and the trustworthiness of published claims.

Main Contribution

Systematic review of 255 papers evaluating GPT-3.5/GPT-4, with 212 deemed relevant after screening.

Quantification of indirect data leakage: ~4.7M benchmark samples from 263 datasets exposed via browser-based use.

Key Findings

Many published evaluations leaked data to OpenAI via the web interface.

Numbers: 90 papers (≈42% of relevant papers) used browser access, so the data they submitted could be used to improve the models

Practical Use: Assume published GPT-3.5/GPT-4 results may be contaminated; use API/business access or opt out, and report the access method when evaluating closed models.

Evidence Ref: Abstract; Section 5.1

A large volume of benchmark data was exposed through research experiments.

Numbers: 4,714,753 samples from 263 unique datasets documented as exposed

Practical Use: Do not assume a benchmark is unseen by a closed LLM; re-run baselines on the same sampled data and avoid using the web interface for test data.

Evidence Ref: Abstract; Section 5.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Papers reviewed | 255 total identified; 212 relevant after screening | - | - | - | Sections 4 and 5 | Section 5
Papers that leaked data via browser access | 90 papers (~42% of relevant) | - | - | - | Section 5.1 | Section 5.1
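
The ~42% figure depends on which denominator is used. A minimal Python sketch of the arithmetic, using only the counts reported in this summary (the variable and function names are illustrative, not taken from the paper's code):

```python
# Counts as reported in this summary.
papers_identified = 255         # all papers found by the review
papers_relevant = 212           # papers kept after screening
papers_with_browser_leak = 90   # papers that used the web interface

def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole

leak_share_of_relevant = share(papers_with_browser_leak, papers_relevant)
leak_share_of_identified = share(papers_with_browser_leak, papers_identified)

print(f"{leak_share_of_relevant:.1f}% of relevant papers")
print(f"{leak_share_of_identified:.1f}% of all identified papers")
```

The quoted ≈42% corresponds to the first ratio (90 of 212 relevant papers); measured against all 255 identified papers the share would be lower.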

What To Try In 7 Days

Check any closed-model evaluation you rely on: did the authors use the web interface or the API? Prefer API/business plans that opt out of training.

Re-run a small key benchmark sample on both the closed model and open alternatives using the same sampled inputs.

Publish and archive prompts, exact sample IDs, model version, and evaluation dates for your internal evaluations.
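
One way to make the second and third steps concrete is to draw the benchmark sample with a fixed seed and record it in a small manifest that travels with the results. The sketch below is an assumption-laden illustration, not the paper's tooling: the function names, the manifest fields, the ID format, and the placeholder model version are all hypothetical.

```python
import hashlib
import json
import random

def pick_sample_ids(all_ids, k, seed):
    """Draw k IDs reproducibly; sorting first makes the draw independent of input order."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(all_ids), k))

def build_manifest(dataset, split, sample_ids, prompt_template,
                   model_version, access_method, eval_date):
    """Collect the details needed to repeat the run (hypothetical schema)."""
    return {
        "dataset": dataset,
        "split": split,
        "sample_ids": list(sample_ids),
        "prompt_template": prompt_template,
        # A hash lets readers verify they use the exact same prompt text.
        "prompt_sha256": hashlib.sha256(prompt_template.encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "access_method": access_method,  # e.g. "api" or "web"
        "eval_date": eval_date,
    }

# Hypothetical example values; replace with your own benchmark IDs and model details.
all_ids = [f"test-{i:04d}" for i in range(200)]
sample_ids = pick_sample_ids(all_ids, k=20, seed=13)
manifest = build_manifest(
    dataset="GSM8K",
    split="test",
    sample_ids=sample_ids,
    prompt_template="Q: {question}\nA:",
    model_version="MODEL-VERSION-PLACEHOLDER",
    access_method="api",
    eval_date="2024-02-06",
)
print(json.dumps(manifest, indent=2))
```

Feeding the same `sample_ids` to both the closed model and the open alternatives keeps the comparison on identical inputs, and archiving the manifest alongside the scores records the access method and model version that the review found are rarely reported.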

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Survey relies on publicly reported experiments and may miss unpublished leaks.

In some papers, the split or sample details were unclear; the survey authors made best-effort assumptions where these were not clarified.

When Not To Use

Do not use this paper as evidence about contamination of individual closed-model training corpora prior to Nov 2022.

Do not use it to infer the exact performance impact of leaks on model scores without follow-up experiments.

Failure Modes

Under-counting: unreported experiments could hide more leaked data.

Assumption errors: when a paper did not specify the split or sample, the whole split was assumed to have been used, which may overestimate the leaked volume.
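
To see how the whole-split assumption can inflate the estimate, here is a minimal Python sketch of a leak-severity calculation (percentage of a split exposed). It is an illustration under stated assumptions, not the authors' implementation: the function name, its signature, and the example numbers are hypothetical.

```python
from typing import Optional, Tuple

def leak_severity(split_size: int, samples_used: Optional[int]) -> Tuple[float, bool]:
    """Return (% of the split exposed, whether the whole-split assumption was applied).

    samples_used=None models a paper that did not report how many samples it used.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    assumed_whole_split = samples_used is None
    if assumed_whole_split:
        used = split_size  # fallback: treat the entire split as exposed
    else:
        used = min(samples_used, split_size)  # cannot expose more than the split holds
    return 100.0 * used / split_size, assumed_whole_split

# Hypothetical numbers: a 1,000-example test split.
reported = leak_severity(split_size=1000, samples_used=100)
unreported = leak_severity(split_size=1000, samples_used=None)
print(reported)    # severity when the paper states its sample size
print(unreported)  # severity under the whole-split fallback
```

Returning the flag next to the percentage keeps assumed values distinguishable from reported ones, so totals can be recomputed with and without the fallback.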

Core Entities

Models

GPT-3.5, GPT-4, ChatGPT (combined reference to web interface models)

Metrics

counts of leaked samples, leak severity (% of split exposed), presence/absence of reproducibility artifacts

Datasets

SAMSum, MultiWOZ 2.4, SemEval-2016 Task 6, GSM8K, SQuALITY, many others (263 datasets total documented)

Benchmarks

question answering suites, natural language inference suites, NLG benchmarks, dialogue benchmarks