Measure and report when LLMs have seen benchmark data to avoid invalid NLP claims

October 27, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

The paper is a high-value position piece with concrete proposals (overlap and extractability) but mostly conceptual evidence; implementations and community tooling are still needed before production use.

Citations: 6

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 0/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/2

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 50%

Novelty: 60%

Authors

Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, Eneko Agirre

Links

Abstract / PDF

Why It Matters For Business

If model evaluation is contaminated, product decisions and vendor comparisons can be wrong; verify exposure to benchmarks before basing choices on published scores.

Summary TLDR

This position paper argues that benchmark data contamination—when a model has seen test data during its training—threatens NLP evaluation. The authors define three contamination types (guideline, raw text, annotation), show contamination can occur at pretraining, fine-tuning and post-deployment steps, and propose practical detection measures: overlap search for open models and memorization/extractability tests for closed models. They call for a community registry, tooling, and review-time checks to flag compromised results.

Problem Statement

When a model has been trained on a benchmark's test data, reported performance is inflated and scientific claims can be wrong. Data exposure can come from many sources and is hard to detect, especially for closed models, so routine evaluations may be unreliable.

Main Contribution

Clarifies three contamination types: guideline, raw text, annotation.

Maps where contamination can occur: pretraining, supervised fine-tuning, post-deployment.

Key Findings

Contamination inflates evaluated model performance and can lead to wrong scientific conclusions.

Practical Use: Always check whether a model was exposed to a benchmark before taking results at face value; if exposed, treat reported scores as upper bounds.

Evidence Ref: Intro and Section 1

There are three distinct contamination types: guideline, raw text, and annotation.

Practical Use: Audit different exposure types separately: check for leaked annotation rules, original source text (e.g., Wikipedia), and label exposure.

Evidence Ref: Section 3

Results

Metric: verbatim regeneration of benchmark examples
Value: CoNLL-2003 first lines reproduced verbatim by multiple models
Baseline: not reported
Delta: not reported
Split / Dataset: CoNLL2003 train split (examples shown)
Evidence: Appendix A shows ChatGPT, WizardCoder and Copilot generating CoNLL-2003 lines
Evidence Ref: Appendix A, Figures 1–3

Metric: recommended contamination measures
Value: benchmark data overlap for open models; extractability ratio for closed models
Baseline: not reported
Delta: not reported
Split / Dataset: not reported
Evidence: Sections 5.1 and 5.2 define these metrics and their use
Evidence Ref: Sections 5.1–5.2
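
For open models, the benchmark data overlap measure could be approximated along the following lines. This is a minimal sketch, not the paper's definition: it assumes overlap means "percentage of benchmark examples that share at least one word n-gram with the training corpus", and the function names, the whitespace tokenizer, and the default n = 8 are all illustrative choices. Examples shorter than n tokens produce no n-grams and are counted as non-overlapping.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def benchmark_overlap(benchmark: Iterable[str], corpus: Iterable[str], n: int = 8) -> float:
    """Percentage of benchmark examples sharing at least one n-gram with the corpus.

    Assumption: this is one plausible reading of "percentage overlap";
    the paper may define the measure differently.
    """
    # Index every n-gram that appears anywhere in the corpus.
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus:
        corpus_ngrams |= ngrams(doc, n)

    examples = list(benchmark)
    if not examples:
        return 0.0

    # An example counts as overlapping if any of its n-grams is indexed.
    hits = sum(1 for ex in examples if ngrams(ex, n) & corpus_ngrams)
    return 100.0 * hits / len(examples)


if __name__ == "__main__":
    # Toy strings, not real benchmark data.
    toy_benchmark = [
        "the quick brown fox jumps over the lazy dog",
        "an entirely different sentence with no shared words here",
    ]
    toy_corpus = ["yesterday the quick brown fox jumps over the lazy dog again"]
    print(benchmark_overlap(toy_benchmark, toy_corpus, n=4))
```

Building an in-memory set only works for small corpora; at pretraining scale a search index such as the tools named under "What To Try In 7 Days" would take its place.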

What To Try In 7 Days

Run quick memorization prompts on closed models for key benchmarks (extractability test).

Search open training corpora for benchmark examples using ROOTS or Data Portraits when available.

Add a contamination check to model evaluation steps and document results in reports or PRDs.
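
The memorization prompt in the first step can be scripted roughly as follows. This is a sketch under stated assumptions: `generate` is a hypothetical callable that wraps whatever model API is in use and returns the model's continuation of a prompt, and "extracted" is taken to mean that the continuation begins with the held-back part of the example, token for token. The paper's exact extractability definition may differ.

```python
from typing import Callable, Sequence


def extractability(
    examples: Sequence[str],
    generate: Callable[[str], str],
    prefix_fraction: float = 0.5,
) -> float:
    """Fraction of tested examples whose held-back suffix the model reproduces verbatim.

    `generate` is a hypothetical wrapper around a model: prompt in, continuation out.
    """
    extracted = 0
    tested = 0
    for example in examples:
        tokens = example.split()
        cut = int(len(tokens) * prefix_fraction)
        if cut == 0 or cut == len(tokens):
            continue  # too short to split into a prefix and a suffix

        prefix = " ".join(tokens[:cut])
        suffix_tokens = tokens[cut:]

        # Compare whole tokens so that "dogs" does not match "dog".
        completion_tokens = generate(prefix).split()
        tested += 1
        if completion_tokens[: len(suffix_tokens)] == suffix_tokens:
            extracted += 1

    return extracted / tested if tested else 0.0


if __name__ == "__main__":
    # Stand-in "model" that has memorized exactly one toy example.
    memory = {"alpha beta": "gamma delta"}

    def fake_model(prompt: str) -> str:
        return memory.get(prompt, "no idea")

    print(extractability(["alpha beta gamma delta", "one two three four"], fake_model))
```

A score of zero from a check like this does not show that the model is clean, since a model can be trained on data without reproducing it.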

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Position paper: proposes ideas but provides limited systematic measurement results.

Detecting contamination in closed models remains manual and is currently hard to scale.

When Not To Use

Do not rely solely on memorization negative results to prove non-contamination.

Avoid treating overlap/extractability measures as definitive without reporting methodology details.
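
One way to keep methodology details attached to a score is to store them in the same record. The structure below is purely illustrative: the class name, field names, and example values are assumptions, not a format proposed by the paper.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class ContaminationReport:
    """Hypothetical record pairing a contamination score with how it was measured."""

    model: str
    benchmark: str
    split: str
    method: str  # e.g. "overlap" or "extractability"
    score: float
    parameters: Dict[str, object] = field(default_factory=dict)  # n-gram size, prompt template, ...
    examples_tested: Optional[int] = None
    notes: str = ""

    def to_json(self) -> str:
        """Serialize the report so it can be attached to an evaluation write-up."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


if __name__ == "__main__":
    # Placeholder values for illustration only; not measured results.
    report = ContaminationReport(
        model="example-model",
        benchmark="example-benchmark",
        split="test",
        method="overlap",
        score=12.5,
        parameters={"ngram_size": 8, "tokenizer": "whitespace"},
        examples_tested=200,
        notes="Negative results do not prove non-contamination.",
    )
    print(report.to_json())
```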

Failure Modes

False negatives: model was trained on data but does not memorize or reproduce it.

False positives: model reproduces text from mirrors or unrelated web copies without original benchmark exposure.

Core Entities

Models

GPT-3, GPT-4, ChatGPT, LLaMA, LLaMA 2, WizardCoder, BLOOM, Codex, GitHub Copilot

Metrics

extractability (memorization fraction), benchmark data overlap (percentage overlap)

Datasets

CoNLL2003, GSM8K, MATH, BIG-bench, GLUE, SuperGLUE, XNLI, MultiCoNER2, IMDB, CNN/DailyMail, C4

Benchmarks

CoNLL2003, GSM8K, MATH, BIG-bench, GLUE, SuperGLUE, XNLI