Measure and report when LLMs have seen benchmark data to avoid invalid NLP claims

October 27, 2023 · 6 min

Overview

Production Readiness: 0.5

Novelty Score: 0.6

Cost Impact Score: 0.7

Citation Count: 6

Authors

Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, Eneko Agirre

Why It Matters For Business

If model evaluation is contaminated, product decisions and vendor comparisons can be wrong; verify exposure to benchmarks before basing choices on published scores.

Summary TLDR

This position paper argues that benchmark data contamination, where a model has seen test data during training, threatens NLP evaluation. The authors define three contamination types (guideline, raw text, annotation), show that contamination can occur at the pretraining, fine-tuning, and post-deployment stages, and propose practical detection measures: overlap search for open models and memorization/extractability tests for closed models. They call for a community registry, tooling, and review-time checks to flag compromised results.

Problem Statement

When a model has been trained on a benchmark's test data, reported performance is inflated and scientific claims can be wrong. Data exposure can come from many sources and is hard to detect, especially for closed models, so routine evaluations may be unreliable.

Main Contribution

Clarifies three contamination types: guideline, raw text, annotation.

Maps where contamination can occur: pretraining, supervised fine-tuning, post-deployment.

Proposes measurable signals: benchmark-data overlap for open models and extractability/memorization tests for closed models.

Calls for a public registry of contamination cases and changes in peer review and reporting.

Key Findings

Contamination inflates evaluated model performance and can lead to wrong scientific conclusions.

There are three distinct contamination types: guideline, raw text, and annotation.

For closed models, contamination can be measured by extractability (memorization): the fraction of benchmark examples a model reproduces when prompted.
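The extractability measure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the prefix/suffix split, the `prefix_fraction` parameter, the exact token-match criterion, and the `generate` callable (a stand-in for whatever API the closed model exposes) are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def extractability(
    examples: Sequence[str],
    generate: Callable[[str], str],
    prefix_fraction: float = 0.5,
) -> float:
    """Fraction of examples whose held-back suffix the model reproduces verbatim.

    `generate` is a hypothetical wrapper around the model under test:
    it takes a prompt string and returns the model's continuation.
    """
    if not examples:
        return 0.0
    reproduced = 0
    for text in examples:
        tokens = text.split()
        # Show the model the first part of the example, hold back the rest.
        cut = max(1, int(len(tokens) * prefix_fraction))
        prefix = " ".join(tokens[:cut])
        suffix_tokens = tokens[cut:]
        if not suffix_tokens:
            continue  # nothing to reproduce; counted as not reproduced
        completion_tokens = generate(prefix).split()
        # Assumed criterion: the continuation starts with the exact suffix.
        if completion_tokens[: len(suffix_tokens)] == suffix_tokens:
            reproduced += 1
    return reproduced / len(examples)


def make_stub_model(memorized: List[str]) -> Callable[[str], str]:
    """Toy stand-in for a real model: completes only texts it has 'memorized'."""
    def generate(prompt: str) -> str:
        for text in memorized:
            if text.startswith(prompt):
                return text[len(prompt):]
        return "no idea"
    return generate
```

With the stub, a model that has memorized one of two toy examples scores 0.5. As the failure modes listed further down note, a low score from a test like this does not show the model is uncontaminated.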

Results

verbatim regeneration of benchmark examples

Value: CoNLL-2003 first lines reproduced verbatim by multiple models

recommended contamination measures

Value: benchmark data overlap for open models; extractability ratio for closed models
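For open models, one simple way to approximate benchmark data overlap is to count how many benchmark examples share at least one word n-gram with the training corpus. The sketch below is an assumption-laden illustration: the n-gram length, lowercased whitespace tokenization, and the "any shared n-gram" criterion are choices made for the example, not definitions taken from the paper, and a real training corpus would need an index rather than an in-memory set.

```python
from typing import Iterable, Sequence, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All lowercase word n-grams in `text` (empty if text has fewer than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_percentage(
    benchmark: Sequence[str],
    corpus: Iterable[str],
    n: int = 8,
) -> float:
    """Percentage of benchmark examples sharing at least one n-gram with the corpus."""
    if not benchmark:
        return 0.0
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for document in corpus:
        corpus_ngrams |= ngrams(document, n)
    # Examples shorter than n words have no n-grams and never count as hits.
    hits = sum(1 for example in benchmark if ngrams(example, n) & corpus_ngrams)
    return 100.0 * hits / len(benchmark)
```

For example, `overlap_percentage(["a b c d e", "x y z w v"], ["foo a b c d e bar"], n=3)` returns `50.0`, because only the first toy example appears in the toy corpus.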

Who Should Care

What To Try In 7 Days

Run quick memorization prompts on closed models for key benchmarks (extractability test).

Search open training corpora for benchmark examples using ROOTS or Data Portraits when available.

Add a contamination check to model evaluation steps and document results in reports or PRDs.
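To make the last step concrete, a contamination check can be stored as a small structured record next to each evaluation result. The following is a hypothetical sketch: the field names, the status labels, and the `flag_threshold` parameter are illustrative assumptions, and the paper does not prescribe a threshold or a record format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ContaminationCheck:
    """One contamination check for a (model, benchmark split) pair."""
    model: str
    benchmark: str
    split: str
    method: str             # e.g. "overlap" or "extractability"
    score: Optional[float]  # None when the check could not be run
    methodology: str        # how the score was obtained (prompts, n-gram size, ...)


def to_report_line(check: ContaminationCheck, flag_threshold: float) -> str:
    """Serialize a check as one JSON line with a coarse status field.

    `flag_threshold` is a hypothetical, team-chosen cut-off.
    """
    record = asdict(check)
    if check.score is None:
        record["status"] = "not_checked"
    elif check.score >= flag_threshold:
        record["status"] = "flagged"
    else:
        # A low score is not proof of non-contamination, only absence of evidence.
        record["status"] = "no_evidence_found"
    return json.dumps(record, sort_keys=True)
```

Keeping the `methodology` field mandatory means the score is never reported without the details needed to interpret it.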

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Position paper: proposes ideas but provides limited systematic measurement results.
  • Detecting contamination in closed models remains manual and is currently hard to scale.
  • Registry and tooling require community coordination and sustained effort.

When Not To Use

  • Do not rely solely on negative memorization results to prove non-contamination.
  • Avoid treating overlap/extractability measures as definitive without reporting methodology details.

Failure Modes

  • False negatives: model was trained on data but does not memorize or reproduce it.
  • False positives: model reproduces text from mirrors or unrelated web copies without original benchmark exposure.
  • Incomplete evidence: partial overlap may not indicate full contamination of test splits.

Core Entities

Models

  • GPT-3
  • GPT-4
  • ChatGPT
  • LLaMA
  • LLaMA 2
  • WizardCoder
  • BLOOM
  • Codex
  • GitHub Copilot

Metrics

  • extractability (memorization fraction)
  • benchmark data overlap (percentage overlap)

Datasets

  • CoNLL-2003
  • GSM8K
  • MATH
  • BIG-bench
  • GLUE
  • SuperGLUE
  • XNLI
  • MultiCoNER2
  • IMDB
  • CNN/DailyMail
  • C4

Benchmarks

  • CoNLL-2003
  • GSM8K
  • MATH
  • BIG-bench
  • GLUE
  • SuperGLUE
  • XNLI