Benchmark leakage can make small LLMs look much stronger — avoid training on test or prompt data

November 3, 2023 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

16

Authors

Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han

Why It Matters For Business

Contaminated training data can make models look better on paper while performing worse on real tasks; check for overlap and report contamination to avoid bad product decisions.

Summary TLDR

The paper shows that if benchmark data (training sets, test prompts, or test examples) leaks into model training, reported benchmark scores can jump dramatically without real capability gains. In controlled experiments, continuing to train small LLMs on leaked benchmark data raised accuracy by tens of points on many tasks and let 1–3B models beat much larger models on benchmarks. Leakage also harms unrelated tasks (summarization, code) and reduces gains from later instruction tuning. The authors recommend systematic contamination checks (e.g., 13-gram overlap), publishing overlap reports, and averaging results across multiple prompts.

Problem Statement

Public benchmarks are used to claim LLM progress, but pretraining or fine-tuning can accidentally include benchmark data. That contamination inflates scores, breaks zero/few-shot assumptions, misleads comparisons and leaderboards, and can harm real-world performance when models are later adapted.

Main Contribution

Defined three practical leakage modes: training-set leakage, test-prompt leakage, and full test-set+prompt leakage.

Empirical study: continually training 1.3B–7B models on leaked benchmark data and measuring effects across MMLU, QA, reasoning, reading comprehension, summarization, and code tasks.

Observed side effects: inflated benchmark scores, worse performance on unrelated tasks, and reduced adaptation benefits from instruction tuning.

Practical checklist of recommendations for model developers and benchmark maintainers (decontaminate training data, publish contamination and overlap reports, evaluate with multiple prompts).

Key Findings

Leaking training & test data greatly inflates benchmark scores.

Numbers: phi-1.5 MMLU: 42.87 -> 75.05 after full leak (Table 1)

Even small models can outperform much larger ones when leaked data is used.

Numbers: OpenLLaMA-3B MMLU 26.49 -> 87.31 with full leak; LLaMA-65B baseline >50 (Table 1)

Leakage can reduce performance on unrelated real tasks.

Numbers: XSum ROUGE-L: OpenLLaMA-3B 8.31 -> 0.19 after leak; HumanEval pass@10: LLaMA-2 (7B) 26.83 -> 8.54 (Table 3)

Leakage weakens later adaptation gains from instruction tuning.

Numbers: Instruction tuning yields ~80% of the HumanEval improvement when the model was first trained on leaked data (Table 4)

Leaking test prompts alongside training sets provides a big advantage, even without the test examples themselves.

Numbers: phi-1.5 +All Train S+Test P outperforms LLaMA-65B on RACE-M (55.80 vs 53.00) (Table 2)

Results

Accuracy

Value: phi-1.5: 42.87 -> 75.05 (None -> All Train+Test P&S)

Baseline: 42.87 (phi-1.5 none)

Accuracy

Value: LLaMA-2 (7B): 42.95 -> 96.34 (None -> All Train+Test P&S)

Baseline: 42.95 (LLaMA-2 none)

XSum (ROUGE-L)

Value: OpenLLaMA-3B: 8.31 -> 0.19 (None -> Leak)

Baseline: 8.31 (OpenLLaMA none)

HumanEval (pass@10)

Value: LLaMA-2 (7B): 26.83 -> 8.54 (None -> Leak)

Baseline: 26.83 (LLaMA-2 none)

Accuracy

Value: phi-1.5 (1.3B) RACE-M: 41.71 -> 79.28 (None -> All Train+Test P&S)

Baseline: 41.71 (phi-1.5 none)
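To make the size of these shifts explicit, the before/after pairs listed above can be turned into absolute point changes. The snippet below is only a small arithmetic sketch: the numbers are copied from the Results entries, while the row labels and the `deltas` helper are illustrative names, not anything defined by the paper.

```python
# (label, score without leakage, score after leakage), copied from the Results above.
# Labels are illustrative; the benchmark for the LLaMA-2 accuracy row is not named there.
reported = [
    ("phi-1.5 accuracy", 42.87, 75.05),
    ("LLaMA-2 (7B) accuracy", 42.95, 96.34),
    ("OpenLLaMA-3B XSum ROUGE-L", 8.31, 0.19),
    ("LLaMA-2 (7B) HumanEval pass@10", 26.83, 8.54),
    ("phi-1.5 RACE-M accuracy", 41.71, 79.28),
]


def deltas(rows):
    """Return the absolute change (after - before) per row, rounded to 2 decimals."""
    return {name: round(after - before, 2) for name, before, after in rows}


if __name__ == "__main__":
    for name, change in deltas(reported).items():
        # Positive values are inflated benchmark scores; negative values are
        # the drops on unrelated tasks.
        print(f"{name}: {change:+.2f} points")
```

The leaked benchmarks gain roughly 32 to 53 points, while the unrelated XSum and HumanEval scores fall by about 8 and 18 points respectively.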

Who Should Care

What To Try In 7 Days

Run a 13-gram overlap check between your pretraining/finetuning data and common benchmarks.

Require contamination-overlap statistics as part of model evaluation reports.

When comparing models, evaluate on at least three diverse benchmarks including generation and code tasks.
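The first item above can be prototyped in a few lines. The following is a minimal sketch, assuming lowercase whitespace tokenization and exact 13-gram matching; the function names (`tokenize`, `build_index`, `contamination_rate`) are hypothetical, and a real pipeline would likely need stronger text normalization and a more memory-efficient index for large corpora.

```python
from typing import Iterable, List, Set, Tuple

NGRAM_SIZE = 13  # n-gram length named in the recommendation above

NGram = Tuple[str, ...]


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase whitespace tokenization; real pipelines may
    # also strip punctuation or use the model's own tokenizer.
    return text.lower().split()


def ngrams(tokens: List[str], n: int = NGRAM_SIZE) -> Set[NGram]:
    # All contiguous n-token windows; empty if the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(training_docs: Iterable[str], n: int = NGRAM_SIZE) -> Set[NGram]:
    # Collect every n-gram seen anywhere in the training data.
    index: Set[NGram] = set()
    for doc in training_docs:
        index |= ngrams(tokenize(doc), n)
    return index


def contamination_rate(benchmark_examples: List[str],
                       index: Set[NGram],
                       n: int = NGRAM_SIZE) -> float:
    # Fraction of benchmark examples sharing at least one n-gram with training data.
    if not benchmark_examples:
        return 0.0
    flagged = sum(1 for ex in benchmark_examples if ngrams(tokenize(ex), n) & index)
    return flagged / len(benchmark_examples)
```

Running `contamination_rate` per benchmark and publishing the resulting fractions is one simple way to produce the overlap statistics the second item asks for. Note that examples shorter than 13 tokens can never be flagged by this check, so short-answer benchmarks may need a smaller `n` or a different matching rule.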

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Experiments use continual training on benchmarks rather than injecting contamination into full pretraining; full-pretraining effects may differ.
  • Did not explore partial-leakage fractions or label-less leaks; leakage proportion effects are untested.
  • No systematic measurement of contamination degrees between mainstream pretraining corpora and benchmarks.

When Not To Use

  • Do not use benchmark scores to claim general LLM improvements when the overlap between training data and the evaluated benchmarks is unknown.
  • Do not use benchmark scores as the sole evidence of model capability without cross-task validation.

Failure Modes

  • Undetected data contamination leads to inflated benchmark scores and misleading model selection.
  • Overfitting to benchmark style reduces performance on unrelated tasks and harms adaptation.
  • Prompt-sensitive evaluations can be gamed if prompts leak into training.
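One way to reduce exposure to the prompt-leak failure mode is the summary's recommendation to average results across multiple prompts. The sketch below is illustrative only: `model` stands for any callable that maps a prompt string to an answer string, the `{question}` placeholder and exact-match scoring are assumptions, and none of these names come from the paper.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (question, gold answer)


def accuracy_for_prompt(model: Callable[[str], str],
                        template: str,
                        examples: List[Example]) -> float:
    # Exact-match accuracy for one prompt template (assumed scoring rule).
    correct = sum(
        1 for question, gold in examples
        if model(template.format(question=question)).strip() == gold
    )
    return correct / len(examples)


def evaluate_across_prompts(model: Callable[[str], str],
                            templates: List[str],
                            examples: List[Example]) -> Dict[str, float]:
    # Report the mean and the spread, so a score that depends on one
    # specific (possibly leaked) prompt shows up as high variance.
    scores = [accuracy_for_prompt(model, t, examples) for t in templates]
    return {
        "mean": mean(scores),
        "std": pstdev(scores),
        "min": min(scores),
        "max": max(scores),
    }
```

A large gap between `max` and `min` suggests the model is tuned to a particular prompt wording rather than to the task, which is a signal worth investigating before trusting the headline number.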

Core Entities

Models

  • GPT-Neo-1.3B
  • phi-1.5 (1.3B)
  • OpenLLaMA-3B
  • LLaMA-2-7B
  • LLaMA-13B
  • LLaMA-30B
  • LLaMA-65B

Metrics

  • Accuracy
  • ROUGE-L
  • pass@10
  • zero-shot
  • few-shot

Datasets

  • MMLU
  • BoolQ
  • PIQA
  • HellaSwag
  • WinoGrande
  • ARC-Easy
  • ARC-Challenge
  • OpenBookQA
  • CommonsenseQA
  • GSM8k
  • AQuA
  • RACE-Middle
  • RACE-High
  • CoQA
  • CMRC2018
  • C3-Dialog
  • LAMBADA
  • XSum
  • HumanEval

Benchmarks

  • MMLU
  • Big-Bench
  • AGIEval
  • OpenCompass
  • C-Eval