Benchmark leakage can make small LLMs look much stronger — avoid training on test or prompt data

Overview

Decision Snapshot: Ready For Pilot

The paper provides controlled continual-training experiments across multiple models and benchmarks showing consistent inflation and side effects; results are strong for the studied settings but do not include full pretraining contamination scenarios.

Citations: 16

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 40%

Authors

Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han

Links

Abstract / PDF

Why It Matters For Business

Contaminated training data can make models look better on paper but worse in real tasks; check overlap and report contamination to avoid bad product decisions.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist, Engineering Lead

Summary TLDR

The paper shows that if evaluation data (training sets, test prompts, or test examples) leaks into model training, reported benchmark scores can jump dramatically without real capability gains. In controlled experiments, repeatedly training small LLMs on leaked benchmark data raised accuracy by tens of points on many tasks and let 1–3B models beat much larger models on benchmarks. Leakage also harms unrelated tasks (summarization, code) and reduces gains from later instruction tuning. The authors recommend systematic contamination checks (e.g., 13-gram overlap), publishing overlap reports, and averaging results across multiple prompts.
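The recommendation to average results across multiple prompts can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names `accuracy` and `summarize_across_prompts`, the prompt identifiers, and the exact-match scoring are all assumptions made for the example.

```python
from statistics import mean, pstdev
from typing import Dict, List


def accuracy(predictions: List[str], labels: List[str]) -> float:
    """Exact-match accuracy for one prompt template (assumed scoring rule)."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def summarize_across_prompts(
    preds_by_prompt: Dict[str, List[str]], labels: List[str]
) -> Dict[str, float]:
    """Report mean, spread, and range of accuracy over several prompt templates."""
    if not preds_by_prompt:
        raise ValueError("need predictions for at least one prompt")
    scores = [accuracy(preds, labels) for preds in preds_by_prompt.values()]
    return {
        "mean": mean(scores),
        "std": pstdev(scores),  # population standard deviation across prompts
        "min": min(scores),
        "max": max(scores),
    }


if __name__ == "__main__":
    # Hypothetical predictions from three prompt templates on four questions.
    gold = ["A", "B", "C", "D"]
    preds = {
        "prompt_1": ["A", "B", "C", "D"],
        "prompt_2": ["A", "B", "C", "A"],
        "prompt_3": ["A", "B", "A", "A"],
    }
    print(summarize_across_prompts(preds, gold))
```

A large gap between `min` and `max` suggests the score depends heavily on one prompt wording, which is one symptom a leaked test prompt could produce.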

Problem Statement

Public benchmarks are used to claim LLM progress, but pretraining or fine-tuning can accidentally include benchmark data. That contamination inflates scores, breaks zero/few-shot assumptions, misleads comparisons and leaderboards, and can harm real-world performance when models are later adapted.

Main Contribution

Defined three practical leakage modes: training-set leakage, test-prompt leakage, and full test-set+prompt leakage.

Empirical study: continually training 1.3B–7B models on leaked benchmark data and measuring effects across MMLU, QA, reasoning, reading comprehension, summarization, and code tasks.

Key Findings

Leaking training & test data greatly inflates benchmark scores.

Numbers: phi-1.5 MMLU: 42.87 -> 75.05 after full leak (Table 1)

Practical Use: Do not include benchmark train/test/prompts in training data; otherwise reported gains can be artefacts.

Evidence Ref: Table 1

Even small models can outperform much larger ones when leaked data is used.

Numbers: OpenLLaMA-3B MMLU 26.49 -> 87.31 with full leak; LLaMA-65B baseline >50 (Table 1)

Practical Use: Leaderboard rank can be meaningless if some submissions used leaked data; require contamination reports.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	phi-1.5: 42.87 -> 75.05 (None -> All Train+Test P&S)	42.87 (phi-1.5 none)	+32.18	MMLU	Table 1 row for phi-1.5	Table 1
Accuracy	LLaMA-2 (7B): 42.95 -> 96.34 (None -> All Train+Test P&S)	42.95 (LLaMA-2 none)	+53.39	MMLU	Table 1 row for LLaMA-2 7B	Table 1

What To Try In 7 Days

Run a 13-gram overlap check between your pretraining/finetuning data and common benchmarks.

Require contamination-overlap statistics as part of model evaluation reports.

When comparing models, evaluate on at least three diverse benchmarks including generation and code tasks.
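The 13-gram overlap check in the first item could look roughly like the sketch below. It assumes lowercased, whitespace-split word tokens and in-memory data; the function names `ngrams` and `contaminated_fraction` are hypothetical, and a production check would likely need normalization of punctuation and a more memory-efficient index.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Word-level n-grams of `text` (lowercased, whitespace-tokenized).

    Texts shorter than n tokens yield an empty set.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(
    train_docs: Iterable[str], benchmark_examples: Iterable[str], n: int = 13
) -> float:
    """Fraction of benchmark examples sharing at least one n-gram with training docs."""
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)

    examples = list(benchmark_examples)
    if not examples:
        return 0.0
    hits = sum(1 for ex in examples if ngrams(ex, n) & train_grams)
    return hits / len(examples)


if __name__ == "__main__":
    # Toy data: the first benchmark item appears verbatim inside a training document.
    question = (
        "which of the following is the best description of the term "
        "data contamination here"
    )
    train = ["some preamble " + question + " and more text"]
    bench = [
        question,
        "a completely different question about photosynthesis in plants that "
        "shares no long span with training text at all",
    ]
    print(f"contaminated fraction: {contaminated_fraction(train, bench):.2f}")
```

The resulting fraction, reported per benchmark, is one simple form the contamination-overlap statistics in the second item could take.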

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Experiments use continual training on benchmarks rather than injecting contamination into full pretraining; full-pretraining effects may differ.

Did not explore partial-leakage fractions or label-less leaks; leakage proportion effects are untested.

When Not To Use

To claim general LLM improvements when the overlap between training data and the evaluated benchmarks is unknown.

As the sole evidence of model capability without cross-task validation.

Failure Modes

Undetected data contamination leads to inflated benchmark scores and misleading model selection.

Overfitting to benchmark style reduces performance on unrelated tasks and harms adaptation.

Core Entities

Models

GPT-Neo-1.3B, phi-1.5 (1.3B), OpenLLaMA-3B, LLaMA-2-7B, LLaMA-13B, LLaMA-30B, LLaMA-65B

Metrics

Accuracy, ROUGE-L, pass@10, zero-shot, few-shot

Datasets

MMLU, BoolQ, PIQA, HellaSwag, WinoGrande, ARC-Easy, ARC-Challenge, OpenBookQA, CommonsenseQA, GSM8k, AQuA, RACE-Middle, RACE-High, CoQA, CMRC2018, C3-Dialog, LAMBADA, XSum, HumanEval

Benchmarks

MMLU, Big-Bench, AGIEval, OpenCompass, C-Eval
