Benchmark leakage can make small LLMs look much stronger — avoid training on test or prompt data

November 3, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper provides controlled continual-training experiments across multiple models and benchmarks showing consistent inflation and side effects; results are strong for the studied settings but do not include full pretraining contamination scenarios.

Citations: 16

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 40%

Authors

Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han

Why It Matters For Business

Contaminated training data can make models look better on paper but worse in real tasks; check overlap and report contamination to avoid bad product decisions.

Summary TLDR

The paper shows that if evaluation data (training sets, test prompts, or test examples) leaks into model training, reported benchmark scores can jump dramatically without real capability gains. In controlled experiments, repeatedly training small LLMs on leaked benchmark data raised accuracy by tens of points on many tasks and let 1–3B models beat much larger models on benchmarks. Leakage also harms unrelated tasks (summarization, code) and reduces gains from later instruction tuning. The authors recommend systematic contamination checks (e.g., 13-gram overlap), publishing overlap reports, and averaging results across multiple prompts.
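The recommendation to average results across multiple prompts can be sketched as follows. This is a minimal illustration, not the authors' evaluation harness: the function name `accuracy_across_prompts`, the `predict` callable, the `{question}` placeholder, and exact-match scoring are all assumptions made for the example.

```python
from __future__ import annotations

from statistics import mean
from typing import Callable, Sequence


def accuracy_across_prompts(
    predict: Callable[[str], str],
    templates: Sequence[str],
    examples: Sequence[tuple[str, str]],
) -> dict[str, float]:
    """Score the same examples under several prompt templates.

    `predict` is a hypothetical stand-in for a model call: it takes a
    fully formatted prompt and returns the model's answer as a string.
    Each template must contain a `{question}` placeholder (an assumption
    of this sketch). Scoring is exact match after stripping whitespace.
    """
    per_template = []
    for template in templates:
        correct = sum(
            predict(template.format(question=question)).strip() == answer
            for question, answer in examples
        )
        per_template.append(correct / len(examples))
    return {
        "mean": mean(per_template),
        "min": min(per_template),
        "max": max(per_template),
        "spread": max(per_template) - min(per_template),
    }
```

Reporting the mean together with the minimum and maximum makes prompt sensitivity visible: a large spread may indicate that a score depends on one particular prompt format, which is one symptom the paper associates with test-prompt leakage.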

Problem Statement

Public benchmarks are used to claim LLM progress, but pretraining or fine-tuning can accidentally include benchmark data. That contamination inflates scores, breaks zero/few-shot assumptions, misleads comparisons and leaderboards, and can harm real-world performance when models are later adapted.

Main Contribution

Defined three practical leakage modes: training-set leakage, test-prompt leakage, and full test-set+prompt leakage.

Empirical study: continually training 1.3B–7B models on leaked benchmark data and measuring effects across MMLU, QA, reasoning, reading comprehension, summarization, and code tasks.

Key Findings

Leaking training & test data greatly inflates benchmark scores.

Numbers: phi-1.5 MMLU: 42.87 -> 75.05 after full leak (Table 1)

Practical Use: Do not include benchmark train/test/prompts in training data; otherwise reported gains can be artefacts.

Evidence Ref: Table 1

Even small models can outperform much larger ones when leaked data is used.

Numbers: OpenLLaMA-3B MMLU 26.49 -> 87.31 with full leak; LLaMA-65B baseline >50 (Table 1)

Practical Use: Leaderboard rank can be meaningless if some submissions used leaked data; require contamination reports.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | phi-1.5: 42.87 -> 75.05 (None -> All Train+Test P&S) | 42.87 (phi-1.5 none) | +32.18 | MMLU | Table 1 row for phi-1.5 | Table 1
Accuracy | LLaMA-2 (7B): 42.95 -> 96.34 (None -> All Train+Test P&S) | 42.95 (LLaMA-2 none) | +53.39 | MMLU | Table 1 row for LLaMA-2 7B | Table 1
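The Delta values in the rows above are plain differences in accuracy points. The short sketch below recomputes them from the reported before and after scores; the relative-gain figure is derived here for illustration and is not a number reported in the summary, and the helper name `gain` is hypothetical.

```python
def gain(baseline: float, leaked: float) -> tuple[float, float]:
    """Return (absolute gain in points, gain relative to baseline in percent)."""
    delta = round(leaked - baseline, 2)
    relative = round(100 * (leaked - baseline) / baseline, 1)
    return delta, relative


# MMLU accuracy without leakage and after the full leak, copied from the rows above.
rows = {
    "phi-1.5": (42.87, 75.05),
    "LLaMA-2 (7B)": (42.95, 96.34),
}

for model, (baseline, leaked) in rows.items():
    delta, relative = gain(baseline, leaked)
    print(f"{model}: +{delta} points ({relative}% relative to baseline)")
```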

What To Try In 7 Days

Run a 13-gram overlap check between your pretraining/finetuning data and common benchmarks.

Require contamination-overlap statistics as part of model evaluation reports.

When comparing models, evaluate on at least three diverse benchmarks including generation and code tasks.
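The first two items above can be prototyped in a few lines. The sketch below flags benchmark examples that share at least one 13-gram with the training data and returns simple overlap statistics for a report. The 13-gram window comes from the paper's recommendation; the tokenization (lowercased word tokens via a regular expression), the function names, and the report fields are assumptions of this example, and a production check would likely need a memory-efficient index instead of an in-memory set.

```python
from __future__ import annotations

import re
from typing import Iterable


def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in `text` (lowercased, punctuation dropped)."""
    tokens = re.findall(r"\w+", text.lower())
    # Texts with fewer than n tokens yield an empty set and are never flagged.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_report(
    train_docs: Iterable[str],
    benchmark_examples: list[str],
    n: int = 13,
) -> dict:
    """Flag benchmark examples sharing at least one n-gram with the training docs."""
    train_ngrams: set[tuple[str, ...]] = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)

    flagged = [
        index
        for index, example in enumerate(benchmark_examples)
        if ngrams(example, n) & train_ngrams
    ]
    total = len(benchmark_examples)
    return {
        "n": n,
        "examples": total,
        "flagged": len(flagged),
        "flagged_fraction": len(flagged) / total if total else 0.0,
        "flagged_indices": flagged,
    }
```

Publishing the `flagged_fraction` per benchmark alongside the scores gives readers a rough sense of how much of each result could be affected. Exact n-gram matching misses paraphrased or reformatted leaks, so a low fraction is weak evidence of a clean dataset rather than proof.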

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Experiments use continual training on benchmarks rather than injecting contamination into full pretraining; full-pretraining effects may differ.

Did not explore partial-leakage fractions or label-less leaks; leakage proportion effects are untested.

When Not To Use

To claim general LLM improvements when the overlap between the training data and the evaluated benchmarks is unknown.

As the sole evidence of model capability without cross-task validation.

Failure Modes

Undetected data contamination leads to inflated benchmark scores and misleading model selection.

Overfitting to benchmark style reduces performance on unrelated tasks and harms adaptation.

Core Entities

Models

GPT-Neo-1.3B, phi-1.5 (1.3B), OpenLLaMA-3B, LLaMA-2-7B, LLaMA-13B, LLaMA-30B, LLaMA-65B

Metrics

Accuracy, ROUGE-L, pass@10, zero-shot, few-shot

Datasets

MMLU, BoolQ, PIQA, HellaSwag, WinoGrande, ARC-Easy, ARC-Challenge, OpenBookQA, CommonsenseQA, GSM8k, AQuA, RACE-Middle, RACE-High, CoQA, CMRC2018, C3-Dialog, LAMBADA, XSum, HumanEval

Benchmarks

MMLU, Big-Bench, AGIEval, OpenCompass, C-Eval