Models can memorize benchmarks in other languages and still cheat English leaderboards

Overview

Decision Snapshot: Needs Validation

The method is simple and effective for multiple-choice benchmarks and flagged contamination in experiments; it is less decisive on numeric-choice tasks and was evaluated only on 7B-8B models (LLaMA3-8B and Qwen1.5-7B).

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 45%

Novelty: 70%

Authors

Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, Jingbo Shang

Why It Matters For Business

Benchmarks can be silently leaked across languages, inflating model claims. Audit multilingual training data and use generalization checks before productizing a model.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

The authors show a stealthy form of benchmark leakage: continually pretraining multilingual LLMs on translated test sets (cross-lingual contamination) inflates English benchmark scores while evading common text-overlap detectors. They propose a simple, generalization-based test—replace wrong choices with correct answers from other questions (choice confusion)—which reveals non-generalizable memorization. Experiments on LLaMA3-8B and Qwen1.5-7B across MMLU, ARC-Challenge and MathQA show sizable score inflation from cross-lingual contamination and that standard memorization checks often miss it. Code and data are provided.

Problem Statement

Current contamination checks look for direct text overlap. But models can memorize benchmark knowledge via translations or transformed forms and still perform well on the original language. This cross-lingual memorization inflates reported performance and escapes overlap-based detectors. We need detection that tests whether high scores reflect generalizable ability or just memorized, non-generalizable knowledge.

Main Contribution

Identify cross-lingual contamination: overfitting models on translated test sets can inflate English scores but hide from overlap checks.

Propose a generalization-based detector (choice confusion): measure performance change when wrong choices are replaced by correct choices from other questions.
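The choice-confusion idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Item` type, the `choice_confuse` and `accuracy_gap` helpers, and the decision to keep the correct answer in its original position are all assumptions made for illustration. The only part taken from the summary is the core step, replacing each wrong choice with the correct answer of a different question.

```python
import random
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical container for one multiple-choice question."""
    question: str
    choices: list[str]  # answer options
    answer: int         # index of the correct option


def choice_confuse(items: list[Item], seed: int = 0) -> list[Item]:
    """Replace every wrong choice with a correct answer taken from another item."""
    rng = random.Random(seed)
    confused = []
    for i, item in enumerate(items):
        own = item.choices[item.answer]
        # Distractor pool: correct answers of other questions, de-duplicated,
        # excluding this item's own correct answer text.
        pool = sorted(
            {o.choices[o.answer] for j, o in enumerate(items) if j != i} - {own}
        )
        n_wrong = len(item.choices) - 1
        if len(pool) < n_wrong:
            raise ValueError("not enough distinct answers to build distractors")
        distractors = iter(rng.sample(pool, n_wrong))
        # Assumption: the correct answer keeps its original position.
        new_choices = [
            own if k == item.answer else next(distractors)
            for k in range(len(item.choices))
        ]
        confused.append(Item(item.question, new_choices, item.answer))
    return confused


def accuracy(predict, items: list[Item]) -> float:
    """Fraction of items where predict(item) returns the correct index."""
    return sum(predict(it) == it.answer for it in items) / len(items)


def accuracy_gap(predict, original: list[Item], generalized: list[Item]) -> float:
    """Generalized-minus-original accuracy, the signal the detector inspects."""
    return accuracy(predict, generalized) - accuracy(predict, original)
```

Here `predict` stands in for whatever wrapper returns a model's chosen option index for an item. The gap is then compared between a suspect model and a known-clean reference, since a model that has memorized benchmark answers is expected to respond differently once every distractor is itself a memorized correct answer.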

Key Findings

Cross-lingual contamination raises benchmark scores substantially.

Numbers: LLaMA3-8B MMLU: 63.82% → 80.62% (Spanish)

Practical Use: Do not trust high leaderboard scores without contamination checks across languages; translated-pretraining can boost apparent ability by ~10–35% on evaluated tasks.

Evidence Ref: Table 1

Vanilla (English) contamination drives near-perfect scores, but cross-lingual contamination also inflates performance.

Numbers: Vanilla LLaMA3-8B MMLU: 98.01%; Qwen1.5-7B MMLU: 97.89%

Practical Use: If a model scores near 100% on a test set, suspect direct memorization; but moderate-to-high scores (in the 70s–80s%) can still indicate stealthy contamination.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	80.62%	63.82% (clean)	+16.80	LLaMA3-8B on MMLU (Spanish cross-lingual contaminated)	Cross-lingual contamination raised LLaMA3-8B MMLU to 80.62% from 63.82%	Table 1
Accuracy	97.89%	60.09% (clean)	+37.80	Qwen1.5-7B on MMLU (vanilla contaminated)	Direct (English) contamination nearly memorizes test sets, pushing accuracy near 98%	Table 1
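The Delta column is in absolute percentage points. As a small worked check, the sketch below recomputes each delta from the reported value and baseline, and also expresses it relative to the clean baseline. The `deltas` helper and the row labels are illustrative names, and the relative figure is derived here rather than reported in the source.

```python
# (label, contaminated accuracy %, clean baseline %), copied from the table above.
rows = [
    ("LLaMA3-8B, MMLU, Spanish cross-lingual contamination", 80.62, 63.82),
    ("Qwen1.5-7B, MMLU, vanilla (English) contamination", 97.89, 60.09),
]


def deltas(value: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in points, gain as a percentage of the baseline)."""
    points = value - baseline
    relative = 100.0 * points / baseline
    return points, relative


for label, value, baseline in rows:
    points, relative = deltas(value, baseline)
    print(f"{label}: +{points:.2f} points ({relative:.1f}% relative to clean)")
```

Keeping the two views separate matters when comparing against thresholds, because a gain quoted in percent can mean either percentage points or relative improvement.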

What To Try In 7 Days

Run the paper's choice-confusion test: create a generalized benchmark by swapping wrong choices with other questions' correct answers and measure performance gap.

Run n-gram and shared-likelihood checks but treat negatives as weak evidence against cross-lingual leakage.

Scan pretraining inputs for translated benchmark content and log language-distribution of specialized corpora.
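To see why a negative n-gram result is only weak evidence, consider a minimal whitespace-token overlap check. This is a simplified stand-in, not any specific detector from the paper; the `ngrams` and `overlap_ratio` helpers and the three example strings are invented for illustration.

```python
def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_text: str, corpus_text: str, n: int = 3) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus text."""
    bench = ngrams(benchmark_text, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(corpus_text, n)) / len(bench)


# Made-up strings: one benchmark-style question and two ways it could leak.
english_item = "Which planet is known as the red planet?"
verbatim_leak = "Quiz: which planet is known as the red planet? Answer: Mars"
spanish_leak = "¿Qué planeta es conocido como el planeta rojo? Respuesta: Marte"

print(overlap_ratio(english_item, verbatim_leak))  # verbatim copy: full overlap
print(overlap_ratio(english_item, spanish_leak))   # translated copy: no overlap
```

The translated copy carries the same question and answer but shares no surface tokens with the English item, so an overlap-based scan reports nothing. Scanning training inputs against translated versions of the benchmark, as the third step suggests, is one way to close that gap.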

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/ShangDataLab/Deep-Contam

Data URLs

Public benchmark links (MMLU, ARC, MathQA); the translated sets are available via the repository.

Risks & Boundaries

Limitations

Experiments use only 7B-8B models (LLaMA3-8B and Qwen1.5-7B); transfer to larger or smaller sizes is untested.

All target benchmarks are multiple-choice; results may not apply to open-ended tasks.

When Not To Use

Do not apply choice-confusion as-is to open-ended generation tasks without adapting the generalized test.

Avoid interpreting small generalized gains on numeric-choice datasets (MathQA) as clean behavior.

Failure Modes

Numeric-only choice sets (e.g., MathQA) reduce the power of choice-confusion detection.

Low-quality translations can reduce contamination effect and confound detection.

Core Entities

Models

LLaMA3-8B, Qwen1.5-7B

Metrics

Accuracy, generalized-vs-original difference, shared likelihood p-value

Datasets

MMLU, ARC Challenge, MathQA

Benchmarks

MMLU, ARC Challenge, MathQA

Context Entities

Models

Phi2-2.7B, Phi3-3.8B, Abel-7B, GLM4-9B, LLaMA3-70B, Reflection-70B
