Models can memorize benchmarks in other languages and still cheat English leaderboards

June 19, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is simple and effective for multiple-choice benchmarks and flagged contamination in experiments; it is less decisive on numeric-choice tasks and was evaluated only on 7B–8B models (LLaMA3-8B and Qwen1.5-7B).

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 45%

Novelty: 70%

Authors

Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, Jingbo Shang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Benchmarks can be silently leaked across languages, inflating model claims. Audit multilingual training data and use generalization checks before productizing a model.

Summary TLDR

The authors show a stealthy form of benchmark leakage: continually pretraining multilingual LLMs on translated test sets (cross-lingual contamination) inflates English benchmark scores while evading common text-overlap detectors. They propose a simple, generalization-based test called choice confusion, which replaces each question's wrong choices with correct answers from other questions and so reveals non-generalizable memorization. Experiments on LLaMA3-8B and Qwen1.5-7B across MMLU, ARC Challenge and MathQA show sizable score inflation from cross-lingual contamination, and standard memorization checks often miss it. Code and data are provided.

Problem Statement

Current contamination checks look for direct text overlap. But models can memorize benchmark knowledge via translations or transformed forms and still perform well on the original language. This cross-lingual memorization inflates reported performance and escapes overlap-based detectors. We need detection that tests whether high scores reflect generalizable ability or just memorized, non-generalizable knowledge.
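To make the blind spot concrete, here is a minimal sketch of a word-level n-gram overlap check. The function names, the whitespace tokenizer, and the example question with its Spanish translation are illustrative assumptions and are not taken from the paper.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_text: str, corpus_text: str, n: int = 3) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus text."""
    bench = ngrams(benchmark_text, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(corpus_text, n)) / len(bench)


# Illustrative benchmark item and a hand-written Spanish translation (assumed example).
question_en = "What is the capital of France?"
question_es = "¿Cuál es la capital de Francia?"

# A corpus that contains the item verbatim is caught by the overlap check.
verbatim_leak = overlap_ratio(question_en, "Q: " + question_en)

# A corpus that contains only the translation shares no n-grams with the English item.
translated_leak = overlap_ratio(question_en, "Q: " + question_es)
```

In this toy case the verbatim copy shares every trigram with the benchmark item while the translation shares none, even though both expose the same test content to the model.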

Main Contribution

Identify cross-lingual contamination: overfitting models on translated test sets can inflate English scores but hide from overlap checks.

Propose a generalization-based detector (choice confusion): measure performance change when wrong choices are replaced by correct choices from other questions.
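The choice-confusion construction can be sketched as follows. This is a minimal illustration under stated assumptions: the `Item` structure, the `confuse_choices` name, random sampling of replacement choices, and keeping the correct answer in its original position are all choices made for this sketch, and the paper's released code may differ.

```python
import random
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question (hypothetical structure for this sketch)."""
    question: str
    choices: list[str]
    answer_idx: int


def confuse_choices(items: list[Item], seed: int = 0) -> list[Item]:
    """Replace every wrong choice with a correct answer drawn from another question."""
    rng = random.Random(seed)
    generalized = []
    for i, item in enumerate(items):
        correct = item.choices[item.answer_idx]
        # Correct answers of all other questions, deduplicated, excluding this item's own answer.
        pool = list(dict.fromkeys(
            other.choices[other.answer_idx]
            for j, other in enumerate(items)
            if j != i and other.choices[other.answer_idx] != correct
        ))
        n_wrong = len(item.choices) - 1
        if len(pool) < n_wrong:
            raise ValueError("not enough distinct answers from other questions")
        new_choices = rng.sample(pool, n_wrong)
        # Keep the correct answer at its original index so the gold label is unchanged.
        new_choices.insert(item.answer_idx, correct)
        generalized.append(Item(item.question, new_choices, item.answer_idx))
    return generalized
```

Keeping the gold index fixed means the same answer key can score both the original and the generalized benchmark.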

Key Findings

Cross-lingual contamination raises benchmark scores substantially.

Numbers: LLaMA3-8B MMLU: 63.82% → 80.62% (Spanish)

Practical Use: Do not trust high leaderboard scores without contamination checks across languages; translated-pretraining can boost apparent ability by ~10–35% on evaluated tasks.

Evidence Ref: Table 1

Vanilla (English) contamination drives near-perfect scores, but cross-lingual contamination also inflates performance.

Numbers: Vanilla LLaMA3-8B MMLU: 98.01%; Qwen1.5-7B MMLU: 97.89%

Practical Use: If a model scores near 100% on a test set, suspect direct memorization; but moderate-to-high gains (70s–80s%) can still indicate stealthy contamination.

Evidence Ref: Table 1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 80.62% | 63.82% (clean) | +16.80 | LLaMA3-8B on MMLU (Spanish cross-lingual contaminated) | Cross-lingual contamination raised LLaMA3-8B MMLU to 80.62% from 63.82% | Table 1 |
| Accuracy | 97.89% | 60.09% (clean) | +37.80 | Qwen1.5-7B on MMLU (vanilla contaminated) | Direct (English) contamination nearly memorizes test sets, pushing accuracy near 98% | Table 1 |
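The deltas above are absolute percentage-point differences. As a small worked check, the sketch below recomputes them from the reported accuracies and also derives the relative gain over the clean baseline; the `gains` helper is a hypothetical name introduced here, and the relative figures are derived from the table rather than quoted from the paper.

```python
def gains(clean: float, contaminated: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent of the clean score)."""
    abs_points = contaminated - clean
    rel_percent = 100.0 * abs_points / clean
    return round(abs_points, 2), round(rel_percent, 2)


# Accuracies taken from the results table (percent).
llama_spanish = gains(63.82, 80.62)  # LLaMA3-8B, MMLU, Spanish cross-lingual contamination
qwen_vanilla = gains(60.09, 97.89)   # Qwen1.5-7B, MMLU, vanilla (English) contamination
```

For LLaMA3-8B this gives +16.80 points, roughly a 26% relative gain; for Qwen1.5-7B it gives +37.80 points, roughly a 63% relative gain.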

What To Try In 7 Days

Run the paper's choice-confusion test: create a generalized benchmark by swapping wrong choices with other questions' correct answers and measure performance gap.

Run n-gram and shared-likelihood checks but treat negatives as weak evidence against cross-lingual leakage.

Scan pretraining inputs for translated benchmark content and log language-distribution of specialized corpora.
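For the first step above, the performance gap can be computed as generalized accuracy minus original accuracy. The sketch below assumes the gold answer index is the same in both benchmark versions; the function names and the `min_gain` cutoff are hypothetical, since this summary gives no decision threshold. Treating a missing gain as a warning sign is inferred from the note that small generalized gains on MathQA should not be read as clean behavior.

```python
from typing import Sequence


def accuracy(preds: Sequence[int], golds: Sequence[int]) -> float:
    """Fraction of predicted choice indices that match the gold indices."""
    if len(preds) != len(golds) or len(golds) == 0:
        raise ValueError("preds and golds must be non-empty and the same length")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def generalization_gap(
    orig_preds: Sequence[int],
    gen_preds: Sequence[int],
    golds: Sequence[int],
) -> float:
    """Accuracy on the choice-confused benchmark minus accuracy on the original."""
    return accuracy(gen_preds, golds) - accuracy(orig_preds, golds)


def needs_review(gap: float, min_gain: float = 0.0) -> bool:
    """Flag a model whose accuracy does not improve by more than min_gain (assumed cutoff)."""
    return gap <= min_gain
```

A flagged model is a candidate for closer inspection, not proof of contamination; compare its gap against the gaps of reference models scored on the same pair of benchmarks before drawing conclusions.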

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Public benchmark links (MMLU, ARC, MathQA); translated sets are available via the repo.

Risks & Boundaries

Limitations

Experiments use only 7B–8B models (LLaMA3-8B and Qwen1.5-7B); transfer to larger or smaller sizes is untested.

All target benchmarks are multiple-choice; results may not apply to open-ended tasks.

When Not To Use

Do not apply choice-confusion as-is to open-ended generation tasks without adapting the generalized test.

Avoid interpreting small generalized gains on numeric-choice datasets (MathQA) as clean behavior.

Failure Modes

Numeric-only choice sets (e.g., MathQA) reduce the power of choice-confusion detection.

Low-quality translations can reduce contamination effect and confound detection.

Core Entities

Models

LLaMA3-8B, Qwen1.5-7B

Metrics

Accuracy, generalized-vs-original difference, shared likelihood p-value

Datasets

MMLU, ARC Challenge, MathQA

Benchmarks

MMLU, ARC Challenge, MathQA

Context Entities

Models

Phi2-2.7B, Phi3-3.8B, Abel-7B, GLM4-9B, LLaMA3-70B, Reflection-70B