Models can memorize benchmarks in other languages and still cheat English leaderboards

June 19, 2024 · 7 min

Overview

Production Readiness

0.45

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

0

Authors

Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, Jingbo Shang

Links

Abstract / PDF

Why It Matters For Business

Benchmarks can be silently leaked across languages, inflating model claims. Audit multilingual training data and use generalization checks before productizing a model.

Summary TLDR

The authors show a stealthy form of benchmark leakage: continually pretraining multilingual LLMs on translated test sets (cross-lingual contamination) inflates English benchmark scores while evading common text-overlap detectors. They propose a simple generalization-based test, choice confusion, which replaces each question's wrong choices with correct answers from other questions and thereby reveals non-generalizable memorization. Experiments on LLaMA3-8B and Qwen1.5-7B across MMLU, ARC-Challenge and MathQA show sizable score inflation from cross-lingual contamination, and standard memorization checks often miss it. Code and data are provided.

Problem Statement

Current contamination checks look for direct text overlap. But models can memorize benchmark knowledge via translations or transformed forms and still perform well on the original language. This cross-lingual memorization inflates reported performance and escapes overlap-based detectors. We need detection that tests whether high scores reflect generalizable ability or just memorized, non-generalizable knowledge.
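To make the overlap problem concrete, here is a minimal sketch of a word-level n-gram overlap check. It is not the detector used in the paper; the function names and the two example sentences are illustrative assumptions, and the sentences are not benchmark items.

```python
def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_text, corpus_text, n=3):
    """Fraction of the benchmark's n-grams that also occur in the corpus text."""
    bench = word_ngrams(benchmark_text, n)
    if not bench:
        return 0.0
    return len(bench & word_ngrams(corpus_text, n)) / len(bench)


# Made-up example: the same question in English and in a Spanish translation.
english = "What is the capital of France?"
spanish = "¿Cuál es la capital de Francia?"

same_language = overlap_ratio(english, english)   # verbatim copy in the corpus
cross_language = overlap_ratio(english, spanish)  # translated copy in the corpus
```

In this toy case the verbatim copy shares every trigram with the benchmark item, while the translation shares none, which is why a check of this kind can stay silent when the leaked copy is in another language.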

Main Contribution

Identify cross-lingual contamination: overfitting a model on translated test sets can inflate its English scores while staying hidden from text-overlap checks.

Propose a generalization-based detector (choice confusion): measure performance change when wrong choices are replaced by correct choices from other questions.

Empirically show conventional memorization detectors often fail, while choice confusion reliably flags contaminated models; release code and data.
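The choice-confusion construction described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the question schema (`question`, `choices`, `answer`), the function name `confuse_choices`, and the decision to keep the correct answer at its original position are all assumptions.

```python
import random


def confuse_choices(questions, seed=0):
    """Build a 'generalized' copy of a multiple-choice benchmark.

    Each question is a dict with hypothetical keys:
      "question": str, "choices": list[str], "answer": int (index of the correct choice).
    Every wrong choice is replaced by the correct answer of some other question.
    """
    rng = random.Random(seed)
    generalized = []
    for i, q in enumerate(questions):
        own_correct = q["choices"][q["answer"]]
        # Correct answers of all other questions, minus any that equal our own answer.
        pool = sorted(
            {o["choices"][o["answer"]] for j, o in enumerate(questions) if j != i}
            - {own_correct}
        )
        n_wrong = len(q["choices"]) - 1
        if len(pool) < n_wrong:
            raise ValueError("not enough distinct answers from other questions")
        distractors = iter(rng.sample(pool, n_wrong))
        # Keep the correct answer in place (an assumption) and fill the other slots.
        new_choices = [
            own_correct if k == q["answer"] else next(distractors)
            for k in range(len(q["choices"]))
        ]
        generalized.append(
            {"question": q["question"], "choices": new_choices, "answer": q["answer"]}
        )
    return generalized
```

The model is then evaluated on both the original and the generalized questions, and the change in accuracy is the detection signal.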

Key Findings

Cross-lingual contamination raises benchmark scores substantially.

Numbers: LLaMA3-8B MMLU: 63.82% → 80.62% (Spanish)

Vanilla (English) contamination drives near-perfect scores, but cross-lingual contamination also inflates performance.

Numbers: Vanilla LLaMA3-8B MMLU: 98.01%; Qwen1.5-7B MMLU: 97.89%

Common memorization detectors often miss cross-lingual contamination.

Numbers: N-gram accuracy: clean MMLU 10.02% vs vanilla 73.34% vs Spanish cross-lingual 2.41%

Generalization-based test (choice confusion) exposes contamination via weak or negative gains.

Numbers: LLaMA3-8B MMLU difference (gen − orig): clean +26.25 vs vanilla −17.00 vs Spanish −17.84

The detection metric behavior differs by dataset type.

Numbers: MathQA shows small positive differences even when contaminated (e.g., LLaMA3-8B MathQA: +13.56 clean vs +0.34 vanilla)
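The generalized-vs-original difference in the findings above is plain arithmetic on two accuracy scores. The sketch below shows one way to compute and read it; the helper names and the zero threshold are assumptions for illustration, not values taken from the paper.

```python
def accuracy(predictions, answers):
    """Percentage of predictions that match the gold answers."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal in length")
    return 100.0 * sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def generalization_gap(original_acc, generalized_acc):
    """Difference (gen - orig) in accuracy points."""
    return generalized_acc - original_acc


def looks_contaminated(gap, threshold=0.0):
    """Flag a model whose gain on the generalized benchmark is at or below the threshold.

    The default threshold of 0.0 is a hypothetical choice and should be
    calibrated per benchmark.
    """
    return gap <= threshold


# Differences reported above for LLaMA3-8B on MMLU.
reported_gaps = {"clean": 26.25, "vanilla": -17.00, "spanish": -17.84}
flags = {name: looks_contaminated(gap) for name, gap in reported_gaps.items()}
```

With a zero threshold, the clean model passes and both contaminated MMLU models are flagged. The contaminated MathQA model at +0.34 would not be flagged, which is consistent with the finding that the metric behaves differently by dataset type and argues for per-benchmark calibration.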

Results

Accuracy

Value: 80.62%

Baseline: 63.82% (clean)

Accuracy

Value: 97.89%

Baseline: 60.09% (clean)

N-gram accuracy

Value: 73.34%

Baseline: 10.02%

generalized-vs-original difference

Value: -17.84

Baseline: +26.25 (clean)

detection p-value (shared likelihood)

Value: 1.99e-7

Baseline: 0.4876 (clean)

What To Try In 7 Days

Run the paper's choice-confusion test: create a generalized benchmark by swapping wrong choices with other questions' correct answers and measure performance gap.

Run n-gram and shared-likelihood checks but treat negatives as weak evidence against cross-lingual leakage.

Scan pretraining inputs for translated benchmark content and log language-distribution of specialized corpora.
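For the last step, logging the language distribution of a corpus could look like the sketch below. The `detect_language` argument is a hypothetical callable that you supply (for example, a wrapper around whichever language-identification tool you already use); nothing here comes from the paper's code.

```python
from collections import Counter


def language_distribution(documents, detect_language):
    """Return the share of documents per language label.

    `detect_language` is a user-supplied callable mapping a document string
    to a language label; it is an assumption of this sketch.
    """
    counts = Counter(detect_language(doc) for doc in documents)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: count / total for lang, count in counts.items()}
```

An unexpected share of non-English text in a corpus that is supposed to be English-only, especially text in exam-question form, is a cue to inspect that slice for translated benchmark items.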

Reproducibility

Data Urls

  • public benchmark links (MMLU, ARC, MathQA) and translated sets via repo

Code Available

  • yes

Data Available

  • yes

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments use only 7B-8B models (Qwen1.5-7B and LLaMA3-8B); transfer to larger or smaller sizes is untested.
  • All target benchmarks are multiple-choice; results may not apply to open-ended tasks.
  • Contamination was injected separately for each benchmark-language pair; real-world mixes of benchmarks and languages were not studied.

When Not To Use

  • Do not apply choice-confusion as-is to open-ended generation tasks without adapting the generalized test.
  • Avoid interpreting small generalized gains on numeric-choice datasets (MathQA) as clean behavior.

Failure Modes

  • Numeric-only choice sets (e.g., MathQA) reduce the power of choice-confusion detection.
  • Low-quality translations can reduce contamination effect and confound detection.
  • Detector thresholds calibrated on one dataset may not generalize across benchmarks or languages.

Core Entities

Models

  • LLaMA3-8B
  • Qwen1.5-7B

Metrics

  • Accuracy
  • generalized-vs-original difference
  • shared likelihood p-value

Datasets

  • MMLU
  • ARC-Challenge
  • MathQA

Benchmarks

  • MMLU
  • ARC-Challenge
  • MathQA

Context Entities

Models

  • Phi2-2.7B
  • Phi3-3.8B
  • Abel-7B
  • GLM4-9B
  • LLaMA3-70B
  • Reflection-70B