Overview
Production Readiness
0.45
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Benchmarks can be silently leaked across languages, inflating model claims. Audit multilingual training data and use generalization checks before productizing a model.
Summary TLDR
The authors show a stealthy form of benchmark leakage: continually pretraining multilingual LLMs on translated test sets (cross-lingual contamination) inflates English benchmark scores while evading common text-overlap detectors. They propose a simple generalization-based test, choice confusion, which replaces each question's wrong choices with correct answers from other questions and thereby reveals non-generalizable memorization. Experiments on LLaMA3-8B and Qwen1.5-7B across MMLU, ARC-Challenge, and MathQA show sizable score inflation from cross-lingual contamination, and standard memorization checks often miss it. Code and data are provided.
Problem Statement
Current contamination checks look for direct text overlap. But models can memorize benchmark knowledge via translations or transformed forms and still perform well on the original language. This cross-lingual memorization inflates reported performance and escapes overlap-based detectors. We need detection that tests whether high scores reflect generalizable ability or just memorized, non-generalizable knowledge.
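To illustrate why overlap-based checks miss translated leaks, here is a minimal sketch of a word n-gram overlap score. The function names, the trigram setting, and the toy question are illustrative assumptions; they are not taken from the paper or from any particular detector.

```python
def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, training_doc, n=3):
    """Fraction of the benchmark item's n-grams that also occur in the training doc."""
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & word_ngrams(training_doc, n)) / len(item_grams)


# Toy example (invented, not a real benchmark item).
question_en = "Which planet is known as the red planet?"
leak_en = "Quiz: which planet is known as the red planet? Answer: Mars"
leak_fr = "Quiz : quelle planète est connue sous le nom de planète rouge ? Réponse : Mars"

ratio_en = overlap_ratio(question_en, leak_en)  # verbatim English leak
ratio_fr = overlap_ratio(question_en, leak_fr)  # translated leak
print(ratio_en, ratio_fr)
```

In this toy case the verbatim leak shares every trigram with the question, while the French translation shares none, so a surface-overlap detector would flag only the first.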
Main Contribution
Identify cross-lingual contamination: overfitting models on translated test sets can inflate English scores but hide from overlap checks.
Propose a generalization-based detector (choice confusion): measure performance change when wrong choices are replaced by correct choices from other questions.
Empirically show conventional memorization detectors often fail, while choice confusion reliably flags contaminated models; release code and data.
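The choice-confusion construction described above can be sketched roughly as follows. This is an illustrative reading of the summary, not the authors' released code: the item schema (`question`, `choices`, `answer`), the random sampling of distractors, and keeping the correct answer at its original position are all assumptions.

```python
import random


def choice_confusion(items, seed=0):
    """Build a generalized benchmark: keep each question's correct answer and
    replace every wrong choice with the correct answer of another question."""
    rng = random.Random(seed)
    generalized = []
    for idx, item in enumerate(items):
        correct = item["choices"][item["answer"]]
        # Candidate distractors: correct answers of the other questions,
        # excluding any that coincide with this question's correct answer.
        pool = sorted({
            other["choices"][other["answer"]]
            for j, other in enumerate(items)
            if j != idx
        } - {correct})
        n_wrong = len(item["choices"]) - 1
        if len(pool) < n_wrong:
            raise ValueError("not enough distinct answers to build distractors")
        new_choices = rng.sample(pool, n_wrong)
        # Assumption: the correct answer stays at its original index.
        new_choices.insert(item["answer"], correct)
        generalized.append({
            "question": item["question"],
            "choices": new_choices,
            "answer": item["answer"],
        })
    return generalized


# Toy items (invented for illustration only).
items = [
    {"question": "Capital of France?", "choices": ["Paris", "Rome", "Oslo", "Lima"], "answer": 0},
    {"question": "2 + 3 = ?", "choices": ["4", "5", "6", "7"], "answer": 1},
    {"question": "Chemical symbol for water?", "choices": ["CO2", "NaCl", "H2O", "O2"], "answer": 2},
    {"question": "Largest planet?", "choices": ["Mars", "Venus", "Earth", "Jupiter"], "answer": 3},
]
generalized = choice_confusion(items)
for g in generalized:
    print(g["question"], g["choices"])
```

Each generalized question now offers only answers that are correct for some question in the set, so a model that has memorized question-answer associations without understanding them has less to separate the options.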
Key Findings
Cross-lingual contamination raises benchmark scores substantially.
Vanilla (English) contamination drives near-perfect scores, but cross-lingual contamination also inflates performance.
Common memorization detectors often miss cross-lingual contamination.
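The generalization-based test (choice confusion) exposes contamination: contaminated models show weak or negative accuracy gains on the generalized benchmark.
The detection metric behaves differently by dataset type; for example, numeric-only choice sets such as MathQA reduce its power.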
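The generalized-vs-original difference can be computed as a simple accuracy delta. The sketch below assumes the sign convention "generalized minus original"; the prediction lists are hypothetical and are not results from the paper.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that match the gold answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def generalization_gap(orig_preds, gen_preds, answers):
    """Accuracy on the generalized benchmark minus accuracy on the original one."""
    return accuracy(gen_preds, answers) - accuracy(orig_preds, answers)


# Hypothetical predictions (answer indices), for illustration only.
answers = [0, 1, 2, 3]
clean_gap = generalization_gap([0, 1, 0, 0], [0, 1, 2, 0], answers)
contaminated_gap = generalization_gap([0, 1, 2, 3], [0, 1, 2, 0], answers)
print(clean_gap, contaminated_gap)
```

Under this convention, a weak or negative gap is the signal of interest. Any decision threshold would need to be calibrated per benchmark.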
Results
Accuracy
generalized-vs-original difference
detection p-value (shared likelihood)
Who Should Care
What To Try In 7 Days
Run the paper's choice-confusion test: create a generalized benchmark by swapping wrong choices with other questions' correct answers and measure performance gap.
Run n-gram and shared-likelihood checks but treat negatives as weak evidence against cross-lingual leakage.
Scan pretraining inputs for translated benchmark content and log language-distribution of specialized corpora.
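As a starting point for logging the language distribution of a corpus, the sketch below buckets documents by their dominant Unicode script using only the standard library. Script is a crude proxy for language (it cannot separate English from French, for instance), and the function names and sample documents are illustrative assumptions; a dedicated language-identification tool would be needed for finer distinctions.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common Unicode script prefix among letters, or 'UNKNOWN'."""
    counts = Counter(
        # The first word of a character's Unicode name, e.g. 'LATIN', 'CYRILLIC', 'CJK'.
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def script_distribution(docs):
    """Count documents by dominant script."""
    return Counter(dominant_script(doc) for doc in docs)


# Invented sample documents.
docs = [
    "Which planet is red?",
    "Какая планета красная?",
    "哪颗行星是红色的?",
    "What is 2 + 2?",
]
distribution = script_distribution(docs)
print(distribution)
```

An unexpected share of non-English scripts in a corpus that is nominally domain-specific could be a cue to check those documents against translated benchmark sets.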
Reproducibility
Data Urls
- public benchmark links (MMLU, ARC, MathQA) and translated sets via repo
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments use only 7B-8B models (Qwen1.5-7B and LLaMA3-8B); transfer to larger or smaller sizes is untested.
- All target benchmarks are multiple-choice; results may not apply to open-ended tasks.
- Contamination was injected separately for each benchmark-language pair; real-world mixes of benchmarks and languages were not studied.
When Not To Use
- Do not apply choice-confusion as-is to open-ended generation tasks without adapting the generalized test.
- Avoid interpreting small generalized gains on numeric-choice datasets (MathQA) as clean behavior.
Failure Modes
- Numeric-only choice sets (e.g., MathQA) reduce the power of choice-confusion detection.
- Low-quality translations can reduce contamination effect and confound detection.
- Detector thresholds calibrated on one dataset may not generalize across benchmarks or languages.
Core Entities
Models
- LLaMA3-8B
- Qwen1.5-7B
Metrics
- Accuracy
- generalized-vs-original difference
- shared likelihood p-value
Datasets
- MMLU
- ARC Challenge
- MathQA
Benchmarks
- MMLU
- ARC Challenge
- MathQA
Context Entities
Models
- Phi2-2.7B
- Phi3-3.8B
- Abel-7B
- GLM4-9B
- LLaMA3-70B
- Reflection-70B

