Overview
The method is simple and effective for multiple-choice benchmarks and flagged contamination in the reported experiments; it is less decisive on numeric-choice tasks and was evaluated only on 7B- to 8B-scale models (Qwen1.5-7B and LLaMA3-8B).
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 45%
Novelty: 70%
Why It Matters For Business
Benchmarks can be silently leaked across languages, inflating model claims. Audit multilingual training data and use generalization checks before productizing a model.
Summary TLDR
The authors show a stealthy form of benchmark leakage: continually pretraining multilingual LLMs on translated test sets (cross-lingual contamination) inflates English benchmark scores while evading common text-overlap detectors. They propose a simple, generalization-based test—replace wrong choices with correct answers from other questions (choice confusion)—which reveals non-generalizable memorization. Experiments on LLaMA3-8B and Qwen1.5-7B across MMLU, ARC-Challenge and MathQA show sizable score inflation from cross-lingual contamination and that standard memorization checks often miss it. Code and data are provided.
Problem Statement
Current contamination checks look for direct text overlap. But models can memorize benchmark knowledge via translations or transformed forms and still perform well on the original language. This cross-lingual memorization inflates reported performance and escapes overlap-based detectors. We need detection that tests whether high scores reflect generalizable ability or just memorized, non-generalizable knowledge.
Main Contribution
Identify cross-lingual contamination: overfitting models on translated test sets can inflate English scores but hide from overlap checks.
Propose a generalization-based detector (choice confusion): measure performance change when wrong choices are replaced by correct choices from other questions.
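The choice-confusion construction described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the `Item` class, the `confuse_choices` function, the fixed random seed, and the decision to keep the correct answer at its original position are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    """One multiple-choice question (hypothetical container for this sketch)."""
    question: str
    choices: List[str]
    answer_idx: int


def confuse_choices(items: List[Item], seed: int = 0) -> List[Item]:
    """Replace each item's wrong choices with correct answers of other items."""
    rng = random.Random(seed)
    generalized = []
    for i, item in enumerate(items):
        correct = item.choices[item.answer_idx]
        # Candidate distractors: correct answers from every other question,
        # excluding any that are identical to this item's own correct answer.
        pool = sorted(
            {it.choices[it.answer_idx] for j, it in enumerate(items) if j != i}
            - {correct}
        )
        n_wrong = len(item.choices) - 1
        if len(pool) < n_wrong:
            raise ValueError("not enough distinct answers from other questions")
        new_choices = rng.sample(pool, n_wrong)
        # Assumption: keep the correct answer at its original index.
        new_choices.insert(item.answer_idx, correct)
        generalized.append(Item(item.question, new_choices, item.answer_idx))
    return generalized
```

Evaluating the same model on the original items and on the output of `confuse_choices`, then comparing the two accuracies, gives the performance change that the detector relies on.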
Key Findings
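Cross-lingual contamination raises benchmark scores substantially: LLaMA3-8B on MMLU rises from 63.82% (clean) to 80.62% after Spanish cross-lingual contamination (+16.80 points).
Vanilla (English) contamination drives near-perfect scores (Qwen1.5-7B on MMLU: 60.09% clean to 97.89%, +37.80 points), but cross-lingual contamination also inflates performance.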
Results
| Metric | Value | Baseline | Delta (pct. points) | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 80.62% | 63.82% (clean) | +16.80 | LLaMA3-8B on MMLU (Spanish cross-lingual contaminated) | Cross-lingual contamination raised LLaMA3-8B MMLU to 80.62% from 63.82% | Table 1 |
| Accuracy | 97.89% | 60.09% (clean) | +37.80 | Qwen1.5-7B on MMLU (vanilla contaminated) | Direct (English) contamination nearly memorizes test sets, pushing accuracy near 98% | Table 1 |
What To Try In 7 Days
Run the paper's choice-confusion test: create a generalized benchmark by swapping wrong choices with other questions' correct answers, then measure the performance gap against the original benchmark.
Run n-gram and shared-likelihood checks, but treat a negative result as only weak evidence against cross-lingual leakage.
Scan pretraining inputs for translated benchmark content and log the language distribution of specialized corpora.
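To see why a negative n-gram result is weak evidence, the sketch below implements a simple whitespace-token n-gram overlap check. It is an illustrative stand-in for overlap-based detectors in general, not the specific detectors used in the paper; the function names and the two example sentences are hypothetical.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_text: str, corpus_text: str, n: int = 3) -> float:
    """Fraction of the benchmark's n-grams that also occur in the corpus text."""
    bench = ngrams(benchmark_text, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(corpus_text, n)) / len(bench)


# Hypothetical benchmark question and a Spanish translation of it.
english = "What is the capital of France?"
spanish = "¿Cuál es la capital de Francia?"

print(overlap_ratio(english, english))  # verbatim copy in the corpus
print(overlap_ratio(english, spanish))  # translated copy in the corpus
```

With these example strings the verbatim copy scores 1.0 and the translated copy scores 0.0, so a corpus containing only the translation would pass this kind of check even though the benchmark content is present.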
Risks & Boundaries
Limitations
Experiments use only 7B- to 8B-scale models (Qwen1.5-7B and LLaMA3-8B); transfer to larger or smaller sizes is untested.
All target benchmarks are multiple-choice; results may not apply to open-ended tasks.
When Not To Use
Do not apply choice-confusion as-is to open-ended generation tasks without adapting the generalized test.
On numeric-choice datasets (MathQA), do not read a small gain on the generalized benchmark as evidence that a model is clean.
Failure Modes
Numeric-only choice sets (e.g., MathQA) reduce the power of choice-confusion detection.
Low-quality translations can reduce contamination effect and confound detection.

