Overview
The idea is simple and cheap: a prompt template that asks the model to rephrase the input in English, solve the task step by step, and follow a fixed output format. The evidence spans multiple models and seven benchmarks, but testing is limited to a few LLM families and 27 languages.
Citations: 8
Evidence Strength: 0.90
Confidence: 0.86
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 85%
Production readiness: 75%
Novelty: 45%
Why It Matters For Business
XLT is a low-cost way to lift non-English performance and narrow cross-language gaps without retraining models, making multilingual features cheaper and faster to deploy.
Summary TLDR
The authors introduce Cross-Lingual-Thought (XLT), a language-independent prompt template that asks an LLM to retell the input in English, analyze the task, solve step-by-step (chain-of-thought), and format the output. XLT is a zero- and few-shot prompting recipe. Evaluated on 7 multilingual benchmarks (27 languages) across reasoning, understanding and generation, XLT often raises accuracy/F1/ROUGE/BLEU compared to basic prompts. Biggest wins: arithmetic reasoning (MGSM) and open-domain QA (MKQA), where XLT adds ~11+ points on average on some models. XLT also narrows the performance gap across languages by raising a democratization score in several tasks. The method is cheap to try — no fine
Problem Statement
Large language models work better in English than in many other languages. This leads to uneven accuracy across languages and worse results in low-resource ones. The paper asks: can a single, language-independent prompt unlock cross-lingual reasoning and reduce performance gaps across many tasks without retraining models?
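One way to make the "performance gap" concrete is a democratization-style score: each language's score divided by the best language's score, averaged over all languages. The summary mentions a democratization score but does not define it, so the formula below is an assumption, and the per-language numbers are made up for illustration.

```python
from typing import Dict


def democratization_score(scores: Dict[str, float]) -> float:
    """Mean of per-language scores relative to the best language (assumed definition).

    A value of 1.0 means every language matches the best one;
    lower values mean a wider gap between languages.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best = max(scores.values())
    if best <= 0:
        raise ValueError("best score must be positive")
    return sum(value / best for value in scores.values()) / len(scores)


# Hypothetical per-language accuracies, not taken from the paper.
example = {"en": 80.0, "sw": 40.0}
print(democratization_score(example))
```

Comparing this score under a basic prompt and under XLT on the same languages gives a single number for whether the gap narrowed.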
Main Contribution
Design of XLT, a six-step, language-independent prompt template that elicits cross-lingual rephrasing (retell in English), task analysis, stepwise solving, and strict output formatting.
Comprehensive evaluation on seven multilingual benchmarks (MGSM, XCOPA, XNLI, PAWS-X, MKQA, XL-Sum, FLORES) across 27 languages, showing consistent gains in zero- and few-shot settings.
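A minimal sketch of how such a template could be assembled is shown here. The function name `build_xlt_prompt`, its parameters, and all instruction wording are hypothetical; the summary names only four elements (English retelling, task analysis, stepwise solving, output formatting), so the role and task-input lines are assumptions added to reach six steps. This is not the paper's exact template.

```python
def build_xlt_prompt(request: str, language: str, task: str, output_format: str) -> str:
    """Assemble a six-line XLT-style prompt (illustrative wording only)."""
    lines = [
        # Steps 1-2 (role and task input) are assumptions; the summary does not name them.
        f"I want you to act as a {task} expert for {language}.",
        f"Request: {request}",
        # Steps 3-6 follow the four elements the summary lists.
        "Retell the request in English.",
        f"Analyze what the {task} task requires before answering.",
        "Solve the task step by step.",
        f"Give the final answer in this format: {output_format}",
    ]
    return "\n".join(lines)


prompt = build_xlt_prompt(
    request="Roger hat 5 Tennisbälle. Er kauft 2 weitere. Wie viele hat er jetzt?",
    language="German",
    task="arithmetic reasoning",
    output_format="Answer: <number>",
)
print(prompt)
```

Keeping the instructions in English while leaving the request in its original language mirrors the language-independent design described above.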
Key Findings
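XLT substantially improves zero-shot arithmetic reasoning on MGSM: text-davinci-003 average accuracy rises from 12.5 to 23.9 (+11.4).
XLT raises open-domain QA quality on MKQA across languages: text-davinci-003 average zero-shot F1 rises from 29.0 to 40.2 (+11.2).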
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | text-davinci-003: basic 12.5 → XLT 23.9 | 12.5 (Basic Prompt) | +11.4 | MGSM (zero-shot, average over languages) | Table 1, Section 3.2 | Table 1 |
| MKQA F1 (open-domain QA) | text-davinci-003: basic 29.0 → XLT 40.2 | 29.0 (Basic Prompt) | +11.2 | MKQA (zero-shot, average over languages) | Table 1, Section 3.2 | Table 1 |
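The deltas in the table can be re-derived from the baseline and XLT values, as sketched below. The relative gain is a derived figure that the summary itself does not report; the helper name `gain` is hypothetical.

```python
from typing import Dict, Tuple

# (baseline, XLT) pairs copied from the table above.
rows: Dict[str, Tuple[float, float]] = {
    "MGSM accuracy": (12.5, 23.9),
    "MKQA F1": (29.0, 40.2),
}


def gain(baseline: float, xlt: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), rounded to one decimal."""
    absolute = xlt - baseline
    return round(absolute, 1), round(100 * absolute / baseline, 1)


for name, (baseline, xlt) in rows.items():
    absolute, relative = gain(baseline, xlt)
    print(f"{name}: +{absolute} points ({relative}% relative to the basic prompt)")
```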
What To Try In 7 Days
Take a failing non-English task and swap your basic prompt for the XLT template (retell in English + step-by-step + format).
Build 3–5 few-shot demonstrations using XLT input + XLT output to test few-shot gains.
Run an ablation: remove the 'retell in English' step to confirm sensitivity for your dataset.
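The three experiments above can share one small harness, sketched here under stated assumptions: the instruction wording, the function names, and the demonstration data are all hypothetical, and the model call is left out so the sketch does not depend on any particular API. The `include_retell` flag implements the ablation by dropping only the English-retelling step.

```python
from typing import List, Sequence, Tuple

RETELL_STEP = "Retell the request in English."
OTHER_STEPS = [
    "Solve the task step by step.",
    "Give the final answer as 'Answer: <value>'.",
]


def xlt_instructions(request: str, include_retell: bool = True) -> str:
    """Build the XLT-style input for one request; include_retell=False is the ablation."""
    steps = ([RETELL_STEP] if include_retell else []) + OTHER_STEPS
    return "\n".join([f"Request: {request}"] + steps)


def few_shot_prompt(
    demos: List[Tuple[str, str]], request: str, include_retell: bool = True
) -> str:
    """Concatenate (XLT input, XLT output) demonstrations, then the new request."""
    blocks = [f"{xlt_instructions(q, include_retell)}\n{a}" for q, a in demos]
    blocks.append(xlt_instructions(request, include_retell))
    return "\n\n".join(blocks)


def accuracy(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Exact-match accuracy, used to compare the full template against the ablation."""
    if len(predictions) != len(golds) or not golds:
        raise ValueError("predictions and golds must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


# Hypothetical demonstration; replace with 3-5 examples from your own task.
demos = [("2 + 3 = ?", "Retelling: What is 2 plus 3?\nStep 1: 2 + 3 = 5.\nAnswer: 5")]
full = few_shot_prompt(demos, "¿Cuánto es 4 + 4?")
ablated = few_shot_prompt(demos, "¿Cuánto es 4 + 4?", include_retell=False)
```

Sending `full` and `ablated` to the same model on the same evaluation set, then comparing `accuracy` for each, shows how much the retelling step matters for that dataset.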
Risks & Boundaries
Limitations
Evaluation covers 27 languages, which is a small slice of world languages.
The XLT template is written in English; performance with templates in task languages is untested.
When Not To Use
When strict native‑language phrasing must be preserved and any English retelling could change meaning (e.g., fine-grained paraphrase detection).
When API costs or latency make long, multi-step prompts impractical.
Failure Modes
Cross-lingual rephrasing can alter sentence meaning and hurt tasks that require subtle surface cues (observed drop on PAWS-X zero-shot).
Prompt effectiveness depends on exact instruction order and word choice; swapping steps or keywords can reduce gains.

