Overview
Production Readiness
0.75
Novelty Score
0.45
Cost Impact Score
0.85
Citation Count
8
Why It Matters For Business
XLT is a low-cost way to lift non-English performance and narrow cross-language gaps without retraining models, making multilingual features cheaper and faster to deploy.
Summary TLDR
The authors introduce Cross-Lingual-Thought (XLT), a language-independent prompt template that asks an LLM to retell the input in English, analyze the task, solve it step by step (chain-of-thought), and format the output. XLT is a zero- and few-shot prompting recipe. Evaluated on 7 multilingual benchmarks (27 languages) spanning reasoning, understanding, and generation, XLT often raises accuracy/F1/ROUGE/BLEU compared to basic prompts. Biggest wins: arithmetic reasoning (MGSM) and open-domain QA (MKQA), where XLT adds roughly 11 or more points on average for some models. XLT also narrows the performance gap across languages, as measured by a higher democratization score on several tasks. The method is cheap to try: no fine
Problem Statement
Large language models work better in English than in many other languages. This leads to uneven accuracy, with the worst results in low-resource languages. The paper asks: can a single, language-independent prompt unlock cross‑lingual reasoning and reduce performance gaps across many tasks without retraining models?
Main Contribution
- Design of XLT, a six-step, language-independent prompt template that elicits cross-lingual rephrasing (retell in English), task analysis, stepwise solving, and strict output formatting.
- Comprehensive evaluation on seven multilingual benchmarks (MGSM, XCOPA, XNLI, PAWS-X, MKQA, XL-Sum, FLORES) across 27 languages, showing consistent gains in zero- and few-shot settings.
- Ablations that identify Cross-lingual Thinking and CoT-style instructions as the most important parts, and guidance for constructing few-shot demonstrations aligned with XLT.
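As a rough illustration of the template idea, the sketch below assembles a prompt from the four elements named above (retell in English, task analysis, stepwise solving, output formatting). The function name `build_xlt_prompt`, its parameters, and the exact instruction wording are assumptions made for this example; they are not the paper's six-step template text.

```python
def build_xlt_prompt(task_input: str, task_language: str,
                     task_goal: str, output_format: str) -> str:
    """Assemble an XLT-style prompt.

    The instruction wording here is illustrative only; it is NOT the
    exact template from the paper.
    """
    steps = [
        f"Retell the {task_language} input in English.",
        f"Analyze the task: {task_goal}.",
        "Solve the task step by step.",
        f"Format the final output as: {output_format}",
    ]
    # Number the instructions so their order is explicit to the model.
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"Input:\n{task_input}\n\nFollow these steps:\n{numbered}"


if __name__ == "__main__":
    print(build_xlt_prompt(
        task_input="Roger tiene 5 pelotas y compra 2 más. ¿Cuántas tiene?",
        task_language="Spanish",
        task_goal="solve the arithmetic word problem",
        output_format="Answer: <number>",
    ))
```

The template text itself stays in English regardless of the task language; only `task_input` is in the source language.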
Key Findings
- XLT substantially improves arithmetic reasoning (MGSM) in zero-shot.
- XLT raises open-domain QA quality (MKQA F1) across languages.
- XLT reduces language performance disparity (democratization score) for many tasks.
- The Cross-lingual Thinking instruction is the largest single contributor.
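This summary does not state the formula for the democratization score. The sketch below assumes it is the average per-language score divided by the best per-language score, so that 1.0 means every language matches the best one. Treat that definition, and the name `democratization_score`, as assumptions to check against the paper before relying on the numbers.

```python
from typing import Mapping


def democratization_score(scores_by_language: Mapping[str, float]) -> float:
    """Average per-language score relative to the best language.

    Assumed definition: mean(scores) / max(scores). A value of 1.0 means
    all languages perform as well as the best one.
    """
    if not scores_by_language:
        raise ValueError("need at least one language score")
    best = max(scores_by_language.values())
    if best <= 0:
        raise ValueError("best score must be positive")
    total = sum(scores_by_language.values())
    return total / (len(scores_by_language) * best)


if __name__ == "__main__":
    # Made-up numbers, used only to show the calculation.
    print(democratization_score({"en": 80.0, "de": 60.0, "sw": 40.0}))
```

Computing this once with the basic prompt and once with the XLT prompt gives a single number per task for comparing language parity.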
Results
Accuracy
MKQA F1 (open-domain QA)
Accuracy
Democratization score (language parity)
Who Should Care
What To Try In 7 Days
- Take a failing non-English task and swap your basic prompt for the XLT template (retell in English + step-by-step + format).
- Build 3–5 few-shot demonstrations using XLT input + XLT output to test few-shot gains.
- Run an ablation: remove the 'retell in English' step to confirm sensitivity for your dataset.
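The few-shot and ablation steps above could be wired together roughly as follows. Everything here is a hypothetical harness: `XLT_STEPS`, `make_prompt`, `few_shot_prompt`, and `ablation_accuracy` are names invented for this sketch, the instruction wording is illustrative, and `ask_model` stands in for whatever function calls your model and returns its text reply.

```python
from typing import Callable, Sequence, Tuple

# Illustrative instruction wording; not the paper's exact template.
XLT_STEPS = {
    "retell": "Retell the input in English.",
    "analyze": "Analyze the task.",
    "solve": "Solve the task step by step.",
    "format": "Give the final answer in the format 'Answer: <value>'.",
}


def make_prompt(task_input: str, drop: Sequence[str] = ()) -> str:
    """Build the prompt, omitting any step names listed in `drop`."""
    lines = [text for name, text in XLT_STEPS.items() if name not in drop]
    return f"Input: {task_input}\n" + "\n".join(lines)


def few_shot_prompt(demos: Sequence[Tuple[str, str]], query: str) -> str:
    """Prefix the query with demos that pair an XLT input with an XLT-style output."""
    blocks = [f"{make_prompt(demo_input)}\n{demo_output}"
              for demo_input, demo_output in demos]
    return "\n\n".join(blocks + [make_prompt(query)])


def ablation_accuracy(examples: Sequence[Tuple[str, str]],
                      ask_model: Callable[[str], str],
                      drop: Sequence[str] = ()) -> float:
    """Exact-match accuracy with the given steps removed from the prompt."""
    correct = 0
    for task_input, gold in examples:
        reply = ask_model(make_prompt(task_input, drop))
        # Take whatever follows the last 'Answer:' marker as the prediction.
        predicted = reply.rsplit("Answer:", 1)[-1].strip()
        correct += predicted == gold
    return correct / len(examples)
```

Comparing `ablation_accuracy(examples, ask_model)` with `ablation_accuracy(examples, ask_model, drop=("retell",))` on the same examples indicates how much the English retelling step matters for a given dataset.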
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation covers 27 languages, a small fraction of the world's languages.
- The XLT template is written in English; performance with templates in task languages is untested.
- Experiments mainly use two GPT-3.5 models and one LLaMA-2 chat model; broader model generality is unverified.
When Not To Use
- When strict native‑language phrasing must be preserved and any English retelling could change meaning (e.g., fine-grained paraphrase detection).
- When API costs or latency make long, multi-step prompts impractical.
- When the model cannot follow multi-step instructions reliably (some smaller/chat-tuned models showed smaller gains).
Failure Modes
- Cross-lingual rephrasing can alter sentence meaning and hurt tasks that require subtle surface cues (observed drop on PAWS-X zero-shot).
- Prompt effectiveness depends on exact instruction order and word choice; swapping steps or keywords can reduce gains.
- Few-shot demonstrations must match XLT input-output style; mismatched demos can degrade performance.
Core Entities
Models
- text-davinci-003
- gpt-3.5-turbo
- LLaMA-2-Chat (Llama-2-70b-chat-hf)
- code-davinci-002 (referenced)
Metrics
- Accuracy
- F1
- ROUGE-1
- BLEU
- SacreBLEU
Datasets
- MGSM
- XCOPA
- XNLI
- PAWS-X
- MKQA
- XL-Sum
- FLORES-200
Benchmarks
- MGSM
- XCOPA
- XNLI
- PAWS-X
- MKQA
- XL-Sum
- FLORES

