XLT: a short, language-independent prompt template that boosts non‑English LLM performance

Overview

Decision Snapshot: Ready For Pilot

The idea is simple and cheap: a template prompt that rephrases input in English, instructs stepwise solving, and enforces output format. Evidence covers multiple models and seven benchmarks, but tests are limited to a few LLM families and 27 languages.

Citations: 8

Evidence Strength: 0.90

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 85%

Production readiness: 75%

Novelty: 45%

Authors

Haoyang Huang, Tianyi Tang, Dongdong Zhang, Wayne Xin Zhao, Ting Song, Yan Xia, Furu Wei


Why It Matters For Business

XLT is a low-cost way to lift non-English performance and narrow cross-language gaps without retraining models, making multilingual features cheaper and faster to deploy.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

The authors introduce Cross-Lingual-Thought (XLT), a language-independent prompt template that asks an LLM to retell the input in English, analyze the task, solve it step by step (chain-of-thought), and format the output. XLT works in both zero-shot and few-shot settings. Evaluated on seven multilingual benchmarks covering 27 languages across reasoning, understanding, and generation, XLT often raises accuracy, F1, ROUGE, and BLEU over basic prompts. The biggest wins are in arithmetic reasoning (MGSM) and open-domain QA (MKQA), where XLT adds more than 11 points on average on some models (for example, text-davinci-003 in zero-shot). XLT also narrows the performance gap across languages, as shown by a higher democratization score on several tasks. The method is cheap to try: no fine…
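The summary does not spell out how the democratization score is computed. A minimal sketch, assuming it is the average of each language's score expressed as a percentage of the best-scoring language, could look like the following. The function name and the per-language numbers are made up for illustration and are not results from the paper.

```python
def democratization_score(scores):
    """Average per-language score as a percentage of the best language.

    ASSUMPTION: this formula is one plausible reading of the metric,
    not a definition quoted from the source.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best = max(scores.values())
    if best <= 0:
        raise ValueError("best score must be positive")
    return 100 * sum(s / best for s in scores.values()) / len(scores)


# Toy values only: a wide gap between languages, then a narrower one.
before = {"en": 80.0, "de": 60.0, "sw": 40.0}
after = {"en": 80.0, "de": 72.0, "sw": 64.0}

print(round(democratization_score(before), 1))
print(round(democratization_score(after), 1))
```

Under this reading, a score of 100 would mean every language matches the best one, so a higher value indicates a narrower cross-language gap.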

Problem Statement

Large language models work better in English than in many other languages. This leads to uneven accuracy and worse results in low-resource languages. The paper asks: can a single, language-independent prompt unlock cross‑lingual reasoning and reduce performance gaps across many tasks without retraining models?

Main Contribution

Design of XLT, a six-step, language-independent prompt template that elicits cross-lingual rephrasing (retell in English), task analysis, stepwise solving, and strict output formatting.

Comprehensive evaluation on seven multilingual benchmarks (MGSM, XCOPA, XNLI, PAWS-X, MKQA, XL-Sum, FLORES) across 27 languages, showing consistent gains in zero- and few-shot settings.
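The template structure described above can be sketched as a small prompt builder. This is an illustrative approximation, not the authors' exact wording: the class name, field names, and instruction sentences are hypothetical, and the real template lives in the paper and its code release.

```python
from dataclasses import dataclass


@dataclass
class XLTRequest:
    """Fields for one prompt. All names here are hypothetical."""
    task_name: str          # e.g. "arithmetic reasoning"
    task_language: str      # language of the input, e.g. "German"
    task_input: str         # the raw, non-English input
    task_goal: str          # what the model should do with the input
    output_type: str        # label for the final answer, e.g. "Answer"
    output_constraint: str  # e.g. "a single number"


def build_xlt_prompt(req: XLTRequest) -> str:
    """Wrap a non-English input in six English instruction lines."""
    steps = [
        # 1. Role assignment
        f"I want you to act as a {req.task_name} expert for {req.task_language}.",
        # 2. Task input, left in its original language
        f"Request: {req.task_input}",
        # 3. Cross-lingual rephrasing
        "You should retell the request in English.",
        # 4. Task analysis
        f"You should {req.task_goal}.",
        # 5. Stepwise solving
        "You should answer the request step by step.",
        # 6. Output formatting
        f"You should give the {req.output_type} ({req.output_constraint}) "
        f"in this format: '{req.output_type}: <value>'.",
    ]
    return "\n".join(steps)


prompt = build_xlt_prompt(XLTRequest(
    task_name="arithmetic reasoning",
    task_language="German",
    task_input="Anna hat 12 Äpfel und kauft 7 weitere. Wie viele hat sie jetzt?",
    task_goal="work out the arithmetic needed to answer the request",
    output_type="Answer",
    output_constraint="a single number",
))
print(prompt)
```

Keeping the instructions in English while leaving the input untouched is what makes the template language-independent: only the `task_language` and `task_input` fields change between languages.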

Key Findings

XLT substantially improves arithmetic reasoning (MGSM) in zero-shot.

Numbers: text-davinci-003 MGSM zero-shot: 12.5 → 23.9 (+11.4)

Practical Use: Use XLT as a zero-shot prompt to get double-digit accuracy gains on arithmetic problems in non-English inputs without model changes.

Evidence Ref: Table 1

XLT raises open-domain QA quality (MKQA F1) across languages.

Numbers: text-davinci-003 MKQA F1 zero-shot: 29.0 → 40.2 (+11.2)

Practical Use: Apply XLT to multilingual QA tasks to improve short-answer F1 by around ten points on evaluated benchmarks.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	text-davinci-003: basic 12.5 → XLT 23.9	12.5 (Basic Prompt)	+11.4	MGSM (zero-shot, average over languages)	Table 1, Section 3.2	Table 1
MKQA F1 (open-domain QA)	text-davinci-003: basic 29.0 → XLT 40.2	29.0 (Basic Prompt)	+11.2	MKQA (zero-shot, average over languages)	Table 1, Section 3.2	Table 1
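The deltas in the table can be checked with a few lines of arithmetic. The sketch below recomputes the absolute gain from the reported baseline and XLT values and adds the relative gain, which the table does not report; the function and variable names are illustrative.

```python
# (baseline, XLT) pairs as reported for text-davinci-003, zero-shot.
results = {
    "MGSM accuracy": (12.5, 23.9),
    "MKQA F1": (29.0, 40.2),
}


def gains(baseline, xlt):
    """Return (absolute gain in points, relative gain in percent)."""
    delta = round(xlt - baseline, 1)
    relative = round(100 * (xlt - baseline) / baseline, 1)
    return delta, relative


for name, (base, xlt) in results.items():
    delta, rel = gains(base, xlt)
    print(f"{name}: +{delta} points ({rel}% relative)")
# MGSM accuracy: +11.4 points (91.2% relative)
# MKQA F1: +11.2 points (38.6% relative)
```

The two absolute gains are similar, but the relative gain on MGSM is much larger because its baseline is lower.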

What To Try In 7 Days

Take a failing non-English task and swap your basic prompt for the XLT template (retell in English + step-by-step + format).

Build 3–5 few-shot demonstrations using XLT input + XLT output to test few-shot gains.

Run an ablation: remove the 'retell in English' step to confirm sensitivity for your dataset.
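The few-shot and ablation steps above can be sketched with a small helper that builds prompt variants. The step wording, function names, and demonstration text are hypothetical placeholders rather than the paper's template; the model call and scoring are left out because they depend on your own stack.

```python
# Ordered instruction steps. The wording is illustrative only.
STEPS = {
    "role": "I want you to act as a {task} expert for {language}.",
    "input": "Request: {text}",
    "retell": "You should retell the request in English.",
    "solve": "You should answer the request step by step.",
    "format": "You should give the final answer in this format: 'Answer: <value>'.",
}


def make_prompt(text, task, language, drop=()):
    """Join the steps in order, skipping any step named in `drop`."""
    lines = [
        tpl.format(task=task, language=language, text=text)
        for name, tpl in STEPS.items()
        if name not in drop
    ]
    return "\n".join(lines)


def make_few_shot(demos, text, task, language):
    """Prepend (input, worked_output) demonstrations to the new query."""
    blocks = [make_prompt(d_in, task, language) + "\n" + d_out
              for d_in, d_out in demos]
    blocks.append(make_prompt(text, task, language))
    return "\n\n".join(blocks)


query = "Wie viel ist 12 mal 7?"
full = make_prompt(query, "arithmetic reasoning", "German")
ablated = make_prompt(query, "arithmetic reasoning", "German", drop=("retell",))

demos = [(
    "Wie viel ist 3 plus 4?",
    "Retelling: What is 3 plus 4?\nStep 1: 3 + 4 = 7.\nAnswer: 7",
)]
few_shot = make_few_shot(demos, query, "arithmetic reasoning", "German")
```

Scoring `full` against `ablated` on the same evaluation set shows how much the English retelling contributes for your data; the same `drop` argument can remove other steps.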

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/microsoft/unilm

Risks & Boundaries

Limitations

Evaluation covers 27 languages, which is a small slice of world languages.

The XLT template is written in English; performance with templates in task languages is untested.

When Not To Use

When strict native‑language phrasing must be preserved and any English retelling could change meaning (e.g., fine-grained paraphrase detection).

When API costs or latency make long, multi-step prompts impractical.

Failure Modes

Cross-lingual rephrasing can alter sentence meaning and hurt tasks that require subtle surface cues (observed drop on PAWS-X zero-shot).

Prompt effectiveness depends on exact instruction order and word choice; swapping steps or keywords can reduce gains.

Core Entities

Models

text-davinci-003, gpt-3.5-turbo, LLaMA-2-Chat (Llama-2-70b-chat-hf), code-davinci-002 (referenced)

Metrics

Accuracy, F1, ROUGE-1, BLEU, SacreBLEU

Datasets

MGSM, XCOPA, XNLI, PAWS-X, MKQA, XL-Sum, FLORES-200

Benchmarks

MGSM, XCOPA, XNLI, PAWS-X, MKQA, XL-Sum, FLORES

