XLT: a short, language-independent prompt template that boosts non‑English LLM performance

May 11, 2023 · 6 min

Overview

Decision Snapshot: Ready For Pilot

The idea is simple and cheap: a template prompt that rephrases input in English, instructs stepwise solving, and enforces output format. Evidence covers multiple models and seven benchmarks, but tests are limited to a few LLM families and 27 languages.

Citations: 8

Evidence Strength: 0.90

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 85%

Production readiness: 75%

Novelty: 45%

Authors

Haoyang Huang, Tianyi Tang, Dongdong Zhang, Wayne Xin Zhao, Ting Song, Yan Xia, Furu Wei

Links

Abstract / PDF / Code

Why It Matters For Business

XLT is a low-cost way to lift non-English performance and narrow cross-language gaps without retraining models, making multilingual features cheaper and faster to deploy.

Who Should Care

Summary TLDR

The authors introduce Cross-Lingual-Thought (XLT), a language-independent prompt template that asks an LLM to retell the input in English, analyze the task, solve step-by-step (chain-of-thought), and format the output. XLT is a zero- and few-shot prompting recipe. Evaluated on 7 multilingual benchmarks (27 languages) across reasoning, understanding and generation, XLT often raises accuracy/F1/ROUGE/BLEU compared to basic prompts. Biggest wins: arithmetic reasoning (MGSM) and open-domain QA (MKQA), where XLT adds ~11+ points on average on some models. XLT also narrows the performance gap across languages by raising a democratization score in several tasks. The method is cheap to try — no fine

Problem Statement

Large language models work better in English than many other languages. This creates uneven accuracy and worse results in low-resource languages. The paper asks: can a single, language-independent prompt unlock cross‑lingual reasoning and reduce performance gaps across many tasks without retraining models?
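The summary mentions a democratization score for the gap between languages but does not define it here. The sketch below assumes one plausible reading: the average of each language's score divided by the best language's score, so 1.0 means every language matches the best one. The function name and the sample numbers are hypothetical and are not taken from the paper.

```python
def democratization_score(scores):
    """Mean of per-language score / best-language score (assumed definition).

    scores: dict mapping a language code to a task metric (accuracy, F1, ...).
    Returns a value in (0, 1]; higher means a smaller cross-language gap.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best = max(scores.values())
    if best <= 0:
        raise ValueError("best score must be positive")
    return sum(value / best for value in scores.values()) / len(scores)


# Hypothetical per-language accuracies, for illustration only.
example = {"en": 80.0, "de": 60.0, "sw": 40.0}
print(democratization_score(example))  # (1.0 + 0.75 + 0.5) / 3 = 0.75
```

Computing this once with a basic prompt and once with XLT gives a single number for whether the gap narrowed on your own data.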

Main Contribution

Design of XLT, a six-step, language-independent prompt template that elicits cross-lingual rephrasing (retell in English), task analysis, stepwise solving, and strict output formatting.

Comprehensive evaluation on seven multilingual benchmarks (MGSM, XCOPA, XNLI, PAWS-X, MKQA, XL-Sum, FLORES) across 27 languages, showing consistent gains in zero- and few-shot settings.
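A minimal sketch of how a template of this shape could be assembled and its output parsed. The instruction wording, the role and input lines, and the names `build_xlt_prompt` and `parse_output` are assumptions for illustration; this is not the paper's exact template text.

```python
import re

# Illustrative wording only; the exact template text is in the paper.
XLT_STEPS = [
    "I want you to act as a {task} expert for {language}.",  # role (assumed)
    "{input_tag}: {task_input}",                              # task input (assumed)
    "You should retell the {input_tag} in English.",          # cross-lingual rephrasing
    "You should {task_goal}.",                                # task analysis
    "You should answer the request step by step.",            # stepwise solving
    "You should give the {output_type} in this format "
    "'{output_type}: <value>'.",                              # output formatting
]


def build_xlt_prompt(task, language, input_tag, task_input, task_goal, output_type):
    """Fill the six instruction lines and join them into one prompt string."""
    fields = {
        "task": task,
        "language": language,
        "input_tag": input_tag,
        "task_input": task_input,
        "task_goal": task_goal,
        "output_type": output_type,
    }
    return "\n".join(step.format(**fields) for step in XLT_STEPS)


def parse_output(response, output_type):
    """Return the text after the last '<output_type>:' line, or None if absent."""
    pattern = re.compile(
        rf"^{re.escape(output_type)}:\s*(.+)$", re.MULTILINE | re.IGNORECASE
    )
    matches = pattern.findall(response)
    return matches[-1].strip() if matches else None


prompt = build_xlt_prompt(
    task="arithmetic reasoning",
    language="Chinese",
    input_tag="request",
    task_input="罗杰有5个网球。他又买了2罐网球，每罐3个。他现在有多少个网球？",
    task_goal="solve the math problem",
    output_type="answer",
)
print(prompt)
```

Keeping the output line machine-parseable is what lets the same scoring script run across all languages.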

Key Findings

XLT substantially improves arithmetic reasoning (MGSM) in zero-shot.

Numbers: text-davinci-003 MGSM zero-shot: 12.5 → 23.9 (+11.4)

Practical Use: Use XLT as a zero-shot prompt to get double-digit accuracy gains on arithmetic problems in non-English inputs without model changes.

Evidence Ref: Table 1

XLT raises open-domain QA quality (MKQA F1) across languages.

Numbers: text-davinci-003 MKQA F1 zero-shot: 29.0 → 40.2 (+11.2)

Practical Use: Apply XLT to multilingual QA tasks to improve short-answer F1 by around ten points on evaluated benchmarks.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | text-davinci-003: basic 12.5 → XLT 23.9 | 12.5 (Basic Prompt) | +11.4 | MGSM (zero-shot, average over languages) | Table 1, Section 3.2 | Table 1
MKQA F1 (open-domain QA) | text-davinci-003: basic 29.0 → XLT 40.2 | 29.0 (Basic Prompt) | +11.2 | MKQA (zero-shot, average over languages) | Table 1, Section 3.2 | Table 1
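The deltas in the two rows above can be re-derived from the basic and XLT scores. The sketch below also reports a relative gain, which is computed here for context and is not a figure from the paper; the helper name `gain` is hypothetical.

```python
# (basic prompt score, XLT score) for text-davinci-003, zero-shot, as reported above.
REPORTED = {
    "MGSM accuracy": (12.5, 23.9),
    "MKQA F1": (29.0, 40.2),
}


def gain(basic, xlt):
    """Return (absolute delta in points, relative gain in percent), rounded to 1 decimal."""
    delta = xlt - basic
    return round(delta, 1), round(100 * delta / basic, 1)


for name, (basic, xlt) in REPORTED.items():
    delta, relative = gain(basic, xlt)
    print(f"{name}: +{delta} points ({relative}% relative)")
# MGSM accuracy: +11.4 points (91.2% relative)
# MKQA F1: +11.2 points (38.6% relative)
```

The absolute gains are nearly equal, but the relative gain is much larger on MGSM because its baseline is lower.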

What To Try In 7 Days

Take a failing non-English task and swap your basic prompt for the XLT template (retell in English + step-by-step + format).

Build 3–5 few-shot demonstrations using XLT input + XLT output to test few-shot gains.

Run an ablation: remove the 'retell in English' step to confirm sensitivity for your dataset.
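The few-shot and ablation steps above can be prototyped before any model call is made. The sketch below only builds prompt strings; the instruction wording and all function names are illustrative assumptions, and sending the prompts to a model and scoring the replies is left to whatever client you already use.

```python
# Illustrative instruction lines keyed by step name (wording is assumed).
STEPS = {
    "retell": "You should retell the request in English.",
    "solve": "You should answer the request step by step.",
    "format": "You should give the answer in this format 'Answer: <value>'.",
}


def make_prompt(request, skip=()):
    """Build one prompt, omitting any step whose name is listed in `skip`."""
    lines = [f"Request: {request}"]
    lines += [text for name, text in STEPS.items() if name not in skip]
    return "\n".join(lines)


def make_few_shot(demos, request, skip=()):
    """Prepend (request, worked_output) demonstrations to the final prompt."""
    blocks = [f"{make_prompt(question, skip)}\n{output}" for question, output in demos]
    blocks.append(make_prompt(request, skip))
    return "\n\n".join(blocks)


def ablation_variants(request):
    """Return the full prompt and a variant without the English retelling step."""
    return {
        "full": make_prompt(request),
        "no_retell": make_prompt(request, skip=("retell",)),
    }


variants = ablation_variants("¿Cuánto es 2 + 3?")
for name, prompt in variants.items():
    print(f"--- {name} ---\n{prompt}\n")
```

Running both variants over the same held-out examples and comparing the scores shows how much of the gain on your dataset comes from the retelling step.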

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation covers 27 languages, which is a small slice of world languages.

The XLT template is written in English; performance with templates in task languages is untested.

When Not To Use

When strict native‑language phrasing must be preserved and any English retelling could change meaning (e.g., fine-grained paraphrase detection).

When API costs or latency make long, multi-step prompts impractical.

Failure Modes

Cross-lingual rephrasing can alter sentence meaning and hurt tasks that require subtle surface cues (observed drop on PAWS-X zero-shot).

Prompt effectiveness depends on exact instruction order and word choice; swapping steps or keywords can reduce gains.
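Since gains depend on instruction order, it may be worth checking how sensitive your own task is before fixing a template. The sketch below enumerates every ordering of three illustrative instruction lines; the wording and the name `order_variants` are assumptions, and the evaluation loop is omitted.

```python
from itertools import permutations

# Illustrative instruction lines (wording is assumed, not the paper's text).
INSTRUCTIONS = (
    "You should retell the request in English.",
    "You should answer the request step by step.",
    "You should give the answer in this format 'Answer: <value>'.",
)


def order_variants(request, instructions=INSTRUCTIONS):
    """Return one prompt per ordering of the instruction lines."""
    return [
        "\n".join([f"Request: {request}", *order])
        for order in permutations(instructions)
    ]


prompts = order_variants("¿Cuánto es 2 + 3?")
print(len(prompts))  # 6 orderings of 3 instructions
print(prompts[0])    # the first variant keeps the original order
```

With more instruction lines the number of orderings grows factorially, so in practice you would sample a few swaps rather than test them all.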

Core Entities

Models

text-davinci-003, gpt-3.5-turbo, LLaMA-2-Chat (Llama-2-70b-chat-hf), code-davinci-002 (referenced)

Metrics

Accuracy, F1, ROUGE-1, BLEU, SacreBLEU

Datasets

MGSM, XCOPA, XNLI, PAWS-X, MKQA, XL-Sum, FLORES-200

Benchmarks

MGSM, XCOPA, XNLI, PAWS-X, MKQA, XL-Sum, FLORES