XLT: a short, language-independent prompt template that boosts non‑English LLM performance

May 11, 20236 min

Overview

Production Readiness

0.75

Novelty Score

0.45

Cost Impact Score

0.85

Citation Count

8

Authors

Haoyang Huang, Tianyi Tang, Dongdong Zhang, Wayne Xin Zhao, Ting Song, Yan Xia, Furu Wei

Links

Abstract / PDF

Why It Matters For Business

XLT is a low-cost way to lift non-English performance and narrow cross-language gaps without retraining models, making multilingual features cheaper and faster to deploy.

Summary TLDR

The authors introduce Cross-Lingual-Thought (XLT), a language-independent prompt template that asks an LLM to retell the input in English, analyze the task, solve step-by-step (chain-of-thought), and format the output. XLT is a zero- and few-shot prompting recipe. Evaluated on 7 multilingual benchmarks (27 languages) across reasoning, understanding and generation, XLT often raises accuracy/F1/ROUGE/BLEU compared to basic prompts. Biggest wins: arithmetic reasoning (MGSM) and open-domain QA (MKQA), where XLT adds ~11+ points on average on some models. XLT also narrows the performance gap across languages by raising a democratization score in several tasks. The method is cheap to try — no fine

Problem Statement

Large language models work better in English than many other languages. This creates uneven accuracy and worse results in low-resource languages. The paper asks: can a single, language-independent prompt unlock cross‑lingual reasoning and reduce performance gaps across many tasks without retraining models?

Main Contribution

Design of XLT, a six-step, language-independent prompt template that elicits cross-lingual rephrasing (retell in English), task analysis, stepwise solving, and strict output formatting.

Comprehensive evaluation on seven multilingual benchmarks (MGSM, XCOPA, XNLI, PAWS-X, MKQA, XL-Sum, FLORES) across 27 languages, showing consistent gains in zero- and few-shot settings.

Ablations that identify Cross-lingual Thinking and CoT-style instructions as the most important parts, and guidance for constructing few-shot demonstrations aligned with XLT.

Key Findings

XLT substantially improves arithmetic reasoning (MGSM) in zero-shot.

Numberstext-davinci-003 MGSM zero-shot: 12.5 → 23.9 (+11.4)

XLT raises open-domain QA quality (MKQA F1) across languages.

Numberstext-davinci-003 MKQA F1 zero-shot: 29.0 → 40.2 (+11.2)

XLT reduces language performance disparity (democratization score) for many tasks.

NumbersMKQA democratization zero-shot (text-davinci-003): 60.2 → 78.7 (+18.5)

The Cross-lingual Thinking instruction is the largest single contributor.

NumbersAblation (gpt-3.5-turbo MGSM zh) XLT 72.6 → w/o Cross-lingual Thinking 62.0 (−10.6)

Results

Accuracy

Valuetext-davinci-003: basic 12.5 → XLT 23.9

Baseline12.5 (Basic Prompt)

MKQA F1 (open-domain QA)

Valuetext-davinci-003: basic 29.0 → XLT 40.2

Baseline29.0 (Basic Prompt)

Accuracy

Valuetext-davinci-003: basic 53.3 → XLT 62.4

Baseline53.3 (Basic Prompt)

Democratization score (language parity)

ValueMKQA text-davinci-003: basic 60.2 → XLT 78.7

Baseline60.2 (Basic Prompt)

Who Should Care

What To Try In 7 Days

Take a failing non-English task and swap your basic prompt for the XLT template (retell in English + step-by-step + format).

Build 3–5 few-shot demonstrations using XLT input + XLT output to test few-shot gains.

Run an ablation: remove the 'retell in English' step to confirm sensitivity for your dataset.

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation covers 27 languages, which is a small slice of world languages.
  • The XLT template is written in English; performance with templates in task languages is untested.
  • Experiments mainly use two GPT-3.5 models and one LLaMA-2 chat model; broader model generality is unverified.

When Not To Use

  • When strict native‑language phrasing must be preserved and any English retelling could change meaning (e.g., fine-grained paraphrase detection).
  • When API costs or latency make long, multi-step prompts impractical.
  • When the model cannot follow multi-step instructions reliably (some smaller/chat-tuned models showed smaller gains).

Failure Modes

  • Cross-lingual rephrasing can alter sentence meaning and hurt tasks that require subtle surface cues (observed drop on PAWS-X zero-shot).
  • Prompt effectiveness depends on exact instruction order and word choice; swapping steps or keywords can reduce gains.
  • Few-shot demonstrations must match XLT input-output style; mismatched demos can degrade performance.

Core Entities

Models

  • text-davinci-003
  • gpt-3.5-turbo
  • LLaMA-2-Chat (Llama-2-70b-chat-hf)
  • code-davinci-002 (referenced)

Metrics

  • Accuracy
  • F1
  • ROUGE-1
  • BLEU
  • SacreBLEU

Datasets

  • MGSM
  • XCOPA
  • XNLI
  • PAWS-X
  • MKQA
  • XL-Sum
  • FLORES-200

Benchmarks

  • MGSM
  • XCOPA
  • XNLI
  • PAWS-X
  • MKQA
  • XL-Sum
  • FLORES