Use translation + instruction tuning to make English LLMs much better in six non‑English languages

Overview

Decision Snapshot: Needs Validation

Results are measured on public benchmarks against multiple baselines, but the experiments cover only LLaMA-7B and six languages, and part of the evaluation relies on ChatGPT as a judge, which can introduce bias. The evidence is solid within that scope but has not been shown to generalize.

Citations: 7

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Wenhao Zhu, Yunzhe Lv, Qingxiu Dong, Fei Yuan, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, Lei Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can upgrade an English LLM to handle multiple non-English languages without huge data or retraining costs by adding parallel translation tasks and translated instructions; this saves time and compute compared to building language-specific models from scratch.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The authors show you can boost a pre-trained English LLaMA-7B model on non-English tasks by instruction‑tuning it with two kinds of bilingual data: (1) translated general instruction examples and (2) parallel translation pairs. Language-specific models (x-LLaMA) gain large QA and translation improvements; a single multilingual model (m-LLaMA) matches language-specific ones. They fit a simple scaling law that links translation performance to parallel data size and use it to allocate limited parallel data more efficiently.

Problem Statement

Pretrained LLMs are English-dominant and underperform on many non-English languages. Training separate models or heavy continued pretraining is costly. Can we extrapolate English LLM ability to other languages cheaply by aligning languages during instruction-tuning?

Main Contribution

Define cross-lingual instruction-tuning (CoIT): mix translated instruction examples and parallel translation tasks to align English and a target language.

Define multilingual instruction-tuning (MuIT): mix resources for many languages to get a single multilingual LLaMA.
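The data mixing behind CoIT can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the prompt template, the record fields (`instruction`, `input`, `output`), the helper names, the English-to-target translation direction, and the toy examples are all assumptions made for the sketch.

```python
import random

# Hypothetical prompt template; the paper's exact wording is not given here.
TRANSLATE_PROMPT = "Translate the following sentence from {src} to {tgt}."

def translation_task(src_text, tgt_text, src="English", tgt="Chinese"):
    """Turn one parallel sentence pair into an instruction-style record."""
    return {
        "instruction": TRANSLATE_PROMPT.format(src=src, tgt=tgt),
        "input": src_text,
        "output": tgt_text,
    }

def build_coit_mix(english_instr, translated_instr, parallel_pairs, seed=0):
    """Combine all three data sources and shuffle them into one training list."""
    records = list(english_instr) + list(translated_instr)
    records += [translation_task(s, t) for s, t in parallel_pairs]
    random.Random(seed).shuffle(records)  # fixed seed keeps the mix reproducible
    return records

# Toy data for illustration only.
english_instr = [{"instruction": "Name a primary color.", "input": "", "output": "Red."}]
translated_instr = [{"instruction": "说出一种原色。", "input": "", "output": "红色。"}]
parallel_pairs = [("Good morning.", "早上好。")]

mix = build_coit_mix(english_instr, translated_instr, parallel_pairs)
```

For a MuIT-style mix, the same function could be called once per target language and the resulting lists concatenated before shuffling.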

Key Findings

Cross-lingual instruction tuning (x-LLaMA) substantially improves non-English QA accuracy over an English-only instruction-tuned model (Alpaca-7B).

Numbers: Average +27.83% answer accuracy across six languages (XQUAD & MLQA)

Practical Use: If you need better QA in a target language, add parallel translation tasks plus translated instruction examples to instruction-tune an English LLM instead of full re-training.

Evidence Ref: Main text, Abstract and Table 2

x-LLaMA also raises translation quality over previous LLaMA-based models.

Numbers: Average +18.89% (FLORES-101, COMET) vs prior LLaMA baselines

Practical Use: Use translation task instruction data to get much better translation from a tuned LLaMA without heavy supervised MT models or continued pretraining.

Evidence Ref: Abstract; §5.1 and Figure 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	x-LLaMA average across 6 languages: +27.83% vs Alpaca-7B	Alpaca-7B (English instruction-tuned)	+27.83%	XQUAD & MLQA (six languages: Ar, El, Hi, Tr, Vi, Zh)	Table 2 and Abstract	Table 2
Translation quality (COMET on FLORES-101)	x-LLaMA outperforms prior LLaMA-based models by average +18.89%	Previous LLaMA-based models (e.g., Bayling, Parrot, Bigtrans)	+18.89%	FLORES-101	Abstract; §5.1 and Figure 2	Figure 2 and main text

What To Try In 7 Days

Take your English instruction-tuned LLM, add a small parallel corpus for one target language (for example, the open WIKIMATRIX or NEWSCOMMENTARY corpora), and instruction-tune on it.

Translate your existing instruction dataset into the target language and include both English and translated pairs during tuning.

Measure translation quality (COMET) and QA accuracy before/after to verify gains; use scaling-law curves to estimate returns from more parallel data.
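To estimate returns from more parallel data, one option is to fit a curve to a few measured points and extrapolate. The sketch below assumes a generic saturating power law, score = 100 - a * size^(-b), which may differ from the form fitted in the paper. The function names and the sample measurements are made up for illustration and are not results from the paper.

```python
import math

def fit_power_law(sizes, scores, ceiling=100.0):
    """Least-squares fit of log(ceiling - score) = log(a) - b * log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(ceiling - s) for s in scores]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope  # b is the negated slope of the log-log line

def predict(size, a, b, ceiling=100.0):
    """Predicted score for a given number of parallel sentence pairs."""
    return ceiling - a * size ** (-b)

# Synthetic measurements generated from the assumed law (illustration only).
sizes = [1e4, 1e5, 1e6]
scores = [predict(n, a=200.0, b=0.2) for n in sizes]

a, b = fit_power_law(sizes, scores)
gain_from_10x = predict(1e7, a, b) - predict(1e6, a, b)
```

Comparing `gain_from_10x` across languages gives a rough way to decide where an extra batch of parallel data is likely to help most, provided the fitted curve actually tracks your own measurements.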

Optimization Features

Token Efficiency

no vocabulary extension (uses byte-level tokenization)

Infra Optimization

8x A100 training configuration

Training Optimization

full-parameter instruction tuning

use of FSDP for training scale

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/NJUNLP/x-LLM

https://arxiv.org/pdf/2308.04948v2

Data URLs

https://opus.nlpl.eu/News-Commentary.php (NEWSCOMMENTARY)

https://github.com/facebookresearch/LASER/tree/main/tasks/WikiMatrix (WIKIMATRIX)

https://github.com/tatsu-lab/stanford_alpaca (ALPACA)

https://github.com/facebookresearch/flores (FLORES-101)

XQUAD / MLQA (public benchmarks)

Risks & Boundaries

Limitations

Experiments run on LLaMA-7B only; transfer to larger/smaller models is not shown.

Improvement depends on available parallel data; distant languages need more data per scaling law.

When Not To Use

When you already have large monolingual corpora and prefer vocabulary extension and heavy continued pretraining.

When low-latency tokenization is critical and byte-level tokenization overhead is unacceptable.

Failure Modes

Model may generate English answers even for non-English instructions (mixing languages).

For languages with very low similarity to English, alignment may need large parallel corpora and still lag.
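The language-mixing failure mode can be spot-checked with a simple script heuristic. This is a rough sketch under stated assumptions, not a method from the paper: the function names and the 0.5 threshold are hypothetical, and the check only makes sense for target languages written in a non-Latin script (Arabic, Greek, Hindi, Chinese). It cannot separate English from Turkish or Vietnamese, which use Latin letters.

```python
import unicodedata

def latin_letter_ratio(text):
    """Share of alphabetic characters whose Unicode name marks them as Latin."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for ch in letters if "LATIN" in unicodedata.name(ch, ""))
    return latin / len(letters)

def looks_english(text, threshold=0.5):
    """Flag a response as probably English when most letters are Latin."""
    return latin_letter_ratio(text) > threshold

# Example responses to a non-English prompt (illustrative strings only).
responses = ["The capital is Paris.", "首都是巴黎。"]
flags = [looks_english(r) for r in responses]
```

Running this over model outputs for a non-Latin-script test set gives a quick estimate of how often the model answers in the wrong language; a proper language-identification model would be more reliable.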

Core Entities

Models

LLaMA-7B, x-LLaMA-7B, m-LLaMA-7B

Metrics

COMET, BLEURT, BLEU, Exact Match, ChatGPT quality eval

Datasets

WIKIMATRIX, NEWSCOMMENTARY, ALPACA, FLORES-101, XQUAD, MLQA, MI-EVAL

Benchmarks

XQUAD, MLQA, FLORES-101

Context Entities

Models

Alpaca-7B, Parrot-7B, Bayling-7B, Chinese-Alpaca-7B, Bigtrans-13B, M2M-12B, NLLB-1.3B, ChatGPT, Google Translate

Metrics

WMT COMET models (wmt22-comet-da), BLEURT-20

Datasets

mC4 (used in other continued pretraining baselines)

Benchmarks

Human-supervised MT systems (for reference in comparisons)
