Use translation + instruction tuning to make English LLMs much better in six non‑English languages

August 9, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

Results are measured on public benchmarks against multiple baselines, but the experiments cover only LLaMA-7B and six languages, and the evaluation uses ChatGPT, which can introduce bias. The evidence is solid but not universally proven.

Citations: 7

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Wenhao Zhu, Yunzhe Lv, Qingxiu Dong, Fei Yuan, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, Lei Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can upgrade an English LLM to handle multiple non-English languages without large data or retraining costs by adding parallel translation tasks and translated instructions to instruction tuning. This saves time and compute compared with building language-specific models from scratch.

Who Should Care

Summary TLDR

The authors show you can boost a pre-trained English LLaMA-7B model on non-English tasks by instruction‑tuning it with two kinds of bilingual data: (1) translated general instruction examples and (2) parallel translation pairs. Language-specific models (x-LLaMA) gain large QA and translation improvements; a single multilingual model (m-LLaMA) matches language-specific ones. They fit a simple scaling law that links translation performance to parallel data size and use it to allocate limited parallel data more efficiently.

Problem Statement

Pretrained LLMs are English-dominant and underperform on many non-English languages. Training separate models or heavy continued pretraining is costly. Can we extrapolate English LLM ability to other languages cheaply by aligning languages during instruction-tuning?

Main Contribution

Define cross-lingual instruction-tuning (CoIT): mix translated instruction examples and parallel translation tasks to align English and a target language.

Define multilingual instruction-tuning (MuIT): mix resources for many languages to get a single multilingual LLaMA.
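
The two data recipes above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the Alpaca-style record fields, the prompt wording in `TRANSLATION_PROMPT`, the English-to-target translation direction, and the function names are all assumptions made for this example.

```python
import random

# Hypothetical prompt wording; the source does not specify a template.
TRANSLATION_PROMPT = "Translate the following sentence from {src} to {tgt}."


def translation_examples(parallel_pairs, src_name, tgt_name):
    """Turn (source, target) sentence pairs into instruction-style records."""
    return [
        {
            "instruction": TRANSLATION_PROMPT.format(src=src_name, tgt=tgt_name),
            "input": src,
            "output": tgt,
        }
        for src, tgt in parallel_pairs
    ]


def build_coit_mix(english_instr, translated_instr, parallel_pairs,
                   tgt_name="Chinese", seed=0):
    """CoIT-style mix: English + translated instructions + translation tasks."""
    mix = list(english_instr) + list(translated_instr)
    mix += translation_examples(parallel_pairs, "English", tgt_name)
    random.Random(seed).shuffle(mix)  # seeded so the mix is reproducible
    return mix


def build_muit_mix(english_instr, resources, seed=0):
    """MuIT-style mix: pool every language's resources into one training set.

    resources maps a language name to (translated_instr, parallel_pairs).
    The English instructions are added once, not once per language.
    """
    mix = list(english_instr)
    for lang, (translated_instr, parallel_pairs) in resources.items():
        mix += list(translated_instr)
        mix += translation_examples(parallel_pairs, "English", lang)
    random.Random(seed).shuffle(mix)
    return mix
```

The resulting list of records can be fed to whatever instruction-tuning loop is already in use. The ratio between instruction examples and translation examples is left to the caller here.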

Key Findings

Cross-lingual instruction tuning (x-LLaMA) substantially improves non-English QA accuracy over an English-only instruction-tuned model.

Numbers: Average +27.83% answer accuracy across six languages (XQUAD & MLQA)

Practical Use: If you need better QA in a target language, instruction-tune an English LLM on parallel translation tasks plus translated instruction examples instead of retraining from scratch.

Evidence Ref: Main text, Abstract and Table 2

x-LLaMA also raises translation quality over previous LLaMA-based models.

Numbers: Average +18.89% (FLORES-101, COMET) vs prior LLaMA baselines

Practical Use: Use translation-task instruction data to get much better translation from a tuned LLaMA without heavily supervised MT models or continued pretraining.

Evidence Ref: Abstract; §5.1 and Figure 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | x-LLaMA average across 6 languages: +27.83% vs Alpaca-7B | Alpaca-7B (English instruction-tuned) | +27.83% | XQUAD & MLQA (six languages: Ar, El, Hi, Tr, Vi, Zh) | Table 2 and Abstract | Table 2
Translation quality (COMET on FLORES-101) | x-LLaMA outperforms prior LLaMA-based models by average +18.89% | Previous LLaMA-based models (e.g., Bayling, Parrot, Bigtrans) | +18.89% | FLORES-101 | Abstract; §5.1 and Figure 2 | Figure 2 and main text

What To Try In 7 Days

Take your English instruction-tuned LLM, add a small parallel corpus for a target language (for example, the open WIKIMATRIX or NEWSCOMMENTARY data), and instruction-tune on it.

Translate your existing instruction dataset into the target language and include both English and translated pairs during tuning.

Measure translation quality (COMET) and QA accuracy before/after to verify gains; use scaling-law curves to estimate returns from more parallel data.
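
A before/after QA check can be as small as the sketch below. It assumes a simple normalized exact-match score as the accuracy measure, which is a stand-in chosen for this example. The normalization rules and function names are also assumptions, and COMET scoring is not covered.

```python
import string
import unicodedata


def normalize(text):
    """Apply NFKC, lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match_accuracy(predictions, references):
    """Percentage of predictions equal to any reference answer after normalization.

    references holds one list of acceptable answers per prediction.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        normalize(pred) in {normalize(ref) for ref in refs}
        for pred, refs in zip(predictions, references)
    )
    return 100.0 * hits / len(predictions)


def accuracy_delta(before_preds, after_preds, references):
    """Absolute change in exact-match accuracy, in percentage points."""
    return (exact_match_accuracy(after_preds, references)
            - exact_match_accuracy(before_preds, references))
```

Run the same question set through the model before and after tuning and pass both prediction lists to `accuracy_delta`. Only ASCII punctuation is stripped here, so language-specific punctuation would need extra handling.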

Optimization Features

Token Efficiency
no vocabulary extension (relies on the LLaMA tokenizer's byte-level fallback)
Infra Optimization
8x A100 training configuration
Training Optimization
full-parameter instruction tuning
use of FSDP for training scale

Risks & Boundaries

Limitations

Experiments run on LLaMA-7B only; transfer to larger/smaller models is not shown.

Improvement depends on available parallel data; distant languages need more data per scaling law.
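
To get a rough feel for how much parallel data a language might need, one option is to fit a curve to a few (data size, score) measurements and invert it. The sketch below assumes a log-linear form, score = a + b * ln(size), purely as a stand-in: this summary does not give the functional form of the paper's scaling law, and the function names are hypothetical.

```python
import math


def fit_log_linear(sizes, scores):
    """Least-squares fit of score = a + b * ln(size). Returns (a, b).

    Assumed form for illustration only; not the paper's scaling law.
    """
    if len(sizes) != len(scores) or len(sizes) < 2:
        raise ValueError("need at least two (size, score) points")
    xs = [math.log(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, size):
    """Score predicted by the fitted curve at a given parallel-data size."""
    return a + b * math.log(size)


def pairs_needed(a, b, target_score):
    """Invert the fit: data size at which the curve reaches target_score."""
    if b <= 0:
        raise ValueError("fit shows no gain from more data")
    return math.exp((target_score - a) / b)
```

Fitting one curve per language and comparing the slopes `b` gives a crude way to decide where extra parallel data is likely to pay off most. Extrapolating far beyond the measured sizes should be treated with caution.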

When Not To Use

When you already have large monolingual corpora and prefer vocabulary extension and heavy continued pretraining.

When low-latency tokenization is critical and byte-level tokenization overhead is unacceptable.

Failure Modes

Model may generate English answers even for non-English instructions (mixing languages).

For languages with very low similarity to English, alignment may need large parallel corpora and still lag.
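
The language-mixing failure can be screened for with a cheap script check, sketched below. It is a heuristic under stated assumptions: it only covers the four languages among the six that use a non-Latin script, the 0.5 threshold is arbitrary, and the names `SCRIPT_PREFIX`, `script_ratio`, and `looks_off_target` are made up for this example.

```python
import unicodedata

# Unicode character-name prefixes for the non-Latin-script languages.
# Turkish and Vietnamese use Latin script, so this check cannot cover them.
SCRIPT_PREFIX = {
    "ar": "ARABIC",
    "el": "GREEK",
    "hi": "DEVANAGARI",
    "zh": "CJK",
}


def script_ratio(text, lang):
    """Fraction of alphabetic characters written in the expected script."""
    prefix = SCRIPT_PREFIX[lang]
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(unicodedata.name(ch, "").startswith(prefix) for ch in letters)
    return in_script / len(letters)


def looks_off_target(text, lang, threshold=0.5):
    """True when most letters are outside the target script (or there are none)."""
    return script_ratio(text, lang) < threshold
```

Counting flagged answers over an evaluation set gives a rough rate of English or mixed-language responses. Answers that legitimately contain many Latin-script names or code will be flagged too, so the output is best used for triage.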

Core Entities

Models

LLaMA-7B, x-LLaMA-7B, m-LLaMA-7B

Metrics

COMET, BLEURT, BLEU, Exact Match, ChatGPT quality eval

Datasets

WIKIMATRIX, NEWSCOMMENTARY, ALPACA, FLORES-101, XQUAD, MLQA, MI-EVAL

Benchmarks

XQUAD, MLQA, FLORES-101

Context Entities

Models

Alpaca-7B, Parrot-7B, Bayling-7B, Chinese-Alpaca-7B, Bigtrans-13B, M2M-12B, NLLB-1.3B, ChatGPT, Google Translate

Metrics

WMT COMET models (wmt22-comet-da), BLEURT-20

Datasets

mC4 (used in other continued pretraining baselines)

Benchmarks

Human-supervised MT systems (for reference in comparisons)