Use translation + instruction tuning to make English LLMs much better in six non‑English languages

August 9, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

7

Authors

Wenhao Zhu, Yunzhe Lv, Qingxiu Dong, Fei Yuan, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, Lei Li

Links

Abstract / PDF

Why It Matters For Business

You can upgrade an English LLM to handle multiple non-English languages without huge data or retraining costs by adding parallel translation tasks and translated instructions; this saves time and compute compared to building language-specific models from scratch.

Summary TLDR

The authors show you can boost a pre-trained English LLaMA-7B model on non-English tasks by instruction‑tuning it with two kinds of bilingual data: (1) translated general instruction examples and (2) parallel translation pairs. Language-specific models (x-LLaMA) gain large QA and translation improvements; a single multilingual model (m-LLaMA) matches language-specific ones. They fit a simple scaling law that links translation performance to parallel data size and use it to allocate limited parallel data more efficiently.

Problem Statement

Pretrained LLMs are English-dominant and underperform on many non-English languages. Training separate models or heavy continued pretraining is costly. Can we extrapolate English LLM ability to other languages cheaply by aligning languages during instruction-tuning?

Main Contribution

Define cross-lingual instruction-tuning (CoIT): mix translated instruction examples and parallel translation tasks to align English and a target language.

Define multilingual instruction-tuning (MuIT): mix resources for many languages to get a single multilingual LLaMA.

Estimate a scaling law relating translation performance to parallel-data scale and use it to optimize data allocation under a budget.

Show large empirical gains on QA and translation for six challenging languages without vocabulary extension or massive continued pretraining.
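The data mixing behind CoIT can be pictured with a small sketch. It assumes Alpaca-style records with `instruction`, `input` and `output` fields; the prompt wording, the function names and the English-to-target translation direction are illustrative assumptions, not the authors' exact templates or pipeline.

```python
import random

# Hypothetical prompt wording; the exact template used in the paper is not given here.
TRANSLATION_PROMPT = "Translate the following sentence from {src} to {tgt}."


def translation_example(src_text, tgt_text, src_lang, tgt_lang):
    """Cast one parallel sentence pair as an instruction-following record."""
    return {
        "instruction": TRANSLATION_PROMPT.format(src=src_lang, tgt=tgt_lang),
        "input": src_text,
        "output": tgt_text,
    }


def build_coit_mix(english_instructions, translated_instructions,
                   parallel_pairs, target_lang, seed=0):
    """Mix English instructions, their translations, and translation tasks."""
    examples = list(english_instructions) + list(translated_instructions)
    for en_text, tgt_text in parallel_pairs:
        # Assumption: English -> target direction only.
        examples.append(translation_example(en_text, tgt_text, "English", target_lang))
    random.Random(seed).shuffle(examples)  # fixed seed keeps the mix reproducible
    return examples


# Toy records for illustration only (Spanish is not one of the six languages studied).
mix = build_coit_mix(
    english_instructions=[{"instruction": "Name a primary color.", "input": "", "output": "Red."}],
    translated_instructions=[{"instruction": "Nombra un color primario.", "input": "", "output": "Rojo."}],
    parallel_pairs=[("Good morning.", "Buenos días.")],
    target_lang="Spanish",
)
print(len(mix))
```

For a MuIT-style multilingual mix, the same helper could be called once per language and the resulting lists concatenated before shuffling.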

Key Findings

Cross-lingual instruction tuning (x-LLaMA) substantially improves non-English QA accuracy over an English-only instruction-tuned model (Alpaca-7B).

Numbers: Average +27.83% answer accuracy across six languages (XQUAD & MLQA)

x-LLaMA also raises translation quality over previous LLaMA-based models.

Numbers: Average +18.89% (FLORES-101, COMET) vs prior LLaMA baselines

A single multilingual model (m-LLaMA) can match per-language x-LLaMAs and follow multilingual instructions.

Numbers: m-LLaMA achieves comparable performance to x-LLaMAs on QA and translation (Figure 4 and related text)

Translation performance improves predictably as you add parallel data; the paper fits a power-law scaling curve with a negative exponent, so each additional batch of data yields a smaller gain.

Numbers: Fitted formulas per language (e.g. En-Ar: 100 - 42.5 * (0.04 * x)^(-0.11)) and rising COMET in Figure 3
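To get a feel for the quoted En-Ar curve, it can be evaluated directly. The sketch below uses only the formula given above; the unit of x (for example sentence pairs) is not stated in this summary, so the sample sizes are illustrative assumptions.

```python
def fitted_comet_en_ar(x):
    """En-Ar curve quoted above: 100 - 42.5 * (0.04 * x) ** (-0.11)."""
    if x <= 0:
        raise ValueError("x must be positive")
    return 100.0 - 42.5 * (0.04 * x) ** (-0.11)


def marginal_gain(x, step):
    """Predicted score increase from adding `step` more units of parallel data."""
    return fitted_comet_en_ar(x + step) - fitted_comet_en_ar(x)


# Illustrative data sizes; the unit of x is an assumption, not from the source.
for size in (10_000, 100_000, 1_000_000):
    print(size, round(fitted_comet_en_ar(size), 2), round(marginal_gain(size, 10_000), 4))
```

The predicted score rises with x but never reaches 100, and the gain per added chunk shrinks, which is what makes the curve useful for deciding when more parallel data stops paying off.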

At larger data budgets, allocation optimized with the scaling law beats uniform allocation across languages.

Numbers: +0.48 COMET, +0.69 BLEURT, +0.59 BLEU at 1.2M budget vs uniform
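One way to turn fitted curves into an allocation is a greedy marginal-gain procedure, sketched below. This is a swapped-in illustration, not the solver used in the paper. Only the En-Ar parameters come from the text above; the other two curves, the language labels, the chunk size and the per-language floor are hypothetical values chosen for the example.

```python
import heapq


def predicted_score(params, x):
    """Scaling curve of the form 100 - alpha * (gamma * x) ** (-beta)."""
    alpha, gamma, beta = params
    return 100.0 - alpha * (gamma * x) ** (-beta)


def greedy_allocate(curves, budget, step, floor):
    """Hand out the budget in chunks of `step`, always to the largest predicted gain."""
    if floor <= 0 or step <= 0:
        raise ValueError("floor and step must be positive")
    alloc = {lang: floor for lang in curves}
    remaining = budget - floor * len(curves)
    if remaining < 0:
        raise ValueError("budget too small for the per-language floor")

    def gain(lang):
        x = alloc[lang]
        return predicted_score(curves[lang], x + step) - predicted_score(curves[lang], x)

    # Max-heap on predicted gain (heapq is a min-heap, so gains are negated).
    heap = [(-gain(lang), lang) for lang in curves]
    heapq.heapify(heap)
    while remaining >= step:
        _, lang = heapq.heappop(heap)
        alloc[lang] += step
        remaining -= step
        heapq.heappush(heap, (-gain(lang), lang))
    return alloc


def average_score(curves, alloc):
    """Mean predicted score over all languages for a given allocation."""
    return sum(predicted_score(curves[l], alloc[l]) for l in curves) / len(curves)


CURVES = {
    "en-ar": (42.5, 0.04, 0.11),   # parameters quoted in the text above
    "lang-b": (30.0, 0.10, 0.15),  # hypothetical curve
    "lang-c": (55.0, 0.02, 0.09),  # hypothetical curve
}
BUDGET = 1_200_000  # matches the 1.2M budget mentioned above; unit assumed
optimized = greedy_allocate(CURVES, BUDGET, step=10_000, floor=10_000)
uniform = {lang: BUDGET // len(CURVES) for lang in CURVES}
print(optimized)
print(average_score(CURVES, optimized), average_score(CURVES, uniform))
```

Because each curve is concave in x, giving every chunk to the language with the largest predicted gain cannot do worse than the uniform split on the same chunk grid. Real gains depend on how well the fitted curves match held-out results.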

Multilingual semantic representations align in middle layers after tuning.

Numbers: Layer-wise representation overlap observed in middle layers (Figure 5)

Results

Accuracy

Value: x-LLaMA average across 6 languages: +27.83% vs Alpaca-7B

Baseline: Alpaca-7B (English instruction-tuned)

Translation quality (COMET on FLORES-101)

Value: x-LLaMA outperforms prior LLaMA-based models by average +18.89%

Baseline: Previous LLaMA-based models (e.g., Bayling, Parrot, Bigtrans)

Multilingual allocation gain

Value: Optimized vs uniform allocation: +0.48 COMET, +0.69 BLEURT, +0.59 BLEU at 1.2M budget

Baseline: Uniform data allocation across 6 languages

Representation alignment

Value: Middle layers show cross-language representation overlap after MuIT

Baseline: Alpaca-7B (no multilingual tuning)

Who Should Care

What To Try In 7 Days

Take your English instruction-tuned LLM, add a small parallel corpus for one target language (for example the open WIKIMATRIX or NEWSCOMMENTARY data), and instruction-tune on it.

Translate your existing instruction dataset into the target language and include both English and translated pairs during tuning.

Measure translation quality (COMET) and QA accuracy before/after to verify gains; use scaling-law curves to estimate returns from more parallel data.
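For the before/after QA check, a quick exact-match score is often enough to see whether tuning helped. The sketch below is a simplified stand-in: its normalization (Unicode NFKC, lowercasing, punctuation and whitespace stripping) is an assumption and not the official XQUAD or MLQA scoring script, and the toy answers are made up for illustration.

```python
import unicodedata


def normalize(text):
    """Lowercase, apply NFKC, drop punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(text.split())


def exact_match(predictions, references):
    """Percentage of predictions equal to any gold answer after normalization."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    hits = 0
    for pred, golds in zip(predictions, references):
        if normalize(pred) in {normalize(g) for g in golds}:
            hits += 1
    return 100.0 * hits / len(predictions)


# Toy data: outputs from the model before and after tuning on the same questions.
gold = [["Paris"], ["1969"]]
before = ["paris.", "1970"]
after = ["Paris", "1969"]
print(exact_match(before, gold), exact_match(after, gold))
```

Translation quality would be scored separately with a COMET model such as wmt22-comet-da, which is listed under Context Entities.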

Optimization Features

Token Efficiency

  • no vocabulary extension (relies on the existing LLaMA tokenizer's byte-level fallback)

Infra Optimization

  • 8x A100 training configuration

Training Optimization

  • full-parameter instruction tuning
  • use of FSDP for training scale

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments run on LLaMA-7B only; transfer to larger/smaller models is not shown.
  • Improvement depends on available parallel data; languages more distant from English need more data, according to the fitted scaling law.
  • No vocabulary extension: tokenization is less efficient for some languages and slows encoding/decoding.
  • Automatic evaluation uses ChatGPT as judge, which can be biased and imperfect.

When Not To Use

  • When you already have large monolingual corpora and prefer vocabulary extension and heavy continued pretraining.
  • When low-latency tokenization is critical and byte-level tokenization overhead is unacceptable.
  • When parallel data for the target language is essentially unavailable.

Failure Modes

  • Model may generate English answers even for non-English instructions (mixing languages).
  • For languages with very low similarity to English, alignment may need large parallel corpora and still lag.
  • ChatGPT-based evaluation may over- or under-estimate the quality improvements that human raters would report.

Core Entities

Models

  • LLaMA-7B
  • x-LLaMA-7B
  • m-LLaMA-7B

Metrics

  • COMET
  • BLEURT
  • BLEU
  • Exact Match
  • ChatGPT quality eval

Datasets

  • WIKIMATRIX
  • NEWSCOMMENTARY
  • ALPACA
  • FLORES-101
  • XQUAD
  • MLQA
  • MI-EVAL

Benchmarks

  • XQUAD
  • MLQA
  • FLORES-101

Context Entities

Models

  • Alpaca-7B
  • Parrot-7B
  • Bayling-7B
  • Chinese-Alpaca-7B
  • Bigtrans-13B
  • M2M-12B
  • NLLB-1.3B
  • ChatGPT
  • Google Translate

Metrics

  • WMT COMET models (wmt22-comet-da)
  • BLEURT-20

Datasets

  • mC4 (used in other continued pretraining baselines)

Benchmarks

  • Human-supervised MT systems (for reference in comparisons)