You can adapt LLaMA to other languages cheaply: vocab changes often unnecessary

January 2, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.8

Citation Count

6

Authors

Jun Zhao, Zhihao Zhang, Luhui Gao, Qi Zhang, Tao Gui, Xuanjing Huang

Why It Matters For Business

You can cheaply adapt an English-trained LLM to other languages: keep the original tokenizer, do modest further pretraining, and invest in instruction tuning to get usable responses without massive compute.

Summary TLDR

This paper tests how to transfer LLaMA's generation and instruction-following skills into other languages. Key takeaways: extending the tokenizer vocabulary often hurts when you only have small-to-medium extra pretraining; modest further pretraining (hundreds of millions of tokens) mainly helps fluency, not world knowledge; instruction tuning with hundreds of thousands of examples fixes response quality and is far cheaper; training only on the target language can harm original English ability, but multilingual joint training preserves it. Results hold for Chinese and 13 low-resource languages.

Problem Statement

Mainstream LLMs are trained mainly on English. This work asks which steps actually matter to make an English-focused model (LLaMA) work well in a different language: add tokens, further pretrain, or instruction-tune? The goal is to find practical, low-cost strategies to transfer language and instruction-following skills.

Main Contribution

Systematic empirical study of vocabulary extension, further pretraining scale, and instruction tuning for language transfer from English-focused LLaMA.

Shows small-scale further pretraining on the original vocabulary can outperform models with extended vocabularies pretrained on far more data.

Demonstrates that instruction tuning (hundreds of thousands of examples) yields strong response quality with far less compute than large-scale pretraining.

Reports cross-language experiments on 13 low-resource languages and documents 2–5% code-switching during transfer.

Key Findings

Extending the tokenizer vocabulary can hurt transfer at small-to-moderate retraining scales.

Numbers: 0.5B vs 30B tokens; LLM-Eval AVG 1.562 (LLaMA 0.5B pretrain) vs 1.244 (Chinese LLaMA) (Table 1)

Large-scale further pretraining (tens of billions of tokens) did not meaningfully raise measured world knowledge on benchmarks.

Numbers: No significant accuracy gain on C-Eval / MMLU / AGI-Eval across LLaMA, Chinese LLaMA, Open Chinese LLaMA (Figure 2)

Instruction tuning with large instruction datasets improves response quality cheaply.

Numbers: 950K SFT yields AVG ≈2.40 for LLaMA variants, matching Open Chinese LLaMA (Table 1)

Single-language further pretraining can degrade original English performance.

Numbers: Chinese perplexity fell (10.151 → 5.249) while English perplexity rose (14.691 → 198.84) after further pretraining (LLaM

Code-switching appears in transfer outputs at low but nontrivial rates.

Numbers: Code-switching rate ≈2%–5% across target languages (Figure 4)

Results

LLM-Eval average response quality (AVG)

Value: ≈2.40 after 950K SFT

Baseline: 1.24 (Chinese LLaMA without SFT in some settings)

Perplexity (Chinese / English)

Value: Chinese 10.151 → 5.249; English 14.691 → 198.84

Baseline: LLaMA (no further pretrain)
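
Perplexity is conventionally the exponential of the mean per-token negative log-likelihood, so the English figures above correspond to a large rise in average per-token loss. The following is a minimal sketch of that relationship, assuming you already have per-token natural-log probabilities from your model. The `perplexity` helper and the sample log-probabilities are illustrative and not taken from the paper, and the paper's exact tokenization and normalization may differ.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p(token))), using natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical log-probs for the same English text under two checkpoints.
before = [-2.1, -3.0, -2.6, -2.9]
after = [-5.0, -5.6, -5.1, -5.4]
print(f"before: {perplexity(before):.2f}, after: {perplexity(after):.2f}")

# Change in mean per-token loss (in nats) implied by the reported
# English perplexities, 14.691 -> 198.84.
delta_nll = math.log(198.84) - math.log(14.691)
print(f"implied increase in mean NLL: {delta_nll:.2f} nats")
```

Tracking this quantity on held-out English and target-language text after each training stage is one simple way to notice the kind of degradation reported here.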

Response quality across 13 low-resource languages (LLM-Eval AVG)

Value: average AVG 0.837 (1k SFT) → 1.827 (65k SFT)

Baseline: 1k SFT

Who Should Care

What To Try In 7 Days

Run instruction tuning on the target language with 100k–1M translated/gathered examples and measure LLM-Eval scores.

Avoid extending the tokenizer vocabulary if you plan no more than tens of billions of extra pretraining tokens; test both ways on a small holdout.

If you must further pretrain, mix English and target-language data to avoid hurting English ability and measure perplexity for both languages.
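
The data-mixing suggestion above can be sketched as a simple sampler that interleaves English and target-language documents. This is only an illustration under stated assumptions: the `mix_corpora` helper, the placeholder document names, and the 30% English share are all hypothetical, and this summary does not give the mixing ratio used in the paper.

```python
import random

def mix_corpora(target_docs, english_docs, english_fraction=0.5,
                n_samples=1000, seed=0):
    """Sample a training stream in which roughly `english_fraction`
    of the documents come from the English pool."""
    if not 0.0 <= english_fraction <= 1.0:
        raise ValueError("english_fraction must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    stream = []
    for _ in range(n_samples):
        pool = english_docs if rng.random() < english_fraction else target_docs
        stream.append(rng.choice(pool))
    return stream

# Placeholder corpora; real ones would be lists of text documents.
target = ["zh_doc_1", "zh_doc_2", "zh_doc_3"]
english = ["en_doc_1", "en_doc_2"]

stream = mix_corpora(target, english, english_fraction=0.3, n_samples=10_000)
share = sum(doc.startswith("en_") for doc in stream) / len(stream)
print(f"English share of the stream: {share:.2f}")
```

Treat the English fraction as a knob to tune: compare English and target-language perplexity on held-out text at a few settings before committing to a longer run.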

Optimization Features

Training Optimization

  • LoRA
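
LoRA keeps a weight matrix frozen and trains a low-rank update made of two small matrices, which is why it cuts the number of trainable parameters so sharply. The arithmetic below is a rough sketch of that saving for a single matrix. The dimension 4096 and the rank 8 are illustrative assumptions; this summary does not state the rank or the set of adapted matrices used in the paper.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters for one LoRA-adapted matrix:
    B has shape (d_out, rank) and A has shape (rank, d_in)."""
    return rank * (d_out + d_in)

d_out = d_in = 4096  # illustrative square projection size (assumption)
rank = 8             # illustrative rank (assumption)

full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```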

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Most experiments center on Chinese; other languages tested but less deeply.
  • Automated evaluation uses GPT-4, which may introduce judge bias.
  • Exact training recipes and some checkpoints are not fully released for reproduction.

When Not To Use

  • When you require strong, broad factual knowledge improvements; those likely need much larger pretraining runs or larger models.
  • When you can afford full retraining with multi-trillion-token corpora; vocabulary extension may help at that scale.

Failure Modes

  • Vocabulary extension can reduce transfer quality at small-to-medium retraining scales.
  • Single-language further pretraining can catastrophically degrade English ability.
  • Code-switching (2%–5%) may produce mixed-language outputs that confuse users.
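
To monitor the code-switching failure mode in your own outputs, one rough heuristic is to flag responses whose letters span more than one Unicode script. The sketch below is an assumption about how you might approximate this, not the measurement procedure used in the paper. It cannot detect switching between languages that share a script, and it will flag legitimate mixed-script text such as Japanese or an English product name inside a Chinese sentence.

```python
import unicodedata

def scripts_in(text):
    """Return coarse script labels for the alphabetic characters in `text`."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name:
            # The first word of the Unicode character name is a coarse
            # script label, e.g. LATIN, CJK, CYRILLIC, ARABIC.
            scripts.add(name.split(" ")[0])
    return scripts

def code_switch_rate(responses):
    """Fraction of responses whose letters span more than one script."""
    if not responses:
        return 0.0
    mixed = sum(len(scripts_in(r)) > 1 for r in responses)
    return mixed / len(responses)

# Hypothetical model outputs; only the second mixes scripts.
responses = ["今天天气很好。", "这个 model 很好用。", "Hello there!", "Привет, мир"]
rate = code_switch_rate(responses)
print(f"code-switching rate: {rate:.0%}")
```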

Core Entities

Models

  • LLaMA-7B
  • LLaMA-13B
  • LLaMA2-7B
  • Chinese LLaMA 7B
  • Chinese LLaMA2 7B
  • Open Chinese LLaMA

Metrics

  • Accuracy
  • Perplexity

Datasets

  • BELLE
  • Bactrian-X
  • LLM-Eval
  • C-Eval
  • MMLU
  • AGI-Eval
  • GAOKAO-Bench

Benchmarks

  • LLM-Eval
  • C-Eval
  • MMLU
  • AGI-Eval
  • GAOKAO-Bench