You can adapt LLaMA to other languages cheaply: vocabulary changes are often unnecessary

January 2, 2024 | 7 min

Overview

Decision Snapshot: Needs Validation

Experiments show a practical path: avoid vocabulary extension at small scales, use moderate further pretraining plus large instruction tuning, and prefer multilingual joint training to preserve English.

Citations: 6

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 50%

Authors

Jun Zhao, Zhihao Zhang, Luhui Gao, Qi Zhang, Tao Gui, Xuanjing Huang

Why It Matters For Business

You can cheaply adapt an English-trained LLM to other languages: keep the original tokenizer, do modest further pretraining, and invest in instruction tuning to get usable responses without massive compute.

Summary (TL;DR)

This paper tests how to transfer LLaMA's generation and instruction-following skills into other languages. Key takeaways: extending the tokenizer vocabulary often hurts when you only have small-to-medium extra pretraining; modest further pretraining (hundreds of millions of tokens) mainly helps fluency, not world knowledge; instruction tuning with hundreds of thousands of examples fixes response quality and is far cheaper; training only on the target language can harm original English ability, but multilingual joint training preserves it. Results hold for Chinese and 13 low-resource languages.

Problem Statement

Mainstream LLMs are trained mainly on English. This work asks which steps actually matter to make an English-focused model (LLaMA) work well in a different language: add tokens, further pretrain, or instruction-tune? The goal is to find practical, low-cost strategies to transfer language and instruction-following skills.

Main Contribution

Systematic empirical study of vocabulary extension, further pretraining scale, and instruction tuning for language transfer from English-focused LLaMA.

Shows small-scale further pretraining on the original vocabulary can outperform models with extended vocabularies pretrained on far more data.

Key Findings

Extending the tokenizer vocabulary can hurt transfer at small-to-moderate retraining scales.

Numbers: 0.5B vs 30B tokens; LLM-Eval AVG 1.562 (LLaMA 0.5B pretrain) vs 1.244 (Chinese LLaMA) (Table 1)

Practical Use: If you only have up to tens of billions of extra tokens, keep the original vocabulary and further-pretrain rather than adding a new tokenizer.

Evidence Ref: Table 1; main text

Large-scale further pretraining (tens of billions) did not meaningfully raise measured world knowledge on benchmarks.

Numbers: No significant accuracy gain on C-Eval / MMLU / AGI-Eval across LLaMA, Chinese LLaMA, Open Chinese LLaMA (Figure 2)

Practical Use: Don't expect tens of billions of extra tokens to deliver big knowledge gains; to raise knowledge you likely need much larger pretraining or larger models.

Evidence Ref: Figure 2; main text

Results

Metric: LLM-Eval average response quality (AVG)
Value: ≈2.40 after 950K SFT
Baseline: 1.24 (Chinese LLaMA without SFT in some settings)
Delta: ≈+1.16
Split / Dataset: LLM-Eval (Chinese)
Evidence: Table 1 shows LLaMA with 1M pretrain and large SFT at AVG ≈2.399, roughly equal to Chinese LLaMA / Open Chinese LLaMA
Evidence Ref: Table 1

Metric: Perplexity (Chinese / English)
Value: Chinese 10.151 → 5.249; English 14.691 → 198.84
Baseline: LLaMA (no further pretrain)
Delta: Chinese ↓ ~48%, English ↑ large (catastrophic)
Split / Dataset: 200k Chinese / 200k English samples (perplexity test)
Evidence: Table 2 reports perplexity changes across further pretraining scales
Evidence Ref: Table 2
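
The deltas in these two result rows follow from simple arithmetic. A minimal sketch, assuming the perplexity figures pair up as before → after (Chinese 10.151 → 5.249, English 14.691 → 198.84) and using the rounded LLM-Eval averages; the function and variable names are illustrative only.

```python
def relative_change(before: float, after: float) -> float:
    """Return the signed relative change from `before` to `after` as a fraction."""
    return (after - before) / before


# Values as read from the Results rows (assumed pairing: before -> after).
chinese_ppl = (10.151, 5.249)
english_ppl = (14.691, 198.84)
llm_eval_avg = (1.24, 2.40)  # baseline AVG, AVG after large-scale SFT

# Chinese perplexity falls by roughly half.
chinese_delta = relative_change(*chinese_ppl)

# English perplexity grows by more than an order of magnitude.
english_ratio = english_ppl[1] / english_ppl[0]

# Absolute gain in the LLM-Eval average score.
avg_gain = llm_eval_avg[1] - llm_eval_avg[0]

print(f"Chinese perplexity change: {chinese_delta:+.1%}")
print(f"English perplexity ratio:  {english_ratio:.1f}x")
print(f"LLM-Eval AVG gain:         {avg_gain:+.2f}")
```

The Chinese figure matches the "~48%" drop in the Delta column, and the English ratio shows why that row is labelled catastrophic.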

What To Try In 7 Days

Run instruction tuning on the target language with 100k–1M translated/gathered examples and measure LLM-Eval scores.

Avoid adding a new tokenizer if you only plan tens of billions of extra tokens; test both ways on a small holdout.

If you must further pretrain, mix English and target-language data to avoid hurting English ability and measure perplexity for both languages.
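
The last step above (mixing English with target-language data and tracking perplexity in both languages) can be sketched as follows. This is a sketch under stated assumptions, not the paper's training recipe: `mix_corpora`, `perplexity`, the 50/50 ratio, and the toy documents are all hypothetical, and the per-token log-probabilities would come from whatever model and evaluation harness you use.

```python
import math
import random
from typing import List, Sequence


def mix_corpora(english_docs: Sequence[str], target_docs: Sequence[str],
                english_fraction: float, total: int, seed: int = 0) -> List[str]:
    """Sample a shuffled training mix with a fixed share of English documents.

    Sampling is with replacement, so either corpus may be smaller than its share.
    """
    if not 0.0 <= english_fraction <= 1.0:
        raise ValueError("english_fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_english = round(total * english_fraction)
    mix = rng.choices(english_docs, k=n_english)
    mix += rng.choices(target_docs, k=total - n_english)
    rng.shuffle(mix)
    return mix


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy data standing in for real corpora (assumption: one string per document).
english = ["en doc 1", "en doc 2"]
chinese = ["zh doc 1", "zh doc 2", "zh doc 3"]
batch = mix_corpora(english, chinese, english_fraction=0.5, total=10)

# Toy log-probs: a model that assigns every token probability 0.25.
uniform_ppl = perplexity([math.log(0.25)] * 8)
print(len(batch), round(uniform_ppl, 3))
```

Computing `perplexity` on held-out English and target-language text before and after further pretraining makes any English regression visible early, before it shows up in downstream quality.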

Optimization Features

Training Optimization
LoRA

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Most experiments center on Chinese; other languages were tested, but less deeply.

Automated evaluation uses GPT-4, which may introduce judge bias.

When Not To Use

When you require strong, broad improvements in factual knowledge; those likely need much larger pretraining or larger model sizes.

When you can afford full retraining with multi-trillion-token corpora; vocabulary extension may help at that scale.

Failure Modes

Vocabulary extension can reduce transfer quality at small-to-medium retraining scales.

Single-language further pretraining can catastrophically degrade English ability.

Core Entities

Models

LLaMA-7B, LLaMA-13B, LLaMA2-7B, Chinese LLaMA 7B, Chinese LLaMA2 7B, Open Chinese LLaMA

Metrics

Accuracy, Perplexity

Datasets

BELLE, Bactrian-X, LLM-Eval, C-Eval, MMLU, AGI-Eval, GAOKAO-Bench

Benchmarks

LLM-Eval, C-Eval, MMLU, AGI-Eval, GAOKAO-Bench