You can adapt LLaMA to other languages cheaply: vocabulary changes are often unnecessary

Overview

Decision Snapshot: Needs Validation

Experiments show a practical path: avoid vocabulary extension at small scales, use moderate further pretraining plus large instruction tuning, and prefer multilingual joint training to preserve English.

Citations: 6

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 50%

Authors

Jun Zhao, Zhihao Zhang, Luhui Gao, Qi Zhang, Tao Gui, Xuanjing Huang

Why It Matters For Business

You can cheaply adapt an English-trained LLM to other languages: keep the original tokenizer, do modest further pretraining, and invest in instruction tuning to get usable responses without massive compute.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist, Engineering Lead, Founder

Summary TLDR

This paper tests how to transfer LLaMA's generation and instruction-following skills into other languages. Key takeaways:

Extending the tokenizer vocabulary often hurts when you only have small-to-medium extra pretraining.

Modest further pretraining (hundreds of millions of tokens) mainly helps fluency, not world knowledge.

Instruction tuning with hundreds of thousands of examples fixes response quality and is far cheaper than large-scale further pretraining.

Training only on the target language can harm the original English ability, but multilingual joint training preserves it.

Results hold for Chinese and 13 low-resource languages.

Problem Statement

Mainstream LLMs are trained mainly on English. This work asks which steps actually matter to make an English-focused model (LLaMA) work well in a different language: add tokens, further pretrain, or instruction-tune? The goal is to find practical, low-cost strategies to transfer language and instruction-following skills.

Main Contribution

Systematic empirical study of vocabulary extension, further pretraining scale, and instruction tuning for language transfer from English-focused LLaMA.

Shows small-scale further pretraining on the original vocabulary can outperform models with extended vocabularies pretrained on far more data.

Key Findings

Extending the tokenizer vocabulary can hurt transfer at small-to-moderate retraining scales.

Numbers: 0.5B vs 30B tokens; LLM-Eval AVG 1.562 (LLaMA 0.5B pretrain) vs 1.244 (Chinese LLaMA) (Table 1)

Practical Use: If you only have up to tens of billions of extra tokens, keep the original vocabulary and further-pretrain rather than adding a new tokenizer.

Evidence Ref: Table 1; main text

Large-scale further pretraining (tens of billions of tokens) did not meaningfully raise measured world knowledge on benchmarks.

Numbers: No significant accuracy gain on C-Eval / MMLU / AGI-Eval across LLaMA, Chinese LLaMA, Open Chinese LLaMA (Figure 2)

Practical Use: Don't expect tens of billions of extra tokens to deliver big knowledge gains; to raise knowledge you likely need much larger pretraining or larger models.

Evidence Ref: Figure 2; main text

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
LLM-Eval average response quality (AVG)	≈2.40 after 950K SFT	1.24 (Chinese LLaMA without SFT in some settings)	≈+1.16	LLM-Eval (Chinese)	Table 1 shows LLaMA with 1M pretrain and large SFT AVG ≈2.399 ≈ Chinese LLaMA / Open Chinese LLaMA	Table 1
Perplexity (Chinese / English)	Chinese 10.151→5.249; English 14.691→198.84	LLaMA (no further pretrain)	Chinese ↓ ~48%, English ↑ large (catastrophic)	200k Chinese / 200k English samples (perplexity test)	Table 2 reports perplexity changes across further pretraining scales	Table 2
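
The Delta column can be rechecked with a few lines of arithmetic. This is a minimal sketch, assuming the three-decimal values quoted in this summary (1.244 and 2.399 for LLM-Eval AVG, plus the Table 2 perplexities) are matching before/after pairs; the variable and function names are illustrative, not from the paper.

```python
# Values copied from the Results table and Key Findings in this summary.
llm_eval_baseline = 1.244   # Chinese LLaMA, LLM-Eval AVG
llm_eval_tuned = 2.399      # LLaMA after large-scale SFT, LLM-Eval AVG
ppl_zh_before, ppl_zh_after = 10.151, 5.249
ppl_en_before, ppl_en_after = 14.691, 198.84


def relative_change(before: float, after: float) -> float:
    """Signed fractional change from `before` to `after`."""
    return (after - before) / before


llm_eval_delta = llm_eval_tuned - llm_eval_baseline
zh_change = relative_change(ppl_zh_before, ppl_zh_after)
en_ratio = ppl_en_after / ppl_en_before

print(f"LLM-Eval delta: {llm_eval_delta:+.3f}")
print(f"Chinese perplexity change: {zh_change:+.1%}")
print(f"English perplexity multiplier: {en_ratio:.1f}x")
```

The Chinese drop comes out near 48%, matching the table, while English perplexity grows by more than an order of magnitude, which is what the table labels catastrophic.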

What To Try In 7 Days

Run instruction tuning on the target language with 100k–1M translated/gathered examples and measure LLM-Eval scores.

Avoid adding a new tokenizer if you only plan tens of billions of extra tokens; test both ways on a small holdout.

If you must further pretrain, mix English and target-language data to avoid hurting English ability and measure perplexity for both languages.
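
The last item can be sketched as follows, under stated assumptions: `mix_corpora` and its `english_fraction` parameter are hypothetical names, no mixing ratio is given in this summary, and `perplexity` expects per-token natural-log probabilities that you would obtain from your own model.

```python
import math
import random
from typing import Iterator, Sequence


def mix_corpora(target: Sequence[str], english: Sequence[str],
                english_fraction: float, n_samples: int,
                seed: int = 0) -> Iterator[str]:
    """Yield n_samples documents, drawing an English one with the given probability."""
    if not 0.0 <= english_fraction <= 1.0:
        raise ValueError("english_fraction must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    for _ in range(n_samples):
        pool = english if rng.random() < english_fraction else target
        yield rng.choice(pool)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy corpora, for illustration only.
zh_docs = ["zh-doc-1", "zh-doc-2"]
en_docs = ["en-doc-1", "en-doc-2"]
batch = list(mix_corpora(zh_docs, en_docs, english_fraction=0.3, n_samples=10))
print(batch)
```

Run `perplexity` on held-out text in each language before and after further pretraining, so that a gain in the target language is not read in isolation from a loss in English.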

Optimization Features

Training Optimization

LoRA

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Most experiments center on Chinese; other languages tested but less deeply.

Automated evaluation uses GPT-4 which may introduce judge bias.

When Not To Use

When you require strong, broad improvements in factual knowledge; those likely need much larger pretraining or larger model sizes.

When you can afford full retraining with multi-trillion-token corpora; vocabulary extension may help at that scale.

Failure Modes

Vocabulary extension can reduce transfer quality at small-to-medium retraining scales.

Single-language further pretraining can catastrophically degrade English ability.
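
The second failure mode is easy to guard against with an automated check. A minimal sketch, assuming you already track English perplexity before and after further pretraining; the function name and the `max_ratio` threshold of 1.5 are hypothetical choices, not values from the paper.

```python
def english_regressed(ppl_before: float, ppl_after: float,
                      max_ratio: float = 1.5) -> bool:
    """Return True if English perplexity grew by more than max_ratio.

    max_ratio is an illustrative threshold; tune it for your own setup.
    """
    if ppl_before <= 0 or ppl_after <= 0:
        raise ValueError("perplexities must be positive")
    return ppl_after / ppl_before > max_ratio


# English perplexities reported in the Results table (Table 2).
print(english_regressed(14.691, 198.84))
# A small hypothetical drift that stays under the threshold.
print(english_regressed(14.691, 15.0))
```

A check like this can gate a training run: if it fires, mix English data back in rather than continuing with target-language data only.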

Core Entities

Models

LLaMA-7B, LLaMA-13B, LLaMA2-7B, Chinese LLaMA 7B, Chinese LLaMA2 7B, Open Chinese LLaMA

Metrics

Accuracy, Perplexity

Datasets

BELLE, Bactrian-X, LLM-Eval, C-Eval, MMLU, AGI-Eval, GAOKAO-Bench

Benchmarks

LLM-Eval, C-Eval, MMLU, AGI-Eval, GAOKAO-Bench
