Overview
Production Readiness: 0.6
Novelty Score: 0.5
Cost Impact Score: 0.8
Citation Count: 6
Why It Matters For Business
You can cheaply adapt an English-trained LLM to other languages: keep the original tokenizer, do modest further pretraining, and invest in instruction tuning to get usable responses without massive compute.
Summary TLDR
This paper tests how to transfer LLaMA's generation and instruction-following skills to other languages. Key takeaways: extending the tokenizer vocabulary often hurts when only small-to-medium amounts of extra pretraining are available; modest further pretraining (hundreds of millions of tokens) mainly improves fluency, not world knowledge; instruction tuning with hundreds of thousands of examples fixes response quality and is far cheaper; and training only on the target language can harm the original English ability, whereas multilingual joint training preserves it. The results hold for Chinese and 13 low-resource languages.
Problem Statement
Mainstream LLMs are trained mainly on English. This work asks which steps actually matter to make an English-focused model (LLaMA) work well in a different language: add tokens, further pretrain, or instruction-tune? The goal is to find practical, low-cost strategies to transfer language and instruction-following skills.
Main Contribution
- Systematic empirical study of vocabulary extension, further pretraining scale, and instruction tuning for language transfer from English-focused LLaMA.
- Shows small-scale further pretraining on the original vocabulary can outperform models with extended vocabularies pretrained on far more data.
- Demonstrates that instruction tuning (hundreds of thousands of examples) yields strong response quality with far less compute than large-scale pretraining.
- Reports cross-language experiments on 13 low-resource languages and documents 2–5% code-switching during transfer.
Key Findings
- Extending the tokenizer vocabulary can hurt transfer at small-to-moderate retraining scales.
- Large-scale further pretraining (tens of billions of tokens) did not meaningfully raise measured world knowledge on benchmarks.
- Instruction tuning with large instruction datasets improves response quality cheaply.
- Single-language further pretraining can degrade original English performance.
- Code-switching appears in transfer outputs at low but nontrivial rates.
Results
LLM-Eval average response quality (AVG)
Perplexity (Chinese / English)
Response quality across 13 low-resource languages (LLM-Eval AVG)
Who Should Care
What To Try In 7 Days
- Run instruction tuning on the target language with 100k–1M translated or gathered examples and measure LLM-Eval scores.
- Avoid extending the tokenizer vocabulary if you plan at most tens of billions of extra pretraining tokens; if in doubt, test both tokenizers on a small holdout.
- If you must further pretrain, mix English and target-language data to avoid hurting English ability, and measure perplexity for both languages.
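The data-mixing step above can be sketched as follows. This is a minimal illustration, not the paper's recipe: the function name `mix_corpora`, the 50/50 default ratio, and the document-level sampling are all assumptions made for the example.

```python
import random


def mix_corpora(english_docs, target_docs, target_fraction=0.5,
                n_samples=1000, seed=0):
    """Sample a training stream mixing English and target-language documents.

    `target_fraction` is a hypothetical knob: the expected share of
    target-language documents in the stream. The right ratio is not
    given in this summary and should be tuned empirically.
    """
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    mixed = []
    for _ in range(n_samples):
        if rng.random() < target_fraction:
            mixed.append(("target", rng.choice(target_docs)))
        else:
            mixed.append(("english", rng.choice(english_docs)))
    return mixed


# Toy usage with placeholder documents.
stream = mix_corpora(["an English doc"], ["一篇中文文档"], target_fraction=0.5)
target_share = sum(1 for lang, _ in stream if lang == "target") / len(stream)
print(f"target-language share: {target_share:.2f}")
```

Sampling with replacement keeps the sketch short; a real pipeline would more likely interleave shuffled shards and weight by token count rather than document count.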
Optimization Features
Training Optimization
- LoRA
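To give a sense of why LoRA keeps adaptation cheap, the sketch below compares the trainable parameters of one full weight matrix with those of a rank-r LoRA update (two factors of shapes d_out × r and r × d_in). The 4096 width matches LLaMA-7B's hidden size; the rank of 8 is an illustrative assumption, since this summary does not state the configuration the authors used.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates all d_in * d_out entries. LoRA freezes the
    matrix and trains two low-rank factors, B (d_out x rank) and
    A (rank x d_in), so it trains rank * (d_in + d_out) parameters.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


# One 4096 x 4096 projection with an assumed rank of 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, lora / full)  # 16777216 65536 0.00390625
```

At this assumed rank, the adapter trains under 0.4% of the parameters of the matrix it modifies.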
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Most experiments center on Chinese; other languages tested but less deeply.
- Automated evaluation uses GPT-4, which may introduce judge bias.
- Exact training recipes and some checkpoints are not fully released for reproduction.
When Not To Use
- When you require strong, broad improvements in factual knowledge; those likely need much larger pretraining runs or larger models.
- When you can afford full retraining with multi-trillion-token corpora; vocabulary extension may help at that scale.
Failure Modes
- Vocabulary extension can reduce transfer quality at small-to-medium retraining scales.
- Single-language further pretraining can markedly degrade English ability.
- Code-switching (2%–5%) may produce mixed-language outputs that confuse users.
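One rough way to monitor the code-switching failure mode on your own outputs is a script-based heuristic. The sketch below assumes a Chinese target language and flags a response as code-switched when it contains both CJK ideographs and Latin letters. This is an assumption for illustration, not necessarily how the paper measured its 2%–5% rate, and it will over-count responses that legitimately include Latin tokens such as product names or URLs.

```python
def has_cjk(text):
    """True if the text contains a character in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def has_latin(text):
    """True if the text contains an ASCII Latin letter."""
    return any(("a" <= ch <= "z") or ("A" <= ch <= "Z") for ch in text)


def code_switch_rate(responses):
    """Fraction of responses that mix CJK ideographs with Latin letters."""
    if not responses:
        return 0.0
    mixed = sum(1 for r in responses if has_cjk(r) and has_latin(r))
    return mixed / len(responses)


samples = ["你好，世界", "Hello world", "这是一个 test", "完全中文"]
print(code_switch_rate(samples))  # 0.25
```

For other target languages, the two script checks would need to be swapped for ranges or a language-identification model suited to that language pair.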
Core Entities
Models
- LLaMA-7B
- LLaMA-13B
- LLaMA2-7B
- Chinese LLaMA 7B
- Chinese LLaMA2 7B
- Open Chinese LLaMA
Metrics
- Accuracy
- Perplexity
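For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token natural-log probabilities; how those log-probabilities are obtained from a model is left out, and the function name is an assumption for this example.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum(log p_i)); lower means the model finds
    the text less surprising.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# A model that assigns probability 0.5 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 4))
```

Because the average is taken per token, perplexities from models with different tokenizers are not directly comparable, which is worth keeping in mind when comparing original and extended vocabularies.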
Datasets
- BELLE
- Bactrian-X
- LLM-Eval
- C-Eval
- MMLU
- AGI-Eval
- GAOKAO-Bench
Benchmarks
- LLM-Eval
- C-Eval
- MMLU
- AGI-Eval
- GAOKAO-Bench

