Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
You can adapt a large English-centric LLM to Chinese and science tasks with modest additional pre-training (~100B tokens) and targeted synthetic QA, adding domain value without a full retrain.
Summary TLDR
This paper reports a practical recipe for continual pre-training (CPT) of Llama‑3 (8B) to improve Chinese language ability and multidisciplinary scientific reasoning. The approach uses two CPT stages (bilingual adaptation and synthetic enhancement), a topic-based data mixture, a perplexity (PPL) easy→hard curriculum, and large-scale synthetic QA data (1.5B tokens). On the evaluated benchmarks, the continually pre-trained model Llama‑3‑SynE gains 8.81 points on C-Eval, 6.31 on CMMLU, 12.00 on MATH, and 4.13 on SciEval while largely retaining its original skills, using about 100B CPT tokens. The team releases data, checkpoints, and code.
Problem Statement
LLMs trained mainly on English data (e.g., Llama‑3) underperform on Chinese tasks and some scientific reasoning. Continual pre‑training can adapt such models but risks catastrophic forgetting. The practical question is how to design the CPT data, mixture, and curriculum, and whether synthetic scientific QA helps, so that the model gains Chinese and scientific skills without losing its original abilities under a limited token budget.
Main Contribution
A complete CPT pipeline for Llama‑3 (8B): bilingual adaptation followed by synthetic enhancement, with released data and code.
Design and validation of two data strategies: topic‑level mixture and a PPL (perplexity) based easy→hard curriculum for bilingual adaptation.
Large‑scale synthetic QA generation across nine scientific fields plus code QA, with evidence that synthetic QA markedly improves scientific reasoning.
Surrogate tuning on TinyLlama (1.1B) to explore hyperparameters, then transfer the best settings to Llama‑3 (8B).
Key Findings
C-Eval (Chinese) improved by 8.81 points after CPT.
CMMLU (Chinese multi‑task) improved by 6.31 points after CPT.
MATH (math reasoning) improved by 12.00 points after CPT.
SciEval (scientific reasoning average) improved by 4.13 points.
Total CPT token budget ≈ 100B tokens; synthetic data = 1.5B tokens.
On the TinyLlama surrogate, training on 4B normal tokens plus 1B synthetic tokens outperformed training on 5B normal tokens.
Synthetic noise tolerance: low corruption (~30%) causes little harm; high corruption (>50%) degrades performance.
The best synthetic ratio was around 20% (on TinyLlama), and an easy→hard (low→high PPL) curriculum helps.
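The token figures above can be related with a little arithmetic. The sketch below only restates numbers from this summary; the variable and function names are illustrative and do not come from the released code.

```python
# Figures taken from the summary above; names are illustrative only.
TOTAL_CPT_TOKENS = 100e9   # ~100B tokens across both CPT stages
SYNTHETIC_TOKENS = 1.5e9   # synthetic QA tokens in the released corpus

# TinyLlama surrogate setting: 4B normal tokens + 1B synthetic tokens.
surrogate_normal = 4e9
surrogate_synthetic = 1e9


def share(part: float, whole: float) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Synthetic data as a share of the whole CPT budget (not of a single stage).
synthetic_share_of_budget = share(SYNTHETIC_TOKENS, TOTAL_CPT_TOKENS)

# Synthetic ratio in the surrogate run, which matches the ~20% sweet spot.
surrogate_synthetic_ratio = share(
    surrogate_synthetic, surrogate_normal + surrogate_synthetic
)

print(f"synthetic share of total budget: {synthetic_share_of_budget:.1f}%")
print(f"synthetic ratio in surrogate run: {surrogate_synthetic_ratio:.1f}%")
```

The two percentages describe different things: the first is the synthetic share of the full ≈100B-token budget, while the second is the mixing ratio inside the small surrogate experiment.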
Results
C-Eval (few-shot, Chinese)
CMMLU (few-shot, Chinese multi-task)
MATH (few-shot, math reasoning)
SciEval (avg scientific reasoning)
HumanEval (code generation)
Total CPT tokens used
Synthetic data size in released corpus
Who Should Care
What To Try In 7 Days
Run surrogate CPT on a small model (TinyLlama) with 4B normal + 1B synthetic tokens to test gains quickly.
Generate a 100k–1M synthetic QA seed for your domain and mix it ~20% into CPT data.
Implement a PPL-based curriculum: sort new-domain instances from low→high PPL and continue pre-training on them for a few billion tokens.
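The curriculum and mixing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: per-token log-probabilities are assumed to have been computed already by some base-model scorer (stored here under a hypothetical `logprobs` key), and the ~20% ratio is applied by example count rather than by token count.

```python
import math
import random
from typing import Callable, Dict, List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """PPL = exp(-mean natural-log probability) over one document's tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def curriculum_order(
    docs: List[Dict], score: Callable[[Dict], Sequence[float]]
) -> List[Dict]:
    """Order documents easy -> hard, i.e. from low to high perplexity."""
    return sorted(docs, key=lambda d: perplexity(score(d)))


def mix_synthetic(
    normal: List[str], synthetic: List[str], ratio: float = 0.2, seed: int = 0
) -> List[str]:
    """Scatter synthetic examples into `normal` so they form about `ratio`
    of the output, while keeping the curriculum order of `normal` intact."""
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    rng = random.Random(seed)
    n_syn = min(len(synthetic), round(len(normal) * ratio / (1.0 - ratio)))
    picked = rng.sample(synthetic, n_syn)
    total = len(normal) + n_syn
    syn_slots = set(rng.sample(range(total), n_syn))
    normal_iter, syn_iter = iter(normal), iter(picked)
    return [
        next(syn_iter) if i in syn_slots else next(normal_iter)
        for i in range(total)
    ]


# Toy data: `logprobs` stands in for scores from a real base model (assumption).
docs = [
    {"id": "hard", "logprobs": [-3.0, -2.5]},
    {"id": "easy", "logprobs": [-0.2, -0.1]},
    {"id": "mid", "logprobs": [-1.0, -1.2]},
]
ordered = curriculum_order(docs, score=lambda d: d["logprobs"])

normal_docs = [f"doc{i}" for i in range(8)]
synthetic_qa = [f"syn{i}" for i in range(5)]
mixed = mix_synthetic(normal_docs, synthetic_qa, ratio=0.2)
```

Scoring with the model being adapted is one natural choice, since "easy" then means "already well predicted"; a token-weighted ratio would be a closer match to how the summary reports budgets.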
Optimization Features
Token Efficiency
- CPT completed with ≈100B tokens
- synthetic data accounts for 1.5B tokens of the released corpus
Infra Optimization
- DeepSpeed ZeRO Stage 2
- HuggingFace Transformers
System Optimization
- FlashAttention
- gradient checkpointing
- BFloat16 mixed precision
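The infrastructure and system settings listed above could be expressed as a DeepSpeed configuration along these lines. This is a sketch only: the summary does not give the authors' actual configuration, so every value other than ZeRO stage 2 and BFloat16 is a placeholder assumption.

```python
import json

# Minimal DeepSpeed config covering the settings named above.
# The "auto" values are placeholders that the Hugging Face Trainer
# integration fills in from its own arguments (an assumption about the
# launcher; the summary does not state batch sizes).
ds_config = {
    "zero_optimization": {"stage": 2},       # ZeRO Stage 2
    "bf16": {"enabled": True},               # BFloat16 mixed precision
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)

# The remaining two items are set on the Transformers side, not in this file:
#   - gradient checkpointing: TrainingArguments(gradient_checkpointing=True)
#   - FlashAttention: from_pretrained(..., attn_implementation="flash_attention_2")
```

Saving `config_text` to a JSON file and passing its path through the `deepspeed` argument of `TrainingArguments` is the usual way to wire the two libraries together.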
Training Optimization
- two-stage CPT (bilingual adaptation → synthetic enhancement)
- topic-based data mixture (topic classifiers + dynamic weights)
- PPL-based data curriculum (easy→hard ordering)
- use of a small surrogate model to find settings before scaling
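To make the topic-based mixture item more concrete, here is a rough sketch of sampling by topic with weights that are adjusted during training. The update rule is an assumption chosen for illustration (topics whose perplexity rose get more weight); the summary only says "topic classifiers + dynamic weights" and does not specify the rule, and all names below are hypothetical.

```python
import math
import random
from typing import Dict, List


def update_topic_weights(
    weights: Dict[str, float],
    ppl_before: Dict[str, float],
    ppl_after: Dict[str, float],
    step_size: float = 1.0,
) -> Dict[str, float]:
    """Hypothetical rule: scale each topic's weight by exp(step * relative
    PPL change), so topics that got worse are sampled more, then renormalize."""
    raw = {}
    for topic, w in weights.items():
        rel_change = (ppl_after[topic] - ppl_before[topic]) / ppl_before[topic]
        raw[topic] = w * math.exp(step_size * rel_change)
    total = sum(raw.values())
    return {topic: value / total for topic, value in raw.items()}


def sample_batch(
    pools: Dict[str, List[str]],
    weights: Dict[str, float],
    batch_size: int,
    rng: random.Random,
) -> List[str]:
    """Draw a batch by first picking topics according to the current weights,
    then picking one document uniformly from each chosen topic pool."""
    topics = list(weights)
    chosen = rng.choices(
        topics, weights=[weights[t] for t in topics], k=batch_size
    )
    return [rng.choice(pools[t]) for t in chosen]


# Toy example: topic labels would come from a topic classifier in practice.
weights = {"math": 1 / 3, "law": 1 / 3, "bio": 1 / 3}
ppl_before = {"math": 10.0, "law": 10.0, "bio": 10.0}
ppl_after = {"math": 12.0, "law": 8.0, "bio": 10.0}
new_weights = update_topic_weights(weights, ppl_before, ppl_after)

pools = {"math": ["m1", "m2"], "law": ["l1"], "bio": ["b1", "b2"]}
batch = sample_batch(pools, new_weights, batch_size=4, rng=random.Random(0))
```

In this toy run the math topic, whose perplexity increased, ends up with the largest weight and the law topic with the smallest; a different signal (for example held-out benchmark scores) could drive the same mechanism.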
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Slight underperformance on some English benchmarks (MMLU) reported; CPT can harm original skills if data mix is inappropriate.
- Synthetic data quality matters: heavy corruption (>50%) degrades performance.
- Topic separation (training by discipline blocks) did not help and sometimes hurt performance.
When Not To Use
- When you require strict SOTA performance on original English-only benchmarks without any domain drift.
- If you cannot guarantee reasonable quality of generated synthetic QA (no validation pipeline).
- When you lack the compute to run a ≈100B-token CPT or an equivalent scaled workflow.
Failure Modes
- Catastrophic forgetting if bilingual/synthetic ratios are misbalanced.
- Performance drop from low-quality synthetic data (garbled or wrong answers).
- Overfitting to QA format, hurting diverse downstream formats.
Core Entities
Models
- Llama-3 (8B)
- Llama-3-SynE
- TinyLlama (1.1B)
- Mistral-7B-Instruct-v0.3
- Magicoder-S-DS-6.7B
- GPT-4
Metrics
- Accuracy
- perplexity (PPL)
Datasets
- Dolma CC subsets
- C4
- LeetCode
- Yulan-3 corpus (reference)
- WebInstruct
- Cosmopedia
Benchmarks
- C-Eval
- CMMLU
- MMLU
- MATH
- GSM8K
- ASDiv
- MAWPS
- SAT-Math
- HumanEval
- MBPP
- SciEval
- SciQ
- GaoKao
- ARC
- AQUA-RAT
Context Entities
Models
- DCLM-7B
- Mistral-7B-v0.3
- MAmmoTH2-8B
- Galactica-6.7B
- Llama-3-Chinese-8B
Metrics
- Accuracy
Datasets
- Dolma
- C4
- LeetCode
- WebInstruct
- Cosmopedia
Benchmarks
- MMLU
- CMMLU
- C-Eval
- SciEval

