Use small synthetic QA datasets and a PPL curriculum to boost Chinese and scientific reasoning in Llama‑3 with ~100B CPT tokens

Overview

Decision Snapshot: Ready For Pilot

The paper provides clear, reproducible steps and concrete gains on standard benchmarks, but results come from a single backbone (Llama‑3 8B) and surrogate tuning; expect similar but not identical outcomes on other models.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 8/8

Findings with evidence refs: 8/8

Results with explicit delta: 5/7

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Jie Chen, Zhipeng Chen, Jiapeng Wang, Kun Zhou, Yutao Zhu, Jinhao Jiang, Yingqian Min, Wayne Xin Zhao, Zhicheng Dou, Jiaxin Mao, Yankai Lin, Ruihua Song, Jun Xu, Xu Chen, Rui Yan, Zhewei Wei, Di Hu, Wenbing Huang, Ji-Rong Wen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can adapt a large English-centric LLM to Chinese and science tasks with modest additional pretraining (~100B tokens) and targeted synthetic QA, improving domain value without a full retrain.

Who Should Care

Product Manager, CTO, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

This paper reports a practical recipe for continual pre-training (CPT) of Llama‑3 (8B) to improve Chinese language ability and multidisciplinary scientific reasoning. The approach uses two CPT stages (bilingual adaptation and synthetic enhancement), topic-based data mixture, a perplexity (PPL) easy→hard curriculum, and large-scale synthetic QA data (1.5B tokens). On evaluated benchmarks, the continually pretrained model Llama‑3‑SynE improves Chinese and science scores substantially while largely retaining original skills, using about 100B CPT tokens. The team releases data, checkpoints, and code.

Problem Statement

LLMs trained mainly on English data (e.g., Llama‑3) underperform on Chinese tasks and some scientific reasoning. Continual pre‑training can adapt models but risks catastrophic forgetting. The practical question is how to design the CPT data, mixture, and curriculum, and whether synthetic scientific QA helps, so that the model gains Chinese and scientific skills without losing original abilities under a limited token budget.

Main Contribution

A complete CPT pipeline for Llama‑3 (8B): bilingual adaptation followed by synthetic enhancement, with released data and code.

Design and validation of two data strategies: topic‑level mixture and a PPL (perplexity) based easy→hard curriculum for bilingual adaptation.

Key Findings

C-Eval (Chinese) improved by 8.81 points after CPT.

Numbers: C‑Eval: 49.43 → 58.24 (+8.81)

Practical Use: If you continually pre-train Llama‑3 with the paper's recipe, expect large Chinese gains; good for Chinese-facing products.

Evidence Ref: Table 5

CMMLU (Chinese multi‑task) improved by 6.31 points after CPT.

Numbers: CMMLU: 51.03 → 57.34 (+6.31)

Practical Use: Multidisciplinary Chinese performance improves across topics, useful for multilingual deployments.

Evidence Ref: Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
C-Eval (few-shot, Chinese)	58.24	Llama-3 (8B) = 49.43	+8.81	C-Eval	Table 5 reports few-shot scores	Table 5
CMMLU (few-shot, Chinese multi-task)	57.34	Llama-3 (8B) = 51.03	+6.31	CMMLU	Table 5 reports few-shot scores	Table 5
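
The deltas in the table can be re-derived from the baseline and post-CPT scores. The short check below is only a sanity sketch; the dictionary layout and variable names are illustrative, and the numbers are copied from the rows above.

```python
# Scores copied from the results table: (Llama-3 8B baseline, after CPT).
results = {
    "C-Eval": (49.43, 58.24),
    "CMMLU": (51.03, 57.34),
}

# Absolute gain in points, rounded to the two decimals the table reports.
deltas = {name: round(after - before, 2) for name, (before, after) in results.items()}

for name, gain in deltas.items():
    print(f"{name}: +{gain:.2f} points")
```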

What To Try In 7 Days

Run surrogate CPT on a small model (TinyLlama) with 4B normal + 1B synthetic tokens to test gains quickly.

Generate a 100k–1M synthetic QA seed for your domain and mix it ~20% into CPT data.

Implement a PPL-based curriculum: sort new-domain instances from low→high PPL and continue pre-training on them in that order for a few billion tokens.
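
The PPL curriculum step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the perplexity scorer here is a toy add-one-smoothed unigram model over whitespace tokens, standing in for whatever language model you would actually score with, and the names `build_unigram_scorer` and `easy_to_hard` are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def build_unigram_scorer(reference_corpus: Sequence[str]) -> Callable[[str], float]:
    """Return a toy perplexity function built from unigram counts.

    Assumption: this stands in for a real LM scorer. Add-one smoothing
    keeps unseen tokens from producing infinite perplexity.
    """
    counts = Counter(tok for doc in reference_corpus for tok in doc.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen tokens

    def perplexity(text: str) -> float:
        tokens = text.split()
        if not tokens:
            return float("inf")  # empty documents sort last
        log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
        return math.exp(-log_prob / len(tokens))

    return perplexity


def easy_to_hard(docs: Sequence[str], ppl_fn: Callable[[str], float]) -> List[str]:
    """Order documents from low to high perplexity (easy first)."""
    return sorted(docs, key=ppl_fn)


reference = ["the cat sat on the mat", "the dog sat on the rug"]
scorer = build_unigram_scorer(reference)
new_domain_docs = ["quantum chromodynamics lattice", "the cat sat"]
ordered = easy_to_hard(new_domain_docs, scorer)
```

In practice the scorer would be the model being adapted (or a surrogate), and the sorted stream would be fed to the trainer without reshuffling so that the easy→hard order is preserved.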

Optimization Features

Token Efficiency

CPT completed with ≈100B tokens

Synthetic tokens comprise 1.5B of the released corpus

Infra Optimization

DeepSpeed ZeRO Stage 2, HuggingFace Transformers

System Optimization

FlashAttention, gradient checkpointing, BFloat16 mixed precision
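
A DeepSpeed configuration matching the listed settings (ZeRO Stage 2, BFloat16) might look roughly like the sketch below. The batch-size entries are assumptions: the summary does not give those numbers, so they are set to "auto", which the HuggingFace Trainer integration resolves from its own arguments. The function name `build_ds_config` is hypothetical.

```python
import json
from typing import Any, Dict


def build_ds_config() -> Dict[str, Any]:
    """Build a minimal DeepSpeed config dict for ZeRO Stage 2 with bf16.

    Assumption: batch sizes are left as "auto" because the source does
    not state them; replace with concrete integers outside the HF Trainer.
    """
    return {
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 2,  # shard optimizer states and gradients
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
    }


def write_ds_config(path: str) -> None:
    """Serialize the config to a JSON file that a launcher can point to."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(build_ds_config(), fh, indent=2)
```

With HuggingFace Transformers, a file like this is typically passed through `TrainingArguments(deepspeed=..., bf16=True, gradient_checkpointing=True)`, and FlashAttention is usually requested with `attn_implementation="flash_attention_2"` in `from_pretrained`. Whether the authors used exactly these arguments is not stated here.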

Training Optimization

Two-stage CPT (bilingual adaptation → synthetic enhancement)

Topic-based data mixture (topic classifiers + dynamic weights)

PPL-based data curriculum (easy→hard ordering)

Use of a small surrogate model to find settings before scaling
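
To make the idea of a topic-based mixture with dynamic weights concrete, here is a small sketch. The update rule is an assumption for illustration only (upweight topics whose perplexity rose since the last check, downweight those whose perplexity fell); the summary does not spell out the paper's actual rule, and the names `update_topic_weights` and `token_quota` are hypothetical.

```python
from typing import Dict


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {topic: w / total for topic, w in weights.items()}


def update_topic_weights(
    weights: Dict[str, float],
    ppl_change: Dict[str, float],
    step: float = 0.5,
) -> Dict[str, float]:
    """Hypothetical rule: a positive relative PPL change raises a topic's weight.

    ppl_change holds relative changes, e.g. 0.2 means PPL rose by 20%.
    A small floor keeps every topic in the mixture.
    """
    adjusted = {
        topic: max(w * (1.0 + step * ppl_change.get(topic, 0.0)), 1e-6)
        for topic, w in weights.items()
    }
    return normalize(adjusted)


def token_quota(weights: Dict[str, float], budget_tokens: int) -> Dict[str, int]:
    """Split a token budget across topics in proportion to their weights."""
    return {t: int(round(w * budget_tokens)) for t, w in normalize(weights).items()}


weights = {"math": 1.0, "law": 1.0}
new_weights = update_topic_weights(weights, {"math": 0.2, "law": -0.2})
quotas = token_quota(new_weights, 1000)
```

The topic labels themselves would come from a topic classifier run over the corpus; that step is outside this sketch.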

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/RUC-GSAI/Llama-3-SynE

Data URLs

https://github.com/RUC-GSAI/Llama-3-SynE

Risks & Boundaries

Limitations

Slight underperformance on some English benchmarks (MMLU) reported; CPT can harm original skills if data mix is inappropriate.

Synthetic data quality matters: heavy corruption (>50%) degrades performance.
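
One way to probe sensitivity to synthetic data quality on your own data is to corrupt a controlled fraction of QA pairs and compare downstream scores. The sketch below is an assumed setup, not the paper's procedure: it swaps answers among a randomly chosen subset so that the selected questions receive another item's answer. The function name `corrupt_answers` is hypothetical.

```python
import random
from typing import List, Tuple

QAPair = Tuple[str, str]  # (question, answer)


def corrupt_answers(pairs: List[QAPair], rate: float, seed: int = 0) -> List[QAPair]:
    """Return a copy of pairs where roughly `rate` of them get a wrong answer.

    Chosen items take the answer of the next chosen item (cyclic shift),
    so questions stay intact and only answers are mismatched. With fewer
    than two chosen items there is nothing to swap and the copy is unchanged.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)
    k = round(rate * len(pairs))
    if k < 2:
        return list(pairs)
    chosen = rng.sample(range(len(pairs)), k)
    corrupted = list(pairs)
    for i, idx in enumerate(chosen):
        donor = chosen[(i + 1) % k]
        corrupted[idx] = (pairs[idx][0], pairs[donor][1])
    return corrupted


clean = [(f"q{i}", f"a{i}") for i in range(10)]
noisy = corrupt_answers(clean, rate=0.5, seed=1)
num_changed = sum(1 for before, after in zip(clean, noisy) if before != after)
```

Running a small surrogate CPT at several rates (for example 0%, 25%, 50%, 75%) gives a rough picture of how much noise your pipeline can tolerate before validation is needed.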

When Not To Use

When you require strict SOTA performance on original English-only benchmarks without any domain drift.

If you cannot guarantee reasonable quality of generated synthetic QA (no validation pipeline).

Failure Modes

Catastrophic forgetting if bilingual/synthetic ratios are misbalanced.

Performance drop from low-quality synthetic data (garbled or wrong answers).

Core Entities

Models

Llama-3 (8B), Llama-3-SynE, TinyLlama (1.1B), Mistral-7B-Instruct-v0.3, Magicoder-S-DS-6.7B, GPT-4

Metrics

Accuracy, perplexity (PPL)

Datasets

Dolma CC subsets, C4, LeetCode, Yulan-3 corpus (reference), WebInstruct, Cosmopedia

Benchmarks

C-Eval, CMMLU, MMLU, MATH, GSM8K, ASDiv, MAWPS, SAT-Math, HumanEval, MBPP, SciEval, SciQ, GaoKao, ARC, AQUA-RAT

Context Entities

Models

DCLM-7B, Mistral-7B-v0.3, MAmmoTH2-8B, Galactica-6.7B, Llama-3-Chinese-8B

Metrics

Accuracy

Datasets

Dolma, C4, LeetCode, WebInstruct, Cosmopedia

Benchmarks

MMLU, CMMLU, C-Eval, SciEval
