Use small synthetic QA datasets and a PPL curriculum to boost Chinese and scientific reasoning in Llama‑3 with ~100B CPT tokens

July 26, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper provides clear, reproducible steps and concrete gains on standard benchmarks, but results come from a single backbone (Llama‑3 8B), with settings tuned on a small surrogate model (TinyLlama); expect similar but not identical outcomes on other models.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 8/8

Findings with evidence refs: 8/8

Results with explicit delta: 5/7

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Jie Chen, Zhipeng Chen, Jiapeng Wang, Kun Zhou, Yutao Zhu, Jinhao Jiang, Yingqian Min, Wayne Xin Zhao, Zhicheng Dou, Jiaxin Mao, Yankai Lin, Ruihua Song, Jun Xu, Xu Chen, Rui Yan, Zhewei Wei, Di Hu, Wenbing Huang, Ji-Rong Wen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can adapt a large English-centric LLM to Chinese and science tasks with modest additional pretraining (~100B tokens) and targeted synthetic QA, improving domain value without a full retrain.

Who Should Care

Summary TLDR

This paper reports a practical recipe for continual pre-training (CPT) of Llama‑3 (8B) to improve Chinese language ability and multidisciplinary scientific reasoning. The approach uses two CPT stages (bilingual adaptation and synthetic enhancement), topic-based data mixture, a perplexity (PPL) easy→hard curriculum, and large-scale synthetic QA data (1.5B tokens). On evaluated benchmarks, the continually pretrained model Llama‑3‑SynE improves Chinese and science scores substantially while largely retaining original skills, using about 100B CPT tokens. The team releases data, checkpoints, and code.

Problem Statement

LLMs trained mainly on English data (e.g., Llama‑3) underperform on Chinese tasks and some scientific reasoning. Continual pre‑training can adapt such models but risks catastrophic forgetting. The practical question is how to design the CPT data, mixture, and curriculum, and whether synthetic scientific QA helps, so that the model gains Chinese and scientific skills without losing its original abilities under a limited token budget.

Main Contribution

A complete CPT pipeline for Llama‑3 (8B): bilingual adaptation followed by synthetic enhancement, with released data and code.

Design and validation of two data strategies: topic‑level mixture and a PPL (perplexity) based easy→hard curriculum for bilingual adaptation.
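The topic-level mixture can be pictured as a set of per-topic sampling weights that split a token budget and are adjusted during training. The sketch below is illustrative only: the topic names, the `reweight` rule (upweight topics whose held-out perplexity rose), and the `step` parameter are assumptions for this example, not the update rule used in the paper.

```python
from typing import Dict


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative topic weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {topic: w / total for topic, w in weights.items()}


def allocate_tokens(weights: Dict[str, float], budget_tokens: int) -> Dict[str, int]:
    """Split a token budget across topics in proportion to their weights."""
    probs = normalize(weights)
    return {topic: int(budget_tokens * p) for topic, p in probs.items()}


def reweight(weights: Dict[str, float],
             ppl_change: Dict[str, float],
             step: float = 0.5) -> Dict[str, float]:
    """Hypothetical dynamic rule: upweight topics whose held-out PPL increased.

    ppl_change[topic] > 0 means the model got worse on that topic since the
    last check; topics that did not get worse keep their relative weight.
    """
    raised = {
        topic: w * (1.0 + step * max(0.0, ppl_change.get(topic, 0.0)))
        for topic, w in weights.items()
    }
    return normalize(raised)


if __name__ == "__main__":
    weights = {"math": 1.0, "code": 1.0, "law": 2.0}  # hypothetical topics
    print(allocate_tokens(weights, 1_000))
    print(reweight(weights, {"math": 1.0}))
```

In practice the topic label for each document would come from a topic classifier, and the weights would be refreshed between training segments.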

Key Findings

C-Eval (Chinese) improved by 8.81 points after CPT.

Numbers: C‑Eval: 49.43 → 58.24 (+8.81)

Practical Use: If you CPT Llama‑3 with the paper's recipe, expect large Chinese gains; good for Chinese-facing products.

Evidence Ref: Table 5

CMMLU (Chinese multi‑task) improved by 6.31 points after CPT.

Numbers: CMMLU: 51.03 → 57.34 (+6.31)

Practical Use: Chinese performance improves across disciplines, which is useful for multilingual deployments.

Evidence Ref: Table 5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
C-Eval (few-shot, Chinese) | 58.24 | Llama-3 (8B) = 49.43 | +8.81 | C-Eval | Table 5 reports few-shot scores | Table 5
CMMLU (few-shot, Chinese multi-task) | 57.34 | Llama-3 (8B) = 51.03 | +6.31 | CMMLU | Table 5 reports few-shot scores | Table 5

What To Try In 7 Days

Run surrogate CPT on a small model (TinyLlama) with 4B normal + 1B synthetic tokens to test gains quickly.

Generate a 100k–1M synthetic QA seed for your domain and mix it ~20% into CPT data.

Implement a PPL-based curriculum: sort new-domain instances from low→high PPL, then continue pre-training on them in that order for a few billion tokens.
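The curriculum and mixing steps above can be sketched in a few lines. This is a minimal example under stated assumptions: per-token natural-log probabilities from the base model are taken as already computed, and the document IDs and function names are hypothetical. It also checks that the surrogate setup (4B normal + 1B synthetic tokens) corresponds to the ~20% synthetic share mentioned in the list.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """PPL = exp(-mean of per-token natural-log probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def easy_to_hard(docs: List[Tuple[str, Sequence[float]]]) -> List[str]:
    """Return document IDs ordered from lowest to highest perplexity."""
    scored = [(perplexity(logprobs), doc_id) for doc_id, logprobs in docs]
    scored.sort()  # ascending PPL; ties broken by document ID
    return [doc_id for _, doc_id in scored]


def synthetic_share(normal_tokens: float, synthetic_tokens: float) -> float:
    """Fraction of the training mix that is synthetic data."""
    return synthetic_tokens / (normal_tokens + synthetic_tokens)


if __name__ == "__main__":
    # Hypothetical documents with made-up log-probabilities.
    docs = [
        ("hard", [-3.0, -3.0]),
        ("easy", [-0.1, -0.2]),
        ("mid", [-1.0, -1.0]),
    ]
    print(easy_to_hard(docs))
    print(synthetic_share(4e9, 1e9))
```

Scoring with the model being adapted keeps "easy" and "hard" relative to what that model already knows; whether to re-score as training progresses is a design choice this sketch leaves open.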

Optimization Features

Token Efficiency
CPT completed with ≈100B tokens; synthetic tokens comprise 1.5B of released corpus
Infra Optimization
DeepSpeed ZeRO Stage 2; HuggingFace Transformers
System Optimization
FlashAttention; gradient checkpointing; BFloat16 mixed precision
Training Optimization
two-stage CPT (bilingual adaptation → synthetic enhancement); topic-based data mixture (topic classifiers + dynamic weights); PPL-based data curriculum (easy→hard ordering); use of a small surrogate model to find settings before scaling
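The infrastructure items above map onto a small amount of configuration. The sketch below shows one plausible way to express ZeRO Stage 2, BFloat16, and gradient checkpointing when training through the HuggingFace Trainer's DeepSpeed integration. The file name and the `"auto"` batch settings are assumptions for illustration; the source does not state the actual batch sizes or config files used.

```python
import json

# DeepSpeed config: ZeRO Stage 2 with BFloat16 mixed precision.
# "auto" values are resolved by the HuggingFace Trainer integration;
# they are placeholders here, not settings reported in the paper.
ds_config = {
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# Keyword arguments one might pass to transformers.TrainingArguments.
# FlashAttention is selected separately at model load time (for example,
# attn_implementation="flash_attention_2" in from_pretrained on recent
# Transformers versions), so it does not appear here.
trainer_kwargs = {
    "bf16": True,
    "gradient_checkpointing": True,
    "deepspeed": "ds_config.json",  # hypothetical path
}


def write_config(path: str = "ds_config.json") -> None:
    """Write the DeepSpeed config to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(ds_config, f, indent=2)


if __name__ == "__main__":
    print(json.dumps(ds_config, indent=2))
```

ZeRO Stage 2 shards optimizer states and gradients across GPUs while keeping full parameters on each device, which is typically enough for an 8B model on multi-GPU nodes.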

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Slight underperformance on some English benchmarks (MMLU) reported; CPT can harm original skills if data mix is inappropriate.

Synthetic data quality matters: heavy corruption (>50%) degrades performance.

When Not To Use

When you require strict SOTA performance on original English-only benchmarks without any domain drift.

If you cannot guarantee reasonable quality of generated synthetic QA (no validation pipeline).

Failure Modes

Catastrophic forgetting if bilingual/synthetic ratios are misbalanced.

Performance drop from low-quality synthetic data (garbled or wrong answers).

Core Entities

Models

Llama-3 (8B), Llama-3-SynE, TinyLlama (1.1B), Mistral-7B-Instruct-v0.3, Magicoder-S-DS-6.7B, GPT-4

Metrics

Accuracy, perplexity (PPL)

Datasets

Dolma CC subsets, C4, LeetCode, Yulan-3 corpus (reference), WebInstruct, Cosmopedia

Benchmarks

C-Eval, CMMLU, MMLU, MATH, GSM8K, ASDiv, MAWPS, SAT-Math, HumanEval, MBPP, SciEval, SciQ, GaoKao, ARC, AQUA-RAT

Context Entities

Models

DCLM-7B, Mistral-7B-v0.3, MAmmoTH2-8B, Galactica-6.7B, Llama-3-Chinese-8B

Metrics

Accuracy

Datasets

Dolma, C4, LeetCode, WebInstruct, Cosmopedia

Benchmarks

MMLU, CMMLU, C-Eval, SciEval