Use small synthetic QA datasets and a PPL curriculum to boost Chinese and scientific reasoning in Llama‑3 with ~100B CPT tokens

July 26, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

2

Authors

Jie Chen, Zhipeng Chen, Jiapeng Wang, Kun Zhou, Yutao Zhu, Jinhao Jiang, Yingqian Min, Wayne Xin Zhao, Zhicheng Dou, Jiaxin Mao, Yankai Lin, Ruihua Song, Jun Xu, Xu Chen, Rui Yan, Zhewei Wei, Di Hu, Wenbing Huang, Ji-Rong Wen

Links

Abstract / PDF

Why It Matters For Business

You can adapt a large English-centric LLM to Chinese and science tasks with modest additional pretraining (~100B tokens) and targeted synthetic QA, adding domain value without a full retrain.

Summary TLDR

This paper reports a practical recipe for continual pre-training (CPT) of Llama‑3 (8B) to improve Chinese language ability and multidisciplinary scientific reasoning. The approach uses two CPT stages (bilingual adaptation and synthetic enhancement), topic-based data mixture, a perplexity (PPL) easy→hard curriculum, and large-scale synthetic QA data (1.5B tokens). On evaluated benchmarks, the continually pretrained model Llama‑3‑SynE improves Chinese and science scores substantially while largely retaining original skills, using about 100B CPT tokens. The team releases data, checkpoints, and code.

Problem Statement

Large LLMs trained mainly on English data (e.g., Llama‑3) underperform on Chinese tasks and some scientific reasoning. Continual pre‑training can adapt models but risks catastrophic forgetting. The practical question: how to design CPT data, mixture, and curriculum — and whether synthetic scientific QA helps — to add Chinese and scientific skills without losing original abilities under a limited token budget.

Main Contribution

A complete CPT pipeline for Llama‑3 (8B): bilingual adaptation followed by synthetic enhancement, with released data and code.

Design and validation of two data strategies: topic‑level mixture and a PPL (perplexity) based easy→hard curriculum for bilingual adaptation.

Large‑scale synthetic QA generation across nine scientific fields plus code QA; shows synthetic QA markedly improves scientific reasoning.

Surrogate tuning on TinyLlama (1.1B) to explore hyperparameters, then transfer the best settings to Llama‑3 (8B).

Key Findings

C-Eval (Chinese) improved by 8.81 points after CPT.

Numbers: C‑Eval: 49.43 → 58.24 (+8.81)

CMMLU (Chinese multi‑task) improved by 6.31 points after CPT.

Numbers: CMMLU: 51.03 → 57.34 (+6.31)

MATH (math reasoning) improved by 12.00 points after CPT.

Numbers: MATH: 16.20 → 28.20 (+12.00)

SciEval (scientific reasoning average) improved by 4.13 points.

Numbers: SciEval Avg: 65.47 → 69.60 (+4.13)

Total CPT token budget ≈ 100B tokens; synthetic data = 1.5B tokens.

Numbers: Total CPT corpus: 100.00B tokens; synthetic: 1.50B tokens

On the TinyLlama surrogate, 4B normal tokens plus 1B synthetic tokens outperformed 5B normal tokens.

Numbers: TinyLlama avg major/science: w/ 5B (1B Syn.) > w/ 5B (1B Norm.)

Synthetic noise tolerance: low corruption (~30%) causes little harm; high corruption (>50%) degrades performance.

Numbers: Corruption levels {0.0..0.7}: performance drops sharply after >0.5
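
A noise-tolerance check like the one above can be set up with a small harness. This is a minimal sketch, assuming that "corruption" means replacing a QA pair's answer with the answer from a different pair; this summary does not describe the paper's actual corruption procedure, and `corrupt_answers` is a hypothetical helper.

```python
import random

def corrupt_answers(pairs, rate, seed=0):
    """Return a copy of (question, answer) pairs with round(rate * n) answers replaced.

    Assumption (not from the paper): a corrupted pair keeps its question
    but receives the answer of a different, randomly chosen pair.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    n = len(pairs)
    out = list(pairs)
    if n < 2:
        return out  # nothing to swap with
    rng = random.Random(seed)
    for i in rng.sample(range(n), round(rate * n)):
        # Draw an index j != i uniformly from the remaining n - 1 pairs.
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1
        out[i] = (pairs[i][0], pairs[j][1])
    return out

# Build one noisy copy of the synthetic set per corruption level under test.
levels = [0.0, 0.1, 0.3, 0.5, 0.7]
```

Training one surrogate run per level and comparing benchmark averages would show where quality starts to matter for a given domain.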

Best synthetic ratio around 20% (TinyLlama), and easy→hard (low→high PPL) curriculum helps.

Numbers: Synthetic ratios tested {10,20,30,40%}: 20% gave best average; LH curriculum outperformed HL and random
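
The 20% ratio corresponds to the surrogate setup reported above (1B synthetic tokens in a 5B total). The arithmetic for other budgets can be sketched as below; `synthetic_tokens_needed` is a hypothetical helper, and whether 20% stays optimal at other scales is not established here.

```python
def synthetic_tokens_needed(normal_tokens, synthetic_ratio):
    """Synthetic tokens s such that s / (normal_tokens + s) == synthetic_ratio."""
    if not 0.0 <= synthetic_ratio < 1.0:
        raise ValueError("synthetic_ratio must be in [0, 1)")
    return normal_tokens * synthetic_ratio / (1.0 - synthetic_ratio)

# Token counts for each ratio tested on the surrogate, given 4B normal tokens.
for ratio in (0.10, 0.20, 0.30, 0.40):
    needed = synthetic_tokens_needed(4e9, ratio)
    print(f"{ratio:.0%} synthetic -> {needed / 1e9:.2f}B synthetic tokens")
```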

Results

C-Eval (few-shot, Chinese)

Value: 58.24

Baseline: Llama-3 (8B) = 49.43

CMMLU (few-shot, Chinese multi-task)

Value: 57.34

Baseline: Llama-3 (8B) = 51.03

MATH (few-shot, math reasoning)

Value: 28.20

Baseline: Llama-3 (8B) = 16.20

SciEval (avg scientific reasoning)

Value: 69.60

Baseline: Llama-3 (8B) = 65.47

HumanEval (code generation)

Value: 42.07

Baseline: Llama-3 (8B) = 36.59

Total CPT tokens used

Value: 100.0B

Synthetic data size in released corpus

Value: 1.5B tokens
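
The gains implied by the figures above can be recomputed directly from the reported values. The sketch below uses only those numbers; `gains` is an illustrative helper name.

```python
# (baseline Llama-3 8B score, Llama-3-SynE score) as listed in Results.
results = {
    "C-Eval": (49.43, 58.24),
    "CMMLU": (51.03, 57.34),
    "MATH": (16.20, 28.20),
    "SciEval": (65.47, 69.60),
    "HumanEval": (36.59, 42.07),
}

def gains(scores):
    """Absolute point gain per benchmark, rounded to two decimals."""
    return {name: round(after - base, 2) for name, (base, after) in scores.items()}

# Share of the CPT corpus that is synthetic: 1.5B out of 100B tokens.
synthetic_share = 1.5 / 100.0

for name, delta in gains(results).items():
    print(f"{name}: +{delta:.2f}")
print(f"synthetic share of CPT tokens: {synthetic_share:.1%}")
```

HumanEval, which has no entry under Key Findings, works out to a gain of 5.48 points.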

Who Should Care

What To Try In 7 Days

Run surrogate CPT on a small model (TinyLlama) with 4B normal + 1B synthetic tokens to test gains quickly.

Generate a 100k–1M synthetic QA seed for your domain and mix it ~20% into CPT data.

Implement a PPL-based curriculum: sort new-domain instances from low→high PPL and continue pre-training on them for a few billion tokens.
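
The low→high PPL ordering in the last step can be sketched as follows. This assumes each document already has per-token natural-log probabilities scored by the model being adapted; how the paper scores or buckets documents is not specified in this summary, and both function names are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-mean(log p)) over one document's per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def easy_to_hard(docs):
    """Order (text, token_logprobs) pairs from low to high perplexity."""
    return sorted(docs, key=lambda doc: perplexity(doc[1]))

# Toy example with made-up log-probabilities.
docs = [
    ("hard", [-3.0, -3.0]),
    ("easy", [-0.1, -0.2]),
    ("mid", [-1.0]),
]
ordered = [text for text, _ in easy_to_hard(docs)]
```

Reversing the sorted list gives the hard→easy (HL) ordering, and shuffling it gives the random baseline for comparison.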

Optimization Features

Token Efficiency

  • CPT completed with ≈100B tokens
  • Synthetic tokens make up 1.5B of the released corpus

Infra Optimization

  • DeepSpeed ZeRO Stage 2
  • HuggingFace Transformers

System Optimization

  • FlashAttention
  • gradient checkpointing
  • BFloat16 mixed precision
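
The infra and system settings above map onto a small DeepSpeed configuration. This is a minimal sketch, not the authors' configuration: only the ZeRO stage and BFloat16 come from this summary, and the "auto" values assume the Hugging Face Trainer integration fills them in.

```python
import json

# Only "stage": 2 and bf16 are taken from the summary; the rest are assumptions.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

def dump_config(config):
    """Serialize the config to the JSON text DeepSpeed reads from a file."""
    return json.dumps(config, indent=2)

# Gradient checkpointing and FlashAttention are set on the model side, not here.
# In Hugging Face Transformers that is typically model.gradient_checkpointing_enable()
# and attn_implementation="flash_attention_2" at load time.
config_text = dump_config(ds_config)
```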

Training Optimization

  • two-stage CPT (bilingual adaptation → synthetic enhancement)
  • topic-based data mixture (topic classifiers + dynamic weights)
  • PPL-based data curriculum (easy→hard ordering)
  • use of a small surrogate model to find settings before scaling
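
A topic-based mixture with dynamic weights could look roughly like the sketch below. The update rule is an assumption for illustration (boost topics whose loss rose relative to a baseline); this summary does not state the paper's actual weighting rule, and every name here is hypothetical.

```python
import math
import random

def normalize(weights):
    """Scale topic weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {topic: w / total for topic, w in weights.items()}

def reweight(weights, losses, baseline_losses, step=0.5):
    """Hypothetical rule: upweight topics whose loss rose relative to a baseline."""
    raised = {
        topic: w * math.exp(step * (losses[topic] - baseline_losses[topic]))
        for topic, w in weights.items()
    }
    return normalize(raised)

def sample_topics(weights, n, seed=0):
    """Draw n topic labels in proportion to the current weights."""
    rng = random.Random(seed)
    probs = normalize(weights)
    topics = list(probs)
    return rng.choices(topics, weights=[probs[t] for t in topics], k=n)

# Toy example: "math" regressed by 0.5 in loss, so its share grows.
new_weights = reweight(
    {"math": 1.0, "law": 1.0},
    losses={"math": 2.5, "law": 2.0},
    baseline_losses={"math": 2.0, "law": 2.0},
)
```

Each sampled label would then select which topic bucket the next training document is drawn from.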

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Slight underperformance on some English benchmarks (MMLU) reported; CPT can harm original skills if data mix is inappropriate.
  • Synthetic data quality matters: heavy corruption (>50%) degrades performance.
  • Topic separation (training by discipline blocks) did not help and sometimes hurt performance.

When Not To Use

  • When you require strict SOTA performance on original English-only benchmarks without any domain drift.
  • If you cannot guarantee reasonable quality of generated synthetic QA (no validation pipeline).
  • When you lack the compute to run ≈100B token CPT or an equivalent scaled workflow.

Failure Modes

  • Catastrophic forgetting if bilingual/synthetic ratios are misbalanced.
  • Performance drop from low-quality synthetic data (garbled or wrong answers).
  • Overfitting to QA format, hurting diverse downstream formats.

Core Entities

Models

  • Llama-3 (8B)
  • Llama-3-SynE
  • TinyLlama (1.1B)
  • Mistral-7B-Instruct-v0.3
  • Magicoder-S-DS-6.7B
  • GPT-4

Metrics

  • Accuracy
  • perplexity (PPL)

Datasets

  • Dolma CC subsets
  • C4
  • LeetCode
  • Yulan-3 corpus (reference)
  • WebInstruct
  • Cosmopedia

Benchmarks

  • C-Eval
  • CMMLU
  • MMLU
  • MATH
  • GSM8K
  • ASDiv
  • MAWPS
  • SAT-Math
  • HumanEval
  • MBPP
  • SciEval
  • SciQ
  • GaoKao
  • ARC
  • AQUA-RAT

Context Entities

Models

  • DCLM-7B
  • Mistral-7B-v0.3
  • MAmmoTH2-8B
  • Galactica-6.7B
  • Llama-3-Chinese-8B

Metrics

  • Accuracy

Datasets

  • Dolma
  • C4
  • LeetCode
  • WebInstruct
  • Cosmopedia

Benchmarks

  • MMLU
  • CMMLU
  • C-Eval
  • SciEval