A 2B Chinese‑centric LLM trained from scratch on roughly 840B Chinese tokens, plus an open Chinese corpus and a hard-case Chinese benchmark.

April 5, 2024 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

5

Authors

Xinrun Du, Zhouliang Yu, Songyang Gao, Ding Pan, Yuyang Cheng, Ziyang Ma, Ruibin Yuan, Xingwei Qu, Jiaheng Liu, Tianyu Zheng, Xinchen Luo, Guorui Zhou, Wenhu Chen, Ge Zhang

Why It Matters For Business

If your product targets Chinese users, pretraining on a large Chinese-majority corpus and then applying Chinese-heavy SFT can yield better cultural knowledge and instruction following than adapting an English-first model.

Summary TLDR

The authors build CT-LLM, a 2-billion-parameter transformer decoder trained from scratch on a 1.25-trillion-token mixture that is skewed toward Chinese (≈840B Chinese tokens). They release the Massive Appropriate Pretraining Chinese Corpus (MAP-CC) and a multidisciplinary Chinese hard-case benchmark (CHC-Bench), and they show that supervised fine-tuning (SFT) and direct preference optimization (DPO) improve helpfulness and safety. CT-LLM is competitive among 2B models on Chinese tasks, but it still trails larger multilingual models on many English-centric benchmarks.

Problem Statement

Most open LLMs are English‑heavy. The paper asks: can you train an LLM from scratch with Chinese as the dominant pretraining language and get a practical model for Chinese tasks? The work also seeks to provide a large, cleaned Chinese pretraining corpus and a benchmark that stresses Chinese cultural and language challenges.

Main Contribution

MAP-CC: an open Chinese-focused pretraining corpus (≈840.48B Chinese tokens) with documented filtering and deduplication.

CHC-Bench: a multidisciplinary Chinese hard-case instruction-following benchmark for cultural, classical, and exam-style problems.

CT-LLM: a 2B-parameter Chinese-centric transformer model, plus SFT and DPO alignment recipes and evaluation across Chinese and English benchmarks.

Key Findings

They pretrain on a 1.2547 trillion token corpus with a Chinese majority.

Numbers: 1,254.68B total tokens; 840.48B Chinese, 314.88B English, 99.3B code
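To make the mixture concrete, the reported counts can be turned into percentage shares. This is plain arithmetic on the figures above; the function name `mixture_shares` is made up for illustration. Note that the three components add up to 1,254.66B, which matches the reported 1,254.68B total only up to rounding.

```python
# Token counts in billions, copied from the figures reported above.
TOKENS_B = {"chinese": 840.48, "english": 314.88, "code": 99.3}

def mixture_shares(counts):
    """Return each source's share of the summed total, as a percentage."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

shares = mixture_shares(TOKENS_B)
print(shares)  # {'chinese': 67.0, 'english': 25.1, 'code': 7.9}
```

Roughly two thirds of the pretraining tokens are Chinese, a quarter English, and the rest code.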

Model scale and setup: CT-LLM is a 2B decoder-only transformer with a 4,096-token context window.

Numbers: 2B parameters; 32 layers; context length 4096; vocab 125,696

On CHC-Bench (Chinese hard-case), CT-LLM outperforms many 1–3B models but scores below the strongest 2B-class models and larger chat models.

Numbers: CHC-Bench overall score: CT-LLM 3.99 vs MiniCPM-2B 6.95 and Qwen1.5-1.8B 6.57

SFT ratio and preference tuning affect performance and safety; 2:1 Chinese:English SFT worked well and DPO improved safety scores.

Numbers: CT-LLM-SFT (ZH:EN = 2:1) best overall; safety for CT-LLM-SFT-DPO: Cvalues-MC = 0.795, Cvalues-QA = 5.61
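For readers unfamiliar with DPO, the standard per-pair objective is short enough to write out. The sketch below is the generic DPO loss, not the authors' training code; the value `beta=0.1` is an assumption, since this summary does not state the setting used for CT-LLM. Inputs are summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one preference pair: -log(sigmoid(margin)).

    beta=0.1 is an assumed default, not a value taken from the paper.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference the loss is log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one.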

Pretraining scale yields steady gains on reasoning and Chinese tasks.

Numbers: HellaSwag 33.3 -> 50.37; MMLU 26.09 -> 37.11 across pretraining checkpoints

Results

Pretraining corpus size

Value: 1,254.68B tokens (840.48B Chinese)

Model scale

Value: 2B parameters; 4k context

CHC-Bench overall score (2B models)

Value: CT-LLM = 3.99

Baseline: MiniCPM-2B = 6.95

Safety (Cvalues)

Value: CT-LLM-SFT-DPO Cvalues-MC = 0.795; Cvalues-QA = 5.61

Baseline: MiniCPM-2B Cvalues-MC = 0.851

Pretraining scale gains example

Value: HellaSwag 33.3 -> 50.37

Baseline: checkpoint at 39.9B tokens

Who Should Care

What To Try In 7 Days

Download MAP-CC samples and inspect filters to replicate Chinese data cleaning.

Fine-tune an existing 2B model with a 2:1 Chinese:English SFT mix and compare Chinese task outputs.

Run the CHC-Bench subset on your model to surface Chinese-specific failure modes.
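A minimal sketch of building the 2:1 Chinese:English SFT mix from two pools of examples. It assumes the ratio is counted in examples rather than tokens, which this summary does not specify; `mix_sft` and its parameters are hypothetical names, not part of any released recipe.

```python
import random

def mix_sft(zh_examples, en_examples, zh_per_en=2, seed=0):
    """Sample a shuffled SFT set with zh_per_en Chinese examples per English one.

    Assumption: the ratio is applied to example counts, not token counts.
    """
    # Use as many English examples as both pools can support at this ratio.
    n_en = min(len(en_examples), len(zh_examples) // zh_per_en)
    rng = random.Random(seed)
    mixed = rng.sample(zh_examples, n_en * zh_per_en) + rng.sample(en_examples, n_en)
    rng.shuffle(mixed)
    return mixed

# Toy pools standing in for real instruction data.
zh_pool = [("zh", i) for i in range(10)]
en_pool = [("en", i) for i in range(3)]
mixed = mix_sft(zh_pool, en_pool)
```

Fixing the seed keeps the mix reproducible, which helps when comparing runs at different ratios.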

Optimization Features

Token Efficiency

  • shared input-output embeddings
  • BPE tokenizer that splits numbers
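As an illustration of number splitting, the sketch below pre-tokenizes text so that every digit becomes its own piece before BPE merges are learned or applied. It assumes "splits numbers" means splitting into individual digits; the regex and the name `split_digits` are illustrative and do not reproduce the released tokenizer.

```python
import re

def split_digits(text):
    """Split text into pieces where each digit stands alone.

    Non-digit, non-space runs and whitespace runs are kept intact,
    so joining the pieces gives back the original text.
    """
    return re.findall(r"\d|[^\d\s]+|\s+", text)

pieces = split_digits("2024年 GDP 5.2%")
```

Isolating digits keeps BPE from learning arbitrary multi-digit tokens, so a number is always encoded the same way regardless of its neighbors.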

Model Optimization

  • RoPE positional embeddings
  • SwiGLU activations
  • RMSNorm
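Two of these components are small enough to show directly. The sketch below gives the textbook definitions of RMSNorm and the SwiGLU gate on plain Python lists; it is not taken from the CT-LLM code. The `gate` and `up` inputs are assumed to be the outputs of two separate linear projections, which are omitted here, and `eps=1e-6` is an assumed default.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root-mean-square, then scale by a learned per-channel weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu(gate, up):
    """SwiGLU gating: SiLU(gate) multiplied elementwise with up."""
    return [silu(g) * u for g, u in zip(gate, up)]

normed = rms_norm([3.0, 4.0], [1.0, 1.0])
```

Unlike LayerNorm, RMSNorm subtracts no mean and has no bias term, which makes it slightly cheaper per layer.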

Training Optimization

  • Large-scale Chinese-heavy pretraining (840B Chinese tokens)
  • Heuristic filtering adapted for Chinese
  • Multi-stage deduplication (Bloom filter, MinHash LSH, similar-line)
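The deduplication stages can be sketched in miniature. This is a toy under stated assumptions, not the MAP-CC pipeline: a Python set stands in for the Bloom filter, pairwise signature comparison stands in for LSH banding, the similar-line stage is left out, and the shingle size, signature length, and 0.8 threshold are all assumed values.

```python
import hashlib

def shingles(text, k=5):
    """Character k-grams; character level suits Chinese, which has no spaces between words."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(text, num_perm=64, k=5):
    """For each of num_perm salted hash functions, keep the minimum hash over all shingles."""
    grams = [s.encode("utf-8") for s in shingles(text, k)]
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "big")
            for g in grams
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(docs, threshold=0.8):
    """Drop exact duplicates first, then documents too similar to one already kept."""
    seen_exact = set()  # stand-in for a Bloom filter
    kept, kept_sigs = [], []
    for doc in docs:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest in seen_exact:
            continue
        seen_exact.add(digest)
        sig = minhash_signature(doc)
        # Pairwise check stands in for LSH banding; fine for toy inputs only.
        if any(estimated_jaccard(sig, s) >= threshold for s in kept_sigs):
            continue
        kept.append(doc)
        kept_sigs.append(sig)
    return kept

docs = [
    "今天天气很好，我们去公园散步。",
    "今天天气很好，我们去公园散步。",
    "The quick brown fox jumps over the lazy dog.",
]
unique_docs = dedup(docs)
```

At corpus scale the pairwise check is replaced by LSH so that only documents sharing a signature band are ever compared.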

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • At 2B parameters, the model cannot match larger models on many English benchmarks.
  • CHC-Bench scores show room for improvement versus top aligned models.
  • Some duplicates were intentionally kept; that can bias English-heavy signals.

When Not To Use

  • When you need top-tier English or cross-lingual SOTA performance.
  • When extreme numerical or code accuracy is required (low pass@1 on HumanEval).
  • When you must avoid any cultural bias without extensive additional alignment.

Failure Modes

  • Weaker code generation and math (low HumanEval / GSM8K compared with larger models).
  • Cultural bias reflecting Chinese web text (model clusters as 'community-focused').
  • Remaining hallucinations and limited depth on harder STEM items.

Core Entities

Models

  • CT-LLM
  • SFT
  • Qwen-7B

Metrics

  • Accuracy
  • pass@1
  • perplexity
  • GPT-4 score
  • Cvalues-QA (score)

Datasets

  • MAP-CC
  • CHC-Bench
  • CQIA
  • OL-CC
  • COIG-PC
  • OpenHermesPreferences
  • Common Crawl

Benchmarks

  • CHC-Bench
  • MMLU
  • C-Eval
  • CMMLU
  • HellaSwag
  • GSM8K
  • HumanEval
  • Cvalues

Context Entities

Models

  • Phi-2
  • Gemma-2b
  • Qwen1.5
  • TinyLlama
  • Stablelm-zephyr

Metrics

  • training tokens
  • checkpoint evaluations

Datasets

  • RefinedWeb
  • CCNet
  • Wudao
  • SkyPile

Benchmarks

  • MT-Bench
  • TriviaQA
  • SQuAD2.0
  • MBPP