Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
5
Why It Matters For Business
If your product targets Chinese users, pretraining with a large Chinese-majority corpus plus Chinese-heavy SFT yields better cultural knowledge and instruction following than adapting an English-first model.
Summary TLDR
The authors build CT-LLM, a 2-billion-parameter transformer decoder trained from scratch on a roughly 1.25 trillion token mixture that is weighted toward Chinese (≈840B Chinese tokens). They release two resources: MAP-CC (Massive Appropriate Pretraining Chinese Corpus), the pretraining corpus, and CHC-Bench, a multidisciplinary Chinese "hard case" benchmark. They also show that supervised fine-tuning (SFT) and direct preference optimization (DPO) improve helpfulness and safety. CT-LLM is competitive among 2B models on Chinese tasks, but it still trails larger multilingual models on many English-centric benchmarks.
Problem Statement
Most open LLMs are English-heavy. The paper asks: can you train an LLM from scratch with Chinese as the dominant pretraining language and get a practical model for Chinese tasks? The work also seeks to provide a large, cleaned Chinese pretraining corpus and a benchmark that stresses Chinese cultural and language challenges.
Main Contribution
MAP-CC: an open Chinese-focused pretraining corpus (≈840.48B Chinese tokens) with documented filtering and deduplication.
CHC-Bench: a multidisciplinary Chinese hard-case instruction-following benchmark for cultural, classical, and exam-style problems.
CT-LLM: a 2B-parameter Chinese-centric transformer model, plus SFT and DPO alignment recipes and evaluation across Chinese and English benchmarks.
Key Findings
They pretrain on a 1.2547 trillion token corpus with a Chinese majority.
Model scale and setup: CT-LLM is a 2B decoder transformer with long context.
On CHC-Bench (Chinese hard cases), CT-LLM outperforms many 1–3B models but scores below the best 2B models and larger chat models.
The SFT language ratio and preference tuning both affect performance and safety: a 2:1 Chinese:English SFT mix worked well, and DPO improved safety scores.
Pretraining scale yields steady gains on reasoning and Chinese tasks.
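The two token counts quoted in this summary imply the Chinese share of the corpus directly. A quick check, using only those two figures (the helper name `share` is illustrative, not from the paper):

```python
# Token counts as quoted in this summary, in billions of tokens.
TOTAL_TOKENS_B = 1254.7    # "1.2547 trillion token corpus"
CHINESE_TOKENS_B = 840.48  # "≈840.48B Chinese tokens"


def share(part: float, whole: float) -> float:
    """Return part / whole as a fraction, rejecting a non-positive total."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


chinese_share = share(CHINESE_TOKENS_B, TOTAL_TOKENS_B)
non_chinese_b = TOTAL_TOKENS_B - CHINESE_TOKENS_B

print(f"Chinese share of corpus: {chinese_share:.1%}")
print(f"Non-Chinese tokens: {non_chinese_b:.2f}B")
```

By these figures, roughly two thirds of the pretraining tokens are Chinese.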
Results
Pretraining corpus size
Model scale
CHC-Bench overall score (2B models)
Safety (Cvalues)
Pretraining scale gains example
Who Should Care
What To Try In 7 Days
Download MAP-CC samples and inspect filters to replicate Chinese data cleaning.
Fine-tune an existing 2B model with a 2:1 Chinese:English SFT mix and compare Chinese task outputs.
Run the CHC-Bench subset on your model to surface Chinese-specific failure modes.
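The 2:1 Chinese:English SFT mix in the second step can be sketched as a simple sampling routine. This is a minimal illustration under stated assumptions: the function name `mix_sft`, the record layout, and the choice to subsample the larger pool are all hypothetical, and the paper's actual data pipeline may differ.

```python
import random
from typing import List, Sequence


def mix_sft(zh: Sequence[dict], en: Sequence[dict],
            zh_per_en: int = 2, seed: int = 0) -> List[dict]:
    """Build a shuffled SFT set with `zh_per_en` Chinese examples per English one.

    Assumption: whichever pool is too large for the ratio is subsampled
    without replacement, so no example is repeated.
    """
    if zh_per_en < 1:
        raise ValueError("zh_per_en must be >= 1")
    # Largest English count for which enough Chinese examples exist.
    n_en = min(len(en), len(zh) // zh_per_en)
    n_zh = n_en * zh_per_en
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = rng.sample(list(zh), n_zh) + rng.sample(list(en), n_en)
    rng.shuffle(mixed)
    return mixed


# Toy pools standing in for real instruction data.
zh_pool = [{"lang": "zh", "id": i} for i in range(10)]
en_pool = [{"lang": "en", "id": i} for i in range(4)]
mixed = mix_sft(zh_pool, en_pool)
n_zh = sum(1 for ex in mixed if ex["lang"] == "zh")
n_en = sum(1 for ex in mixed if ex["lang"] == "en")
print(n_zh, n_en)
```

Fixing the seed makes it easier to compare runs that differ only in the ratio.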
Optimization Features
Token Efficiency
- shared input-output embeddings
- BPE tokenizer that splits numbers
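To illustrate the number-splitting bullet: one common way to get this behavior is a pre-tokenization pass that isolates every digit before BPE merges are applied. The sketch below assumes per-digit splitting; the summary does not state the exact rule CT-LLM uses, and `pre_tokenize` is a hypothetical helper rather than part of any tokenizer library.

```python
import re
from typing import List

# One digit per piece; other text and whitespace stay in contiguous runs.
_PIECES = re.compile(r"\d|[^\d\s]+|\s+")


def pre_tokenize(text: str) -> List[str]:
    """Split text so that each digit becomes its own piece.

    A BPE model trained on such pieces can never merge digits together,
    so every number is encoded as a sequence of single-digit tokens.
    """
    return _PIECES.findall(text)


pieces = pre_tokenize("价格 12345 元")
print(pieces)
```

Because the three alternatives in the pattern cover every character, joining the pieces reproduces the input exactly.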
Model Optimization
- RoPE positional embeddings
- SwiGLU activations
- RMSNorm
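For readers unfamiliar with these three components, here is a dependency-free sketch of each, operating on plain Python lists. It follows the standard textbook definitions, not CT-LLM's actual code: the RoPE base of 10000, the interleaved pairing of dimensions, and the epsilon value are common defaults assumed here, and the linear projections that feed SwiGLU are omitted.

```python
import math
from typing import List


def rms_norm(x: List[float], gain: List[float], eps: float = 1e-6) -> List[float]:
    """RMSNorm: rescale x by its root mean square, then apply a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * scale for v, g in zip(x, gain)]


def silu(v: float) -> float:
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate: List[float], up: List[float]) -> List[float]:
    """SwiGLU gating: silu(gate) * up, elementwise.

    `gate` and `up` are assumed to be outputs of two separate linear
    projections of the same input; those projections are not shown.
    """
    return [silu(g) * u for g, u in zip(gate, up)]


def rope(x: List[float], position: int, base: float = 10000.0) -> List[float]:
    """RoPE: rotate each pair (x[i], x[i+1]) by position * base**(-i/d), i even."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out: List[float] = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out


vec = [1.0, 2.0, 3.0, 4.0]
normed = rms_norm(vec, gain=[1.0] * 4)
rotated = rope(vec, position=7)
print(normed, rotated, swiglu([0.0, 1.0], [5.0, 2.0]))
```

Since RoPE only rotates pairs of coordinates, it leaves the vector's length unchanged, which is a handy sanity check for any implementation.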
Training Optimization
- Large-scale Chinese-heavy pretraining (840B Chinese tokens)
- Heuristic filtering adapted for Chinese
- Multi-stage deduplication (Bloom filter, MinHash LSH, similar-line)
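The near-duplicate stage of the deduplication bullet rests on MinHash signatures, which estimate the Jaccard similarity between two documents. The sketch below shows only that core idea under stated assumptions: character 5-gram shingles, 64 hash functions built from salted BLAKE2b, and no LSH banding or Bloom-filter stage. None of these parameters come from the paper, and all function names are illustrative.

```python
import hashlib
from typing import List, Set


def shingles(text: str, n: int = 5) -> Set[str]:
    """Character n-grams; working on characters avoids a Chinese word segmenter."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash(text: str, num_perm: int = 64, n: int = 5) -> List[int]:
    """MinHash signature: for each salted hash, keep the minimum over all shingles."""
    grams = shingles(text, n)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "今天天气很好，我们去公园散步吧。"
doc_b = "The quick brown fox jumps over the lazy dog."
same = estimated_jaccard(minhash(doc_a), minhash(doc_a))
different = estimated_jaccard(minhash(doc_a), minhash(doc_b))
print(same, different)
```

In a full pipeline, an LSH index would group signatures into bands so that only likely near-duplicates are compared, instead of checking every pair.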
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- The model has only 2B parameters, so it cannot match larger models on many English benchmarks.
- CHC-Bench scores show room for improvement versus top aligned models.
- Some duplicates were intentionally kept, which may skew English-heavy signals.
When Not To Use
- When you need top-tier English or cross-lingual SOTA performance.
- When high numerical or code accuracy is required (pass@1 on HumanEval is low).
- When you must avoid any cultural bias without extensive additional alignment.
Failure Modes
- Weaker code generation and math (low HumanEval / GSM8K compared with larger models).
- Cultural bias reflecting Chinese web text (model clusters as 'community-focused').
- Remaining hallucinations and limited depth on harder STEM items.
Core Entities
Models
- CT-LLM
- SFT
- Qwen-7B
Metrics
- Accuracy
- pass@1
- perplexity
- GPT-4 score
- Cvalues-QA (score)
Datasets
- MAP-CC
- CHC-Bench
- CQIA
- OL-CC
- COIG-PC
- OpenHermesPreferences
- Common Crawl
Benchmarks
- CHC-Bench
- MMLU
- C-Eval
- CMMLU
- HellaSwag
- GSM8K
- HumanEval
- Cvalues
Context Entities
Models
- Phi-2
- Gemma-2b
- Qwen1.5
- TinyLlama
- Stablelm-zephyr
Metrics
- training tokens
- checkpoint evaluations
Datasets
- RefinedWeb
- CCNet
- Wudao
- SkyPile
Benchmarks
- MT-Bench
- TriviaQA
- SQuAD2.0
- MBPP

