Overview
The paper provides full data and training details and shows competitive Chinese performance for a 2B model; results are solid but not state-of-the-art versus larger models, so treat this as a practical mid-size Chinese option.
Citations: 5
Evidence Strength: 0.60
Confidence: 0.75
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
If your product targets Chinese users, pretraining with a large Chinese-majority corpus plus Chinese-heavy SFT yields better cultural knowledge and instruction following than adapting an English-first model.
Summary TLDR
The authors build CT-LLM, a 2-billion-parameter transformer decoder model trained from scratch on a roughly 1.25-trillion-token mixture that is skewed toward Chinese (≈840B Chinese tokens). They release two resources: the Massive Appropriate Pretraining Chinese Corpus (MAP-CC) and CHC-Bench, a multidisciplinary Chinese hard-case benchmark. They also show that supervised fine-tuning (SFT) and direct preference optimization (DPO) improve helpfulness and safety. CT-LLM is competitive among 2B models on Chinese tasks, but it still trails larger multilingual models on many English-centric benchmarks.
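For readers unfamiliar with DPO, the sketch below shows the standard per-pair DPO objective computed from sequence log-probabilities. It is a minimal illustration, not the authors' training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an assumed value, not one taken from the paper.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference agree on both responses, the margin is zero and the loss equals log 2; the loss falls below that as the policy shifts probability toward the chosen response.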
Problem Statement
Most open LLMs are English‑heavy. The paper asks: can you train an LLM from scratch with Chinese as the dominant pretraining language and get a practical model for Chinese tasks? The work also seeks to provide a large, cleaned Chinese pretraining corpus and a benchmark that stresses Chinese cultural and language challenges.
Main Contribution
MAP-CC: an open Chinese-focused pretraining corpus (≈840.48B Chinese tokens) with documented filtering and deduplication.
CHC-Bench: a multidisciplinary Chinese hard-case instruction-following benchmark for cultural, classical, and exam-style problems.
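To make "filtering and deduplication" concrete, here is a minimal sketch of a heuristic filter followed by exact deduplication. It does not reproduce the MAP-CC pipeline: the helper names, the character-ratio heuristic, and both thresholds are assumptions chosen only for illustration.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization and collapse whitespace so trivial variants hash alike."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def chinese_ratio(text: str) -> float:
    """Fraction of non-space characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u4e00" <= c <= "\u9fff" for c in chars) / len(chars)


def clean_corpus(docs, min_chars=10, min_chinese_ratio=0.5):
    """Yield documents that pass the filters and are not exact duplicates.

    min_chars and min_chinese_ratio are hypothetical thresholds.
    """
    seen = set()
    for doc in docs:
        text = normalize(doc)
        # Heuristic filters: drop very short or mostly non-Chinese documents.
        if len(text) < min_chars or chinese_ratio(text) < min_chinese_ratio:
            continue
        # Exact deduplication on a hash of the normalized text.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Exact hashing only catches identical documents after normalization; near-duplicate detection would need a similarity-based method on top of this.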
Key Findings
They pretrain on a 1,254.68B-token (≈1.25 trillion) corpus in which Chinese is the majority language (840.48B tokens).
Model scale and setup: CT-LLM is a 2B-parameter decoder transformer with a 4k context window.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Pretraining corpus size | 1,254.68B tokens (840.48B Chinese) | — | — | MAP-CC | Section 3.1: dataset composition and token counts | 3.1 |
| Model scale | 2B parameters; 4k context | — | — | CT-LLM architecture | Section 3.2 and Table 1 | 3.2 |
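
The token counts in the first row imply the Chinese share of the pretraining mix. A quick check, using only the two figures reported above (the variable names are illustrative):

```python
# Token counts in billions, taken from the results table above.
TOTAL_TOKENS_B = 1254.68
CHINESE_TOKENS_B = 840.48

chinese_share = CHINESE_TOKENS_B / TOTAL_TOKENS_B
non_chinese_b = TOTAL_TOKENS_B - CHINESE_TOKENS_B

print(f"Chinese share: {chinese_share:.1%}")
print(f"Non-Chinese tokens: {non_chinese_b:.2f}B")
```

So about two thirds of the corpus is Chinese, and the remaining roughly 414B tokens cover everything else in the mixture.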
What To Try In 7 Days
Download MAP-CC samples and inspect filters to replicate Chinese data cleaning.
Fine-tune an existing 2B model with a 2:1 Chinese:English SFT mix and compare Chinese task outputs.
Run the CHC-Bench subset on your model to surface Chinese-specific failure modes.
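For the 2:1 Chinese:English SFT experiment, one simple way to build the mix is to sample from two example pools at a fixed ratio. This is a sketch under stated assumptions: `mix_sft` and its parameters are hypothetical names, and it samples without replacement rather than following any procedure described in the paper.

```python
import random


def mix_sft(chinese, english, ratio=(2, 1), seed=0):
    """Build a shuffled SFT set with a fixed Chinese:English ratio.

    Samples without replacement and uses as many examples as the
    scarcer pool allows at the requested ratio.
    """
    zh_parts, en_parts = ratio
    # Largest number of whole "ratio units" both pools can supply.
    units = min(len(chinese) // zh_parts, len(english) // en_parts)
    rng = random.Random(seed)
    mixed = (rng.sample(chinese, units * zh_parts)
             + rng.sample(english, units * en_parts))
    rng.shuffle(mixed)
    return mixed


# Toy pools: 10 Chinese and 4 English examples give 8 + 4 = 12 mixed examples.
chinese_pool = [{"lang": "zh", "id": i} for i in range(10)]
english_pool = [{"lang": "en", "id": i} for i in range(4)]
mixed = mix_sft(chinese_pool, english_pool)
```

Fixing the seed keeps the mix reproducible across runs, which matters when comparing outputs between fine-tuning variants.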
Risks & Boundaries
Limitations
At 2B parameters, the model cannot match larger models on many English benchmarks.
CHC-Bench scores show room for improvement versus top aligned models.
When Not To Use
When you need top-tier English or cross-lingual SOTA performance.
When extreme numerical or code accuracy is required (low pass@1 on HumanEval).
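To check whether pass@1 is too low for your own use case, you can score your generations with the unbiased pass@k estimator commonly used for HumanEval. The summary does not state how the paper computed its number, so treat this as a general-purpose sketch; `pass_at_k` and its argument names are chosen for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass the unit tests,
    k: number of attempts the metric allows.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to c / n, the fraction of samples that pass; the benchmark score is the mean of this value over all problems.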
Failure Modes
Weaker code generation and math (low HumanEval / GSM8K compared with larger models).
Cultural bias reflecting Chinese web text (model clusters as 'community-focused').

