A 2B Chinese‑centric LLM trained from scratch on roughly 840B Chinese tokens, plus an open Chinese corpus and a hard-case Chinese benchmark.

April 5, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper provides full data and training details and shows competitive Chinese performance for a 2B model; results are solid but not state-of-the-art versus larger models, so treat this as a practical mid-size Chinese option.

Citations: 5

Evidence Strength: 0.60

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Xinrun Du, Zhouliang Yu, Songyang Gao, Ding Pan, Yuyang Cheng, Ziyang Ma, Ruibin Yuan, Xingwei Qu, Jiaheng Liu, Tianyu Zheng, Xinchen Luo, Guorui Zhou, Wenhu Chen, Ge Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your product targets Chinese users, pretraining with a large Chinese-majority corpus plus Chinese-heavy SFT yields better cultural knowledge and instruction following than adapting an English-first model.

Summary TLDR

The authors build CT-LLM, a 2-billion-parameter decoder-only transformer trained from scratch on a 1.25-trillion-token mixture weighted toward Chinese (≈840B Chinese tokens). They release the Massive Appropriate Pretraining Chinese Corpus (MAP-CC) and a multidisciplinary Chinese 'hard case' benchmark (CHC-Bench), and show that supervised fine-tuning (SFT) and direct preference optimization (DPO) improve helpfulness and safety. CT-LLM is competitive among 2B models on Chinese tasks, but it still trails larger multilingual models on many English-centric benchmarks.

Problem Statement

Most open LLMs are English‑heavy. The paper asks: can you train an LLM from scratch with Chinese as the dominant pretraining language and get a practical model for Chinese tasks? The work also seeks to provide a large, cleaned Chinese pretraining corpus and a benchmark that stresses Chinese cultural and language challenges.

Main Contribution

MAP-CC: an open Chinese-focused pretraining corpus (≈840.48B Chinese tokens) with documented filtering and deduplication.

CHC-Bench: a multidisciplinary Chinese hard-case instruction-following benchmark for cultural, classical, and exam-style problems.

Key Findings

They pretrain on a 1.2547 trillion token corpus with a Chinese majority.

Numbers: 1,254.68B total tokens; 840.48B Chinese, 314.88B English, 99.3B code

Practical Use: If you want Chinese fluency, bias pretraining toward large Chinese token counts (hundreds of billions) and apply targeted filtering/deduplication.

Evidence Ref: Section 3.1 (Pretraining data distribution)
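
The token counts above translate directly into mixture shares. The following is a minimal sketch of that arithmetic, using only the figures reported in this finding; the variable and function names are illustrative and not taken from the paper.

```python
# Token counts in billions, as reported in the finding above.
TOKENS_B = {"chinese": 840.48, "english": 314.88, "code": 99.3}
REPORTED_TOTAL_B = 1254.68  # total stated in the finding


def mixture_shares(tokens_b):
    """Return each source's share of the mixture as a fraction of the summed total."""
    total = sum(tokens_b.values())
    return {name: count / total for name, count in tokens_b.items()}


shares = mixture_shares(TOKENS_B)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Under these figures the mixture comes out at roughly 67% Chinese, 25% English, and 8% code. The three components sum to about 0.02B less than the reported total, which is consistent with rounding in the per-source counts.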

Model scale and setup: CT-LLM is a 2B decoder-only transformer with a 4,096-token context window.

Numbers: 2B parameters; 32 layers; context length 4,096; vocabulary size 125,696

Practical Use: A 2B Chinese-centric model with 4k context can be trained and served on modest hardware compared with larger models; expect practical tradeoffs in accuracy vs cost.

Evidence Ref: Section 3.2 (Model Architecture) and Table 1
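
To get a feel for how much of a 2B budget the large vocabulary consumes, the sketch below counts embedding parameters. The layer count, context length, and vocabulary size come from the numbers above; `hidden_size = 2048` is an assumption made for illustration and is not stated in this summary, and `ModelSpec` and `embedding_params` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    n_layers: int = 32          # from the summary
    context_length: int = 4096  # from the summary
    vocab_size: int = 125_696   # from the summary
    hidden_size: int = 2048     # ASSUMPTION: not stated in this summary


def embedding_params(spec, tied=True):
    """Count parameters in the token embedding table.

    With tied (shared input-output) embeddings the table is counted once;
    untied models carry a second table of the same size for the output layer.
    """
    table = spec.vocab_size * spec.hidden_size
    return table if tied else 2 * table


spec = ModelSpec()
print(embedding_params(spec, tied=True))
print(embedding_params(spec, tied=False))
```

Under the assumed hidden size, a tied table holds about 257M parameters and an untied design would double that, which suggests why shared input-output embeddings matter at this scale.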

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Pretraining corpus size | 1,254.68B tokens (840.48B Chinese) | - | - | MAP-CC | Section 3.1: dataset composition and token counts | 3.1
Model scale | 2B parameters; 4k context | - | - | CT-LLM architecture | Section 3.2 and Table 1 | 3.2

What To Try In 7 Days

Download MAP-CC samples and inspect filters to replicate Chinese data cleaning.

Fine-tune an existing 2B model with a 2:1 Chinese:English SFT mix and compare Chinese task outputs.

Run the CHC-Bench subset on your model to surface Chinese-specific failure modes.
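
The 2:1 Chinese:English SFT mix suggested above could be assembled along these lines. This is a sketch under stated assumptions: `build_sft_mix` is a hypothetical helper, the example records are placeholders, and the paper's own sampling procedure may differ.

```python
import random


def build_sft_mix(zh_examples, en_examples, zh_per_en=2, seed=0):
    """Sample a shuffled SFT set with `zh_per_en` Chinese examples per English one.

    The mix is capped by whichever pool runs out first, so the ratio is exact.
    """
    rng = random.Random(seed)
    n_en = min(len(en_examples), len(zh_examples) // zh_per_en)
    n_zh = n_en * zh_per_en
    mix = rng.sample(zh_examples, n_zh) + rng.sample(en_examples, n_en)
    rng.shuffle(mix)
    return mix


# Placeholder records standing in for real instruction-response pairs.
zh = [{"lang": "zh", "id": i} for i in range(10)]
en = [{"lang": "en", "id": i} for i in range(10)]
mix = build_sft_mix(zh, en)
print(len(mix), sum(ex["lang"] == "zh" for ex in mix))
```

Fixing the seed keeps the mix reproducible, which makes before/after comparisons on Chinese tasks easier to attribute to the data ratio rather than to sampling noise.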

Optimization Features

Token Efficiency
shared input-output embeddings; BPE tokenizer that splits numbers
Model Optimization
RoPE positional embeddings; SwiGLU activations; RMSNorm
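
For readers unfamiliar with two of the components named above, here is a minimal pure-Python sketch of RMSNorm and the SwiGLU gate, following their standard definitions. It is an illustration only, not the authors' implementation; RoPE is omitted for brevity, and in a real feed-forward block `gate` and `up` would each come from a separate linear projection of the input.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then scale by a learned per-dimension weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate, up):
    """Element-wise SwiGLU gating: silu(gate) * up."""
    return [silu(g) * u for g, u in zip(gate, up)]


normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(normed)
print(swiglu([0.0, 1.0], [5.0, 2.0]))
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so the normalized vector keeps its direction and only its scale changes.
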
Training Optimization
Large-scale Chinese-heavy pretraining (840B Chinese tokens); Heuristic filtering adapted for Chinese; Multi-stage deduplication (Bloom filter, MinHash LSH, similar-line)
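
The MinHash stage of the deduplication listed above can be illustrated with a small standard-library sketch. It covers only signature computation and Jaccard estimation; the LSH banding, Bloom filter, and similar-line stages are not shown. The shingle size, the number of hash functions, and the use of salted BLAKE2b are assumptions chosen for illustration, not the paper's settings.

```python
import hashlib


def shingles(text, n=5):
    """Character n-grams; character-level shingles suit Chinese, which has no word spaces."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash_signature(text, num_perm=64):
    """For each of num_perm salted hash functions, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, which estimates shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "今天天气很好，我们去公园散步。"
doc_b = "The quick brown fox jumps over the lazy dog."
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_a)))
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```

In a full pipeline, documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates, with LSH used to avoid comparing every pair.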

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

The model has only 2B parameters, so it cannot match larger models on many English benchmarks.

CHC-Bench scores show room for improvement versus top aligned models.

When Not To Use

When you need top-tier English or cross-lingual SOTA performance.

When high numerical or code accuracy is required (low pass@1 on HumanEval).
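
For context on the pass@1 metric mentioned above, the sketch below implements the commonly used unbiased pass@k estimator from the original HumanEval evaluation setup. This summary does not say which estimator or sampling budget the authors used, so treat this as background on the metric rather than a description of their evaluation.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k given n sampled solutions of which c passed the tests.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a random
    size-k subset of the n samples contains no passing solution.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for a single problem: 10 samples, 3 of which pass.
print(pass_at_k(n=10, c=3, k=1))
```

For k = 1 the estimator reduces to the plain pass rate c / n, averaged over problems to give the benchmark score.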

Failure Modes

Weaker code generation and math (low HumanEval / GSM8K compared with larger models).

Cultural bias reflecting Chinese web text (model clusters as 'community-focused').

Core Entities

Models

CT-LLM; SFT; Qwen-7B

Metrics

Accuracy; pass@1; perplexity; GPT-4 score; Cvalues-QA (score)

Datasets

MAP-CC; CHC-Bench; CQIA; OL-CC; COIG-PC; OpenHermesPreferences; Common Crawl

Benchmarks

CHC-Bench; MMLU; C-Eval; CMMLU; HellaSwag; GSM8K; HumanEval; Cvalues

Context Entities

Models

Phi-2; Gemma-2b; Qwen1.5; TinyLlama; Stablelm-zephyr

Metrics

training tokens; checkpoint evaluations

Datasets

RefinedWeb; CCNet; Wudao; SkyPile

Benchmarks

MT-Bench; TriviaQA; SQuAD2.0; MBPP