A 2B Chinese-centric LLM trained from scratch on about 840B Chinese tokens, plus an open Chinese corpus and a hard-case Chinese benchmark.

Overview

Decision Snapshot: Needs Validation

The paper provides full data and training details and shows competitive Chinese performance for a 2B model; results are solid but not state-of-the-art versus larger models, so treat this as a practical mid-size Chinese option.

Citations: 5

Evidence Strength: 0.60

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Xinrun Du, Zhouliang Yu, Songyang Gao, Ding Pan, Yuyang Cheng, Ziyang Ma, Ruibin Yuan, Xingwei Qu, Jiaheng Liu, Tianyu Zheng, Xinchen Luo, Guorui Zhou, Wenhu Chen, Ge Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your product targets Chinese users, pretraining with a large Chinese-majority corpus plus Chinese-heavy SFT yields better cultural knowledge and instruction following than adapting an English-first model.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist, Founder

Summary TLDR

The authors build CT-LLM, a 2-billion-parameter decoder-only transformer trained from scratch on a 1.25-trillion-token mixture skewed toward Chinese (≈840B Chinese tokens). They release the Massive Appropriate Pretraining Chinese Corpus (MAP-CC) and a multidisciplinary Chinese 'hard case' benchmark (CHC-Bench), and show that supervised fine-tuning (SFT) and direct preference optimization (DPO) improve helpfulness and safety. CT-LLM is competitive among 2B models on Chinese tasks, but it still trails larger multilingual models on many English-centric benchmarks.

Problem Statement

Most open LLMs are English‑heavy. The paper asks: can you train an LLM from scratch with Chinese as the dominant pretraining language and get a practical model for Chinese tasks? The work also seeks to provide a large, cleaned Chinese pretraining corpus and a benchmark that stresses Chinese cultural and language challenges.

Main Contribution

MAP-CC: an open Chinese-focused pretraining corpus (≈840.48B Chinese tokens) with documented filtering and deduplication.

CHC-Bench: a multidisciplinary Chinese hard-case instruction-following benchmark for cultural, classical, and exam-style problems.

Key Findings

They pretrain on a 1.2547 trillion token corpus with a Chinese majority.

Numbers: 1,254.68B total tokens; 840.48B Chinese, 314.88B English, 99.3B code

Practical Use: If you want Chinese fluency, bias pretraining toward large Chinese token counts (hundreds of billions) and apply targeted filtering/deduplication.

Evidence Ref: Section 3.1 (Pretraining data distribution)
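
The token counts above imply a mixture of roughly 67% Chinese, 25% English, and 8% code. A minimal sketch of that arithmetic follows; the variable and function names are illustrative only. Note that the three components sum to 1,254.66B, slightly below the stated total, presumably because of rounding.

```python
# Token counts in billions, as reported in the summary above.
tokens_b = {"chinese": 840.48, "english": 314.88, "code": 99.3}
stated_total_b = 1254.68

def mixture_shares(counts, total):
    """Return each component's share of the total as a percentage."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

shares = mixture_shares(tokens_b, stated_total_b)
print(shares)
```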

Model scale and setup: CT-LLM is a 2B decoder transformer with a 4,096-token context.

Numbers: 2B parameters; 32 layers; context length 4096; vocab 125,696

Practical Use: A 2B Chinese-centric model with 4k context can be trained and served on modest hardware compared with larger models; expect practical tradeoffs in accuracy vs cost.

Evidence Ref: Section 3.2 (Model Architecture) and Table 1
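
As a rough sanity check on the "2B" figure, the sketch below counts parameters for a bias-free decoder with SwiGLU feed-forward blocks, RMSNorm, and shared input-output embeddings. The layer count and vocabulary size come from the numbers above; the hidden width (2048) and feed-forward width (5632) are assumptions that this summary does not state, so treat the result as an estimate rather than the authors' own count.

```python
# Values stated in the summary above.
N_LAYERS = 32
VOCAB_SIZE = 125_696

# ASSUMPTIONS (not stated in this summary): hidden and feed-forward widths.
HIDDEN = 2048
FFN_HIDDEN = 5632

def estimate_params(n_layers, vocab, hidden, ffn_hidden, tied_embeddings=True):
    """Rough parameter count for a bias-free decoder with SwiGLU and RMSNorm."""
    attention = 4 * hidden * hidden      # Q, K, V, and output projections
    mlp = 3 * hidden * ffn_hidden        # SwiGLU uses gate, up, and down matrices
    norms = 2 * hidden                   # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = vocab * hidden          # counted once when input/output are shared
    if not tied_embeddings:
        embeddings *= 2
    return n_layers * per_layer + embeddings + hidden  # last term: final RMSNorm

total = estimate_params(N_LAYERS, VOCAB_SIZE, HIDDEN, FFN_HIDDEN)
print(f"{total / 1e9:.2f}B parameters")
```

Under these assumed widths the estimate lands near 1.9B, which is consistent with the model being described as 2B.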

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Pretraining corpus size	1,254.68B tokens (840.48B Chinese)	—	—	MAP-CC	Section 3.1: dataset composition and token counts	3.1
Model scale	2B parameters; 4k context	—	—	CT-LLM architecture	Section 3.2 and Table 1	3.2

What To Try In 7 Days

Download MAP-CC samples and inspect filters to replicate Chinese data cleaning.

Fine-tune an existing 2B model with a 2:1 Chinese:English SFT mix and compare Chinese task outputs.

Run the CHC-Bench subset on your model to surface Chinese-specific failure modes.
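
For the 2:1 Chinese:English SFT mix suggested above, one simple approach is to subsample both pools to the target ratio before fine-tuning. This is a hedged sketch, not the authors' pipeline: `mix_sft`, its parameters, and the toy records are hypothetical.

```python
import random

def mix_sft(zh_examples, en_examples, zh_per_en=2, seed=0):
    """Subsample so the result holds zh_per_en Chinese examples per English one."""
    rng = random.Random(seed)
    # Largest English count that both pools can support at the requested ratio.
    n_en = min(len(en_examples), len(zh_examples) // zh_per_en)
    n_zh = n_en * zh_per_en
    mixed = rng.sample(zh_examples, n_zh) + rng.sample(en_examples, n_en)
    rng.shuffle(mixed)
    return mixed

# Toy placeholder records; real SFT data would carry prompts and responses.
zh = [{"lang": "zh", "id": i} for i in range(10)]
en = [{"lang": "en", "id": i} for i in range(10)]
mixed = mix_sft(zh, en)
```

Fixing the seed keeps the mix reproducible, which makes before/after comparisons on Chinese tasks easier to interpret.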

Optimization Features

Token Efficiency

shared input-output embeddings

BPE tokenizer that splits numbers
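
To illustrate what "splits numbers" can mean in practice, the sketch below pre-tokenizes text so that each digit becomes its own piece before BPE merges are applied. It assumes per-digit splitting, which this summary does not spell out, and `split_digits` is a hypothetical helper rather than the authors' tokenizer code.

```python
import re

# One digit at a time, or a maximal run of non-digit characters.
_DIGIT_OR_RUN = re.compile(r"\d|\D+")

def split_digits(text):
    """Pre-tokenize text so that every digit becomes its own piece."""
    return _DIGIT_OR_RUN.findall(text)

pieces = split_digits("售价2024元")
print(pieces)
```

Splitting digits this way keeps the tokenizer from learning arbitrary multi-digit tokens, so every number is represented with the same small set of pieces.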

Model Optimization

RoPE positional embeddings

SwiGLU activations

RMSNorm
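
Two of the components listed above are small enough to show directly. The following is a minimal pure-Python sketch of RMSNorm and the elementwise gating step of SwiGLU, written for plain lists; it omits the linear projections and is not taken from the CT-LLM code. Function names and the `eps` default are assumptions.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_gate(gate_proj, up_proj):
    """Elementwise SwiGLU gating: SiLU(gate) * up, applied before the down projection."""
    return [silu(g) * u for g, u in zip(gate_proj, up_proj)]

normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
gated = swiglu_gate([0.0, 1.0], [2.0, 2.0])
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so it needs only one weight vector per norm.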

Training Optimization

Large-scale Chinese-heavy pretraining (840B Chinese tokens)

Heuristic filtering adapted for Chinese

Multi-stage deduplication (Bloom filter, MinHash LSH, similar-line)
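
To make the near-duplicate stage more concrete, here is a small MinHash sketch over character 3-grams using only the standard library. It covers the signature and similarity estimate only; the LSH banding step, the Bloom-filter exact-match stage, and the similar-line pass are omitted. Shingle size, number of hashes, and all function names are assumptions rather than the settings used for MAP-CC.

```python
import hashlib

def shingles(text, n=3):
    """Set of overlapping character n-grams (suits unsegmented Chinese text)."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash_signature(text, num_hashes=64, n=3):
    """One minimum per seeded hash function over the document's shingles."""
    grams = [g.encode("utf-8") for g in shingles(text, n)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "big")
            for g in grams
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("今天天气很好，适合出门散步。")
b = minhash_signature("今天天气很好，适合出门跑步。")
c = minhash_signature("深度学习模型需要大量数据。")
```

In a full pipeline, documents whose estimated similarity exceeds a chosen threshold would be grouped and all but one copy dropped; the threshold itself is a tuning choice not given in this summary.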

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://chinese-tiny-llm.github.io/

Data URLs

https://chinese-tiny-llm.github.io/

Risks & Boundaries

Limitations

The model has only 2B parameters, so it cannot match larger models on many English benchmarks.

CHC-Bench scores show room for improvement versus top aligned models.

When Not To Use

When you need top-tier English or cross-lingual SOTA performance.

When extreme numerical or code accuracy is required (low pass@1 on HumanEval).
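
For readers checking code quality on their own workloads, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. The sketch below implements that estimator; this summary does not say how the authors computed their HumanEval score, so this is offered only as a standard way to reproduce the metric.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples per problem, c of which pass."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples for one problem, 3 pass the unit tests.
score = pass_at_k(n=10, c=3, k=1)
```

For k = 1 the estimator reduces to the fraction of samples that pass, averaged over problems.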

Failure Modes

Weaker code generation and math (low HumanEval / GSM8K compared with larger models).

Cultural bias reflecting Chinese web text (model clusters as 'community-focused').

Core Entities

Models

CT-LLM, SFT, Qwen-7B

Metrics

Accuracy, pass@1, perplexity, GPT-4 score, Cvalues-QA (score)

Datasets

MAP-CC, CHC-Bench, CQIA, OL-CC, COIG-PC, OpenHermesPreferences, Common Crawl

Benchmarks

CHC-Bench, MMLU, C-Eval, CMMLU, HellaSwag, GSM8K, HumanEval, Cvalues

Context Entities

Models

Phi-2, Gemma-2b, Qwen1.5, TinyLlama, Stablelm-zephyr

Metrics

training tokens, checkpoint evaluations

Datasets

RefinedWeb, CCNet, Wudao, SkyPile

Benchmarks

MT-Bench, TriviaQA, SQuAD2.0, MBPP
