Overview
The benchmark is ready to use for Chinese-language evaluation and is backed by tests on a broad set of models. Results are robust, but note that the dataset is multiple-choice only and contains an estimated ~2% label noise.
Citations: 16
Evidence Strength: 0.80
Confidence: 0.88
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/6
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 50%
Production readiness: 80%
Novelty: 60%
Why It Matters For Business
CMMLU shows current LLMs still miss large swaths of Chinese factual and reasoning knowledge. If your product targets Chinese users or policies, evaluate models on Chinese-specific data before deployment.
Summary TLDR
CMMLU is a public Chinese multi-task multiple-choice benchmark with 11,528 questions across 67 subjects (STEM, humanities, social science, and China-specific topics). The authors evaluated GPT-4, ChatGPT and 20+ open models. Most models score well below a 60% pass threshold used in Chinese exams; GPT-4 scores ~71% (best) while many models hover near 40–60%. Chain-of-thought prompts often do not help and can break answer extraction. Negation and multi-part (sub-option) questions cause large drops. The dataset and code are released for evaluation.
Problem Statement
There is no comprehensive, culturally appropriate Chinese benchmark like MMLU to measure LLM knowledge and reasoning in Mandarin. English-centric benchmarks miss China-specific facts and language constructs, so a dedicated Chinese multitask test is needed.
Main Contribution
CMMLU: a public Chinese multi-task multiple-choice benchmark with 11,528 questions over 67 subjects and ≥105 questions per subject.
A large evaluation (GPT-4, ChatGPT, and 20+ open models) showing average performance gaps and topic imbalance (weaker on STEM and China-specific topics).
Key Findings
Most evaluated LLMs score below a 60% pass mark on CMMLU (Chinese-exam pass = 60%).
CMMLU is large and broad: 11,528 questions covering 67 subjects with at least 105 questions per subject.
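The subject sizes above determine how much a per-subject score can be trusted. As a rough sketch, assuming questions are independent and accuracy follows a binomial model (an assumption, not something the source states), the standard error of an accuracy estimate can be computed as follows. The 0.5 accuracy is a hypothetical worst case for variance, not a reported score.

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimate over n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


# 105 is the minimum per-subject size and 11,528 the full benchmark size (from the text).
# 0.5 is a hypothetical accuracy chosen because it maximizes the variance.
se_subject = accuracy_standard_error(0.5, 105)
se_full = accuracy_standard_error(0.5, 11528)

print(f"per-subject SE: {se_subject * 100:.1f} pts")
print(f"full-benchmark SE: {se_full * 100:.1f} pts")
```

Under these assumptions, a score on a single 105-question subject carries a standard error of roughly 5 points, so small per-subject differences between models should be read with caution, while the full-benchmark average is far more stable.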
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | GPT-4 70.95% | Random 25% | +45.95 pts | CMMLU (five-shot) | Table 1 (five-shot averages) | Table 1 |
| Accuracy | Baichuan2-13B 61.92% | Random 25% | +36.92 pts | CMMLU (five-shot) | Table 1 (five-shot averages) | Table 1 |
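The Delta column in the table above is simply accuracy minus the 25% random-guess baseline for four-option questions. A minimal sketch of that arithmetic, using only the two rows shown (the function and variable names are illustrative, and the 60% pass mark comes from the Key Findings):

```python
RANDOM_BASELINE = 25.0  # four answer options -> 25% expected accuracy when guessing
PASS_MARK = 60.0        # Chinese-exam pass mark cited in the text

# Five-shot averages copied from the table above.
five_shot_accuracy = {"GPT-4": 70.95, "Baichuan2-13B": 61.92}


def delta_over_random(accuracy_pct: float, baseline_pct: float = RANDOM_BASELINE) -> float:
    """Percentage-point gain over the random-guess baseline."""
    return round(accuracy_pct - baseline_pct, 2)


deltas = {model: delta_over_random(acc) for model, acc in five_shot_accuracy.items()}
passes = {model: acc >= PASS_MARK for model, acc in five_shot_accuracy.items()}

for model in five_shot_accuracy:
    print(f"{model}: +{deltas[model]:.2f} pts over random, passes 60%: {passes[model]}")
```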
What To Try In 7 Days
Run CMMLU, or a subset of the subjects relevant to your product, on candidate models to get a quick benchmark.
Compare next-token vs free-generation scoring; use next-token for faster, more stable MCQ evaluation.
Test prompt formats: avoid chain-of-thought (CoT) prompts for Chinese MCQs unless you have validated them on your models; try one direct-answer and one few-shot prompt per model.
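Next-token scoring, as suggested in the list above, can be sketched as picking the option letter whose token the model rates most likely, with no text generation or parsing involved. The sketch below is a hedged illustration: it assumes you already have per-option log-probabilities from your model, and the function names and example numbers are hypothetical, not taken from the CMMLU code.

```python
from typing import Mapping, Sequence

CHOICES = ("A", "B", "C", "D")


def pick_by_next_token(token_logprobs: Mapping[str, float]) -> str:
    """Return the option letter with the highest next-token log-probability.

    Letters missing from the mapping are treated as having -inf log-probability.
    """
    return max(CHOICES, key=lambda c: token_logprobs.get(c, float("-inf")))


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Made-up log-probabilities for two questions (hypothetical values for illustration).
example_logprobs = [
    {"A": -2.3, "B": -0.4, "C": -3.1, "D": -2.9},
    {"A": -1.9, "B": -2.2, "C": -1.1, "D": -0.8},
]
example_gold = ["B", "C"]

predictions = [pick_by_next_token(lp) for lp in example_logprobs]
print(predictions, accuracy(predictions, example_gold))
```

Because every question always yields exactly one of the four letters, this style of scoring cannot fail to produce an answer, which is one reason it tends to be more stable than parsing free-form generations.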
Risks & Boundaries
Limitations
About 2% estimated label noise from data collection and OCR.
Contains 16 China-specific subjects, so scores may not generalize outside a Chinese context.
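To get a feel for what ~2% label noise means for reported scores, here is a rough sketch under strong simplifying assumptions that are not from the source: a noisy label is uniformly one of the three wrong options, and a model's wrong answers are also uniform over the wrong options. The function names are hypothetical.

```python
def observed_accuracy(true_acc: float, noise: float = 0.02, n_options: int = 4) -> float:
    """Accuracy measured against noisy labels, under the uniform-noise assumption.

    Clean items (1 - noise): the model matches the label with probability true_acc.
    Noisy items (noise): only a wrong model answer can match the wrong label,
    with probability 1 / (n_options - 1).
    """
    wrong_options = n_options - 1
    return true_acc * (1.0 - noise) + (1.0 - true_acc) * noise / wrong_options


def true_accuracy(observed: float, noise: float = 0.02, n_options: int = 4) -> float:
    """Invert observed_accuracy to estimate accuracy against clean labels."""
    k = noise / (n_options - 1)
    return (observed - k) / (1.0 - noise - k)


# Under this model, a hypothetical perfect model would top out near 98%.
ceiling = observed_accuracy(1.0)
print(f"ceiling for a perfect model: {ceiling * 100:.1f}%")
```

The practical reading is that differences of a point or two near the top of a leaderboard may sit within the range that label noise alone could explain.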
When Not To Use
Evaluating free-form or long-answer generation tasks.
Assessing capability in languages other than Chinese.
Failure Modes
Chain-of-thought prompts can produce long outputs that break regex answer extraction.
Negation and sub-option questions cause consistent performance drops.
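The extraction failure described above can be mitigated by taking the last explicit answer statement in the output and returning nothing when no answer is found, rather than guessing. The following is a minimal sketch; the pattern and function name are assumptions for illustration and are not the extraction rules used by the CMMLU code.

```python
import re
from typing import Optional

# Hypothetical pattern: "答案" (answer), optional connectives or colons, then a letter.
ANSWER_PATTERN = re.compile(r"答案[是为：:\s]*([ABCD])")
# Fallback for direct answers such as "B" or "B. ..." at the very start of the output.
BARE_LETTER_PATTERN = re.compile(r"\s*([ABCD])(?![A-Za-z])")


def extract_choice(output: str) -> Optional[str]:
    """Return the chosen option letter, or None if no answer can be parsed."""
    matches = ANSWER_PATTERN.findall(output)
    if matches:
        # Long chain-of-thought outputs may mention several options;
        # the final explicit answer statement is taken as the model's choice.
        return matches[-1]
    bare = BARE_LETTER_PATTERN.match(output)
    if bare:
        return bare.group(1)
    return None


print(extract_choice("首先排除A，再排除B。因此答案是C。"))
print(extract_choice("B"))
print(extract_choice("这道题比较复杂，需要进一步分析"))
```

Counting `None` results separately from wrong answers makes it visible when a prompt format, rather than the model's knowledge, is driving a score drop.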

