Overview
Production Readiness
0.8
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
16
Why It Matters For Business
CMMLU shows that current LLMs still miss large swaths of Chinese factual and reasoning knowledge. If your product serves Chinese users or handles China-specific policy content, evaluate candidate models on Chinese-specific data before deployment.
Summary TLDR
CMMLU is a public Chinese multi-task, multiple-choice benchmark with 11,528 questions across 67 subjects (STEM, humanities, social science, and China-specific topics). The authors evaluated GPT-4, ChatGPT, and 20+ open models. Most models score below the 60% pass threshold used in Chinese exams; GPT-4 scores highest at about 71%, while many others fall in the 40–60% range. Chain-of-thought prompts often do not help and can break answer extraction. Questions with negation or multiple sub-options cause large accuracy drops. The dataset and code are publicly released for evaluation.
Problem Statement
No comprehensive, culturally appropriate Chinese counterpart to MMLU exists for measuring LLM knowledge and reasoning in Mandarin. English-centric benchmarks miss China-specific facts and language constructs, so a dedicated Chinese multi-task test is needed.
Main Contribution
CMMLU: a public Chinese multi-task multiple-choice benchmark with 11,528 questions over 67 subjects and ≥105 questions per subject.
A large evaluation (GPT-4, ChatGPT, and 20+ open models) showing average performance gaps and topic imbalance (weaker on STEM and China-specific topics).
Empirical analyses of factors that affect performance: chain-of-thought prompts, few-shot examples, model size, negation, and sub-option questions.
A practical evaluation recipe: a comparison of three multiple-choice scoring strategies (next-token, perplexity, free generation), plus regex code for answer extraction.
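To make two of the three scoring strategies concrete, here is a minimal sketch of next-token and perplexity scoring. It assumes the per-option logits and per-token log probabilities have already been obtained from a model; the function names and the numbers are illustrative and do not come from the released CMMLU code.

```python
import math

CHOICES = ["A", "B", "C", "D"]

def next_token_choice(choice_logits):
    """Next-token strategy: pick the option letter with the highest logit."""
    return max(CHOICES, key=lambda c: choice_logits[c])

def perplexity(token_logprobs):
    """Perplexity of one sequence from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_choice(option_logprobs):
    """Perplexity strategy: pick the option whose text has the lowest perplexity."""
    return min(CHOICES, key=lambda c: perplexity(option_logprobs[c]))

# Made-up numbers standing in for real model outputs (assumption).
logits = {"A": 1.2, "B": 3.4, "C": 0.1, "D": -0.5}
logprobs = {
    "A": [-2.0, -1.5],
    "B": [-0.3, -0.2],
    "C": [-1.0, -3.0],
    "D": [-2.5, -2.5],
}

print(next_token_choice(logits))
print(perplexity_choice(logprobs))
```

Free generation, the third strategy, instead parses the answer letter out of the generated text, which is why it depends on robust extraction.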
Key Findings
Most evaluated LLMs score below the 60% pass mark used in Chinese exams.
CMMLU is large and broad: 11,528 questions covering 67 subjects with at least 105 questions per subject.
Chain-of-thought (CoT) prompts usually do not improve, and often reduce, multiple-choice accuracy on CMMLU.
Few-shot examples help base (pretrained) models but not instruction-finetuned (SFT/RLHF) chat models.
Negation and sub-option question formats significantly reduce accuracy for most models.
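One way to check the negation finding on your own model outputs is to split accuracy by whether the question contains a negation phrase. The sketch below is an assumption-laden illustration: the marker list and the record fields (`question`, `pred`, `gold`) are hypothetical, and plain substring matching is a crude detector that will miss many phrasings.

```python
# Hypothetical negation phrases; extend or replace for real use.
# (is not, does not belong to, incorrect, does not include, wrong)
NEGATION_MARKERS = ("不是", "不属于", "不正确", "不包括", "错误的")

def has_negation(question):
    """Return True if the question text contains any listed negation phrase."""
    return any(marker in question for marker in NEGATION_MARKERS)

def accuracy_by_negation(records):
    """Compute accuracy separately for questions with and without negation.

    Each record is a dict with keys "question", "pred", and "gold".
    A group with no questions gets None instead of an accuracy.
    """
    counts = {"negation": [0, 0], "no_negation": [0, 0]}  # [correct, total]
    for r in records:
        key = "negation" if has_negation(r["question"]) else "no_negation"
        counts[key][0] += int(r["pred"] == r["gold"])
        counts[key][1] += 1
    return {k: (c / n if n else None) for k, (c, n) in counts.items()}

records = [
    {"question": "下列哪项不属于哺乳动物？", "pred": "A", "gold": "B"},
    {"question": "下列哪项属于哺乳动物？", "pred": "C", "gold": "C"},
    {"question": "以下说法不正确的是", "pred": "D", "gold": "D"},
]
print(accuracy_by_negation(records))
```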
Results
Accuracy
Dataset size
Effect of chain-of-thought (CoT)
Sub-option question impact (GPT-4)
Who Should Care
What To Try In 7 Days
Run a subset of CMMLU covering the subjects relevant to your product on candidate models to get a quick benchmark.
Compare next-token vs free-generation scoring; use next-token for faster, more stable MCQ evaluation.
Test prompt formats: avoid CoT for Chinese MCQs unless validated; try one direct-answer and one few-shot prompt per model.
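For the prompt-format test, a direct-answer prompt and a few-shot prompt can share one builder. This is a sketch under assumptions: the instruction wording, the `答案是：` answer cue, and the question dict layout are illustrative choices and may differ from the prompts in the released evaluation code.

```python
def format_question(q, with_answer=False):
    """Render one question; include the gold letter only for few-shot examples."""
    lines = [f"题目：{q['question']}"]  # "Question: ..."
    for letter in "ABCD":
        lines.append(f"{letter}. {q[letter]}")
    # "The answer is:" followed by the letter for worked examples, blank otherwise.
    lines.append("答案是：" + (q["answer"] if with_answer else ""))
    return "\n".join(lines)

def build_prompt(subject, question, shots=()):
    """Build a direct-answer prompt; pass worked examples in `shots` for few-shot."""
    # "The following are single-choice questions about {subject};
    #  please give the correct option directly."
    header = f"以下是关于{subject}的单项选择题，请直接给出正确答案的选项。"
    blocks = [format_question(s, with_answer=True) for s in shots]
    blocks.append(format_question(question))
    return header + "\n\n" + "\n\n".join(blocks)

example = {
    "question": "中国的首都是哪座城市？",  # "Which city is the capital of China?"
    "A": "上海", "B": "北京", "C": "广州", "D": "深圳",
    "answer": "B",
}

print(build_prompt("地理", example))                   # direct-answer (zero-shot)
print(build_prompt("地理", example, shots=[example]))  # one-shot
```

Ending the prompt on the answer cue nudges the model to emit a single option letter next, which is what next-token scoring relies on.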
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- About 2% estimated label noise from data collection and OCR.
- Contains 16 China-specific subjects, so scores may not generalize outside Chinese context.
- Multiple-choice format only; does not test free-form generation or long-form reasoning.
- Answer extraction for free generation needs robust regex or model-based parsing and can still miss outputs.
When Not To Use
- To evaluate free-form or long-answer generation tasks.
- For non-Chinese language capability assessment.
- As the sole test for model safety, bias, or hallucination behaviors.
Failure Modes
- Chain-of-thought prompts can produce long outputs that break regex answer extraction.
- Negation and sub-option questions cause consistent performance drops.
- Multilingual models often underperform on China-specific items due to data mismatch.
- The next-token strategy may fail if the model does not produce a single-letter answer token without examples.
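The extraction failure mode can be reduced by taking the last explicit answer statement in the output and by reporting misses instead of guessing. The patterns below are hypothetical examples, not the regexes from the released code; they only illustrate the idea.

```python
import re

# Hypothetical patterns for explicit answer statements (assumption).
ANSWER_PATTERNS = [
    re.compile(r"答案(?:是|为|选)?\s*[:：]?\s*([ABCD])"),
    re.compile(r"(?:answer|Answer)\s*(?:is)?\s*[:：]?\s*([ABCD])\b"),
]
# The whole output is just one option letter, optionally followed by a period.
BARE_LETTER = re.compile(r"^\s*([ABCD])\s*[.。]?\s*$")

def extract_answer(output):
    """Return the chosen option letter, or None if no answer can be found."""
    bare = BARE_LETTER.match(output)
    if bare:
        return bare.group(1)
    for pattern in ANSWER_PATTERNS:
        matches = pattern.findall(output)
        if matches:
            # Use the last statement: CoT outputs often mention other
            # options before settling on a final answer.
            return matches[-1]
    return None

print(extract_answer("B"))
print(extract_answer("首先排除A，因为它与题意不符。所以答案是D。"))
print(extract_answer("答案应该是B"))  # not covered by the patterns above
```

Counting how often `extract_answer` returns `None` separates extraction failures from genuinely wrong answers when comparing CoT and direct-answer prompts.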
Core Entities
Models
- GPT-4
- ChatGPT
- LLaMA2-70B
- LLaMA2-13B
- LLaMA2-7B
- LLaMA-65B
- LLaMA-30B
- LLaMA-13B
- Falcon-40B
- BLOOMZ-7B
- Baichuan2-13B
- Baichuan-13B
- Baichuan-7B
- InternLM-20B
- InternLM-7B
- Xverse-13B
- ChatGLM2-6B
- ChatGLM-6B
- BatGPT-15B
- SFT
- Chinese-GLM-10B
- ZH LLaMA-13B
- Bactrian-X (BX LLaMA variants)
Metrics
- Accuracy
- random baseline (25%)
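Both metrics are simple to compute. The sketch below compares accuracy against the 25% random baseline for four-option questions and the 60% pass mark mentioned in the findings; the function names are illustrative, and a prediction of `None` (failed extraction) is counted as wrong by assumption.

```python
RANDOM_BASELINE = 0.25  # chance accuracy with four options per question
PASS_MARK = 0.60        # pass threshold used in Chinese exams

def accuracy(predictions, gold):
    """Fraction of predictions that match the gold option letters."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

def summarize(predictions, gold):
    """Report accuracy relative to the random baseline and the pass mark."""
    acc = accuracy(predictions, gold)
    return {
        "accuracy": acc,
        "above_random": acc > RANDOM_BASELINE,
        "passes": acc >= PASS_MARK,
    }

# None marks an output where no answer could be extracted.
print(summarize(["A", "B", None, "D"], ["A", "B", "C", "C"]))
```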
Datasets
- CMMLU
Benchmarks
- MMLU
- C-Eval
- M3KE
- AGIEval

