CMMLU: an 11.5k-question Chinese multitask benchmark exposing limits of current LLMs

Overview

Decision Snapshot: Needs Validation

The benchmark is ready to use for Chinese-language evaluation and is backed by tests on a broad set of models. The results are robust, but the dataset is multiple-choice only and contains an estimated ~2% label noise.

Citations: 16

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 80%

Novelty: 60%

Authors

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, Timothy Baldwin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CMMLU shows current LLMs still miss large swaths of Chinese factual and reasoning knowledge. If your product targets Chinese users or policies, evaluate models on Chinese-specific data before deployment.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

CMMLU is a public Chinese multi-task multiple-choice benchmark with 11,528 questions across 67 subjects (STEM, humanities, social science, and China-specific topics). The authors evaluated GPT-4, ChatGPT and 20+ open models. Most models score well below a 60% pass threshold used in Chinese exams; GPT-4 scores ~71% (best) while many models hover near 40–60%. Chain-of-thought prompts often do not help and can break answer extraction. Negation and multi-part (sub-option) questions cause large drops. The dataset and code are released for evaluation.

Problem Statement

There is no comprehensive, culturally appropriate Chinese benchmark like MMLU to measure LLM knowledge and reasoning in Mandarin. English-centric benchmarks miss China-specific facts and language constructs, so a dedicated Chinese multitask test is needed.

Main Contribution

CMMLU: a public Chinese multi-task multiple-choice benchmark with 11,528 questions over 67 subjects and ≥105 questions per subject.

A large evaluation (GPT-4, ChatGPT, and 20+ open models) showing that most models fall short on average and perform unevenly across topics (weaker on STEM and China-specific topics).

Key Findings

Most evaluated LLMs score below a 60% pass mark on CMMLU (Chinese-exam pass = 60%).

Numbers: GPT-4 70.95% (5-shot); ChatGPT 55.51%; many models 30–62%

Practical Use: Do not assume current LLMs reliably pass Chinese-standard exams; expect gaps in real-world Chinese knowledge and reasoning when deploying models.

Evidence Ref: Table 1 (five-shot averages)

CMMLU is large and broad: 11,528 questions covering 67 subjects with at least 105 questions per subject.

Numbers: 11,528 questions; 67 subjects; ≥105 questions/subject

Practical Use: Use CMMLU for diverse Chinese evaluation and for tracking targeted improvements across subject areas.

Evidence Ref: Section 3 Statistics; Table 7
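Tracking improvements per subject comes down to grouping predictions by subject before averaging. The sketch below is a minimal illustration, not the authors' evaluation code: the record layout, the function name `per_subject_accuracy`, and the toy subject names are assumptions made for the example. It also reports a binomial standard error, since a subject with roughly 105 questions gives a fairly noisy estimate.

```python
import math
from collections import defaultdict

def per_subject_accuracy(records):
    """Group (subject, predicted, gold) triples; return {subject: (accuracy, n, stderr)}."""
    counts = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, predicted, gold in records:
        counts[subject][0] += int(predicted == gold)
        counts[subject][1] += 1
    report = {}
    for subject, (correct, total) in counts.items():
        acc = correct / total
        # Binomial standard error of the accuracy estimate for this subject.
        stderr = math.sqrt(acc * (1 - acc) / total)
        report[subject] = (acc, total, stderr)
    return report

# Hypothetical toy records, not real CMMLU data.
records = [
    ("chinese_history", "A", "A"),
    ("chinese_history", "B", "C"),
    ("college_mathematics", "D", "D"),
    ("college_mathematics", "D", "D"),
]
report = per_subject_accuracy(records)
for subject, (acc, n, stderr) in sorted(report.items()):
    print(f"{subject}: acc={acc:.2f} n={n} stderr={stderr:.2f}")
```

For scale, a subject with 105 questions and 50% accuracy has a standard error of about 4.9 points, so small per-subject differences between models should be read with caution.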

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-4 70.95%	Random 25%	+45.95 pts	CMMLU (five-shot)	Table 1 (five-shot averages)	Table 1
Accuracy	Baichuan2-13B 61.92%	Random 25%	+36.92 pts	CMMLU (five-shot)	Table 1 (five-shot averages)	Table 1
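The delta column is simply each accuracy minus the 25% random baseline (four answer options). As a quick check, the following sketch recomputes the deltas from the five-shot figures quoted in this summary and compares each score to the 60% pass mark; the helper name `summarize` is an assumption made for illustration.

```python
RANDOM_BASELINE = 25.0  # four answer options per question
PASS_MARK = 60.0        # pass threshold cited in this summary

# Five-shot accuracies quoted in this summary (Table 1 of the paper).
five_shot = {"GPT-4": 70.95, "Baichuan2-13B": 61.92, "ChatGPT": 55.51}

def summarize(scores):
    """Return (model, delta over random in points, passes the 60% mark) per model."""
    rows = []
    for model, acc in scores.items():
        rows.append((model, round(acc - RANDOM_BASELINE, 2), acc >= PASS_MARK))
    return rows

for model, delta, passed in summarize(five_shot):
    print(f"{model}: +{delta:.2f} pts over random, pass={passed}")
```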

What To Try In 7 Days

Run CMMLU (subset of relevant subjects) on candidate models to get a quick benchmark.

Compare next-token vs free-generation scoring; use next-token for faster, more stable MCQ evaluation.

Test prompt formats: avoid chain-of-thought (CoT) prompting for Chinese MCQs unless you have validated it; try one direct-answer and one few-shot prompt per model.
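To make the second and third steps concrete, here is a minimal sketch of next-token scoring and of a direct-answer or few-shot prompt builder. It does not call any model: `token_logprobs` stands in for whatever your inference stack returns for the position right after the answer cue, and the English "Answer:" cue, the function names, and the toy question are all assumptions (the benchmark itself uses Chinese prompts).

```python
import math

CHOICES = ("A", "B", "C", "D")

def next_token_answer(token_logprobs):
    """Pick the option letter with the highest next-token log-probability.

    token_logprobs: hypothetical mapping from token string to log-probability
    at the position right after the answer cue. Letters missing from the
    mapping are treated as having probability zero.
    """
    return max(CHOICES, key=lambda c: token_logprobs.get(c, -math.inf))

def build_prompt(question, options, shots=()):
    """Format a direct-answer prompt; `shots` holds (question, options, answer) examples."""
    blocks = []
    for q, opts, ans in shots:
        lines = [q] + [f"{c}. {o}" for c, o in zip(CHOICES, opts)] + [f"Answer: {ans}"]
        blocks.append("\n".join(lines))
    lines = [question] + [f"{c}. {o}" for c, o in zip(CHOICES, options)] + ["Answer:"]
    blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

# Toy example: a zero-shot prompt and a made-up log-probability table.
prompt = build_prompt("2 + 2 = ?", ["3", "4", "5", "6"])
choice = next_token_answer({"A": -2.3, "B": -0.4, "C": -3.1, "the": -0.1})
print(prompt)
print("predicted:", choice)
```

Because only the four option letters are compared, tokens outside A to D (such as "the" above) cannot win, which is what makes this style of scoring more stable than parsing free-form generations.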

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/haonan-li/CMMLU

Data URLs

https://github.com/haonan-li/CMMLU

Risks & Boundaries

Limitations

About 2% estimated label noise from data collection and OCR.

Contains 16 China-specific subjects, so scores may not generalize outside Chinese context.
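To get a feel for how much ~2% label noise can move a score, the sketch below computes the expected measured accuracy under a simple noise model. The model is an assumption, not something stated in the source: a wrong gold label is taken to point uniformly at one of the other three options, and the evaluated model's errors are taken to be spread uniformly over the incorrect options.

```python
def observed_accuracy(true_acc, noise=0.02, n_options=4):
    """Expected measured accuracy when a fraction `noise` of gold labels is wrong.

    Assumes a wrong label lands uniformly on one of the other options and that
    the model's errors are spread uniformly over the incorrect options.
    """
    wrong_options = n_options - 1
    # Label is correct and the model answers correctly.
    match_clean = true_acc * (1 - noise)
    # Label is wrong and the model happens to pick the same wrong option.
    match_noisy = (1 - true_acc) * noise / wrong_options
    return match_clean + match_noisy

for acc in (0.25, 0.60, 1.00):
    print(f"true={acc:.2f} -> measured={observed_accuracy(acc):.4f}")
```

Under these assumptions a perfect model would measure about 98%, while a random guesser stays at 25%, so the noise mostly compresses the top of the scale.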

When Not To Use

To evaluate free-form or long-answer generation tasks.

For non-Chinese language capability assessment.

Failure Modes

Chain-of-thought prompts can produce long outputs that break regex answer extraction.

Negation and sub-option questions cause consistent performance drops.

Core Entities

Models

GPT-4, ChatGPT, LLaMA2-70B, LLaMA2-13B, LLaMA2-7B, LLaMA-65B, LLaMA-30B, LLaMA-13B, Falcon-40B, BLOOMZ-7B, Baichuan2-13B, Baichuan-13B, Baichuan-7B, InternLM-20B, InternLM-7B, Xverse-13B, ChatGLM2-6B, ChatGLM-6B, BatGPT-15BSFT, Chinese-GLM-10B, ZH LLaMA-13B, Bactrian-X (BX LLaMA variants)

Metrics

Accuracy, random baseline (25%)

Datasets

CMMLU

Benchmarks

MMLU, C-Eval, M3KE, AGIEval

