CMMLU: an 11.5k-question Chinese multitask benchmark exposing the limits of current LLMs

June 15, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark is ready to use for Chinese-language evaluation and is backed by tests on a broad set of models. The results are robust, but keep in mind that the dataset is multiple-choice only and contains an estimated ~2% label noise.

Citations: 16

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 80%

Novelty: 60%

Authors

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, Timothy Baldwin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CMMLU shows current LLMs still miss large swaths of Chinese factual and reasoning knowledge. If your product targets Chinese users or policies, evaluate models on Chinese-specific data before deployment.

Who Should Care

Summary TLDR

CMMLU is a public Chinese multi-task multiple-choice benchmark with 11,528 questions across 67 subjects (STEM, humanities, social science, and China-specific topics). The authors evaluated GPT-4, ChatGPT and 20+ open models. Most models score well below a 60% pass threshold used in Chinese exams; GPT-4 scores ~71% (best) while many models hover near 40–60%. Chain-of-thought prompts often do not help and can break answer extraction. Negation and multi-part (sub-option) questions cause large drops. The dataset and code are released for evaluation.

Problem Statement

Before CMMLU, there was no comprehensive, culturally appropriate Chinese benchmark comparable to MMLU for measuring LLM knowledge and reasoning in Mandarin. English-centric benchmarks miss China-specific facts and language constructs, so a dedicated Chinese multitask test is needed.

Main Contribution

CMMLU: a public Chinese multi-task multiple-choice benchmark with 11,528 questions over 67 subjects and ≥105 questions per subject.

A large evaluation (GPT-4, ChatGPT, and 20+ open models) showing average performance gaps and topic imbalance (weaker on STEM and China-specific topics).

Key Findings

Most evaluated LLMs score below a 60% pass mark on CMMLU (Chinese-exam pass = 60%).

Numbers: GPT-4 70.95% (5-shot); ChatGPT 55.51%; many models 30–62%

Practical Use: Do not assume current LLMs reliably pass Chinese-standard exams; expect gaps in real-world Chinese knowledge and reasoning when deploying models.

Evidence Ref: Table 1 (five-shot averages)

CMMLU is large and broad: 11,528 questions covering 67 subjects with at least 105 questions per subject.

Numbers: 11,528 questions; 67 subjects; ≥105 questions/subject

Practical Use: Use CMMLU for diverse Chinese evaluation and for tracking targeted improvements across subject areas.

Evidence Ref: Section 3 Statistics; Table 7
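
Tracking improvements per subject only works if subject-level scores are read with their uncertainty in mind. The following is a minimal sketch, not the authors' evaluation code: it aggregates per-question results into per-subject accuracy and a binomial standard error. The function name `per_subject_accuracy`, the record layout, and the subject names in the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def per_subject_accuracy(records):
    """Aggregate (subject, is_correct) pairs into per-subject statistics.

    The record layout is an assumption made for this sketch; adapt it to
    whatever your evaluation harness actually emits.
    """
    totals = defaultdict(lambda: [0, 0])  # subject -> [n_correct, n_total]
    for subject, is_correct in records:
        totals[subject][0] += int(is_correct)
        totals[subject][1] += 1

    report = {}
    for subject, (correct, total) in totals.items():
        acc = correct / total
        # Binomial standard error of an accuracy estimated from `total` items.
        std_err = math.sqrt(acc * (1 - acc) / total)
        report[subject] = {"n": total, "accuracy": acc, "std_err": std_err}
    return report


# Toy data with placeholder subject names (not real CMMLU results).
records = [
    ("anatomy", True), ("anatomy", False), ("anatomy", True), ("anatomy", True),
    ("chinese_history", False), ("chinese_history", True),
]
report = per_subject_accuracy(records)
for subject, stats in sorted(report.items()):
    print(subject, stats)
```

For a subject with 105 questions and accuracy near 50%, this formula gives a standard error of roughly 4.9 percentage points, so small differences between models on a single subject should be interpreted cautiously.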

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | GPT-4 70.95% | Random 25% | +45.95 pts (GPT-4 − random) | CMMLU (five-shot) | Table 1 (five-shot averages) | Table 1
Accuracy | Baichuan2-13B 61.92% | Random 25% | +36.92 pts | CMMLU (five-shot) | Table 1 (five-shot averages) | Table 1
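
The deltas in the table above are plain differences from the 25% random baseline, which corresponds to guessing among four answer options. As an illustrative sketch, the snippet below recomputes them and checks each score against the 60% pass mark. The accuracy figures are the five-shot numbers quoted in this summary; the helper name `summarize` is hypothetical.

```python
RANDOM_BASELINE = 25.0  # chance accuracy (%) with four answer options
PASS_MARK = 60.0        # Chinese-exam pass threshold (%) cited in the summary

# Five-shot accuracies (%) as quoted in this summary.
scores = {"GPT-4": 70.95, "Baichuan2-13B": 61.92, "ChatGPT": 55.51}


def summarize(scores):
    """Return one row per model with its margin over chance and pass status."""
    rows = []
    for model, acc in scores.items():
        rows.append({
            "model": model,
            "accuracy": acc,
            "delta_vs_random": round(acc - RANDOM_BASELINE, 2),
            "passes": acc >= PASS_MARK,
        })
    return rows


rows = summarize(scores)
for row in rows:
    print(row)
```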

What To Try In 7 Days

Run CMMLU (subset of relevant subjects) on candidate models to get a quick benchmark.

Compare next-token vs free-generation scoring; use next-token for faster, more stable MCQ evaluation.

Test prompt formats: avoid chain-of-thought (CoT) prompts for Chinese MCQs unless validated; try one direct-answer and one few-shot prompt per model.
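
The direct-answer and next-token suggestions above can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's official harness: the prompt template is a hypothetical English stand-in (real CMMLU prompts would be in Chinese), and `pick_by_next_token` assumes you have already obtained log-probabilities for candidate next tokens from your model. How those are obtained depends on the model API and is not shown here.

```python
import math

CHOICES = ("A", "B", "C", "D")


def _format(question, options):
    """Render one question so that the answer letter is the next token."""
    lines = ["Question: " + question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, options)]
    lines.append("Answer:")
    return "\n".join(lines)


def build_direct_prompt(question, options, shots=()):
    """Build a direct-answer prompt; `shots` adds few-shot examples.

    Each shot is a (question, options, answer_letter) tuple. The template is
    an assumption made for this sketch.
    """
    parts = [_format(q, opts) + " " + ans for q, opts, ans in shots]
    parts.append(_format(question, options))
    return "\n\n".join(parts)


def pick_by_next_token(token_logprobs):
    """Choose the option letter with the highest next-token log-probability.

    `token_logprobs` maps candidate next tokens to log-probabilities. Tokens
    other than A-D are ignored, so no free-text parsing is needed.
    """
    scored = {c: token_logprobs.get(c, -math.inf) for c in CHOICES}
    return max(scored, key=scored.get)


prompt = build_direct_prompt(
    "Which number is even?", ["1", "2", "3", "5"],
    shots=[("Which number is largest?", ["9", "4", "7", "1"], "A")],
)
# Hypothetical log-probabilities, as a model might return for the prompt.
choice = pick_by_next_token({"A": -2.3, "B": -0.4, "C": -3.1, "D": -2.9, "The": -1.0})
print(choice)
```

Because only the four option letters are compared, a model that would otherwise begin a long explanation still yields a well-defined answer. Note that some tokenizers encode the letter with a leading space, so the keys of `token_logprobs` may need normalizing first.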

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

About 2% estimated label noise from data collection and OCR.

Contains 16 China-specific subjects, so scores may not generalize outside Chinese context.

When Not To Use

To evaluate free-form or long-answer generation tasks.

For non-Chinese language capability assessment.

Failure Modes

Chain-of-thought prompts can produce long outputs that break regex answer extraction.

Negation and sub-option questions cause consistent performance drops.
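
To illustrate the extraction failure mode above, the sketch below contrasts a naive pattern with a more defensive one. The exact regular expressions used in the paper's evaluation are not given in this summary, so both patterns here are hypothetical.

```python
import re

# Naive approach: take the first capital A-D anywhere in the output.
NAIVE = re.compile(r"[ABCD]")

# More defensive: require an explicit answer marker (English or Chinese)
# and a letter that is not the start of a longer word.
MARKED = re.compile(
    r"(?:(?i:answer)\s*(?:is)?|答案\s*(?:是|为)?)\s*[:：]?\s*\(?([ABCD])\)?(?![A-Za-z])"
)
BARE = re.compile(r"\(?([ABCD])\)?[.。]?")


def extract_naive(text):
    match = NAIVE.search(text)
    return match.group(0) if match else None


def extract_answer(text):
    """Return the answer letter, or None if no clear answer is present."""
    bare = BARE.fullmatch(text.strip())
    if bare:  # the model replied with just a letter
        return bare.group(1)
    marked = MARKED.findall(text)
    # With chain-of-thought output the final marked answer is the conclusion.
    return marked[-1] if marked else None


cot_output = (
    "Let us compare the options. A is too small, and C ignores the "
    "second clause, so the answer is B."
)
print(extract_naive(cot_output))   # picks up the first letter mentioned
print(extract_answer(cot_output))  # picks up the stated conclusion
```

Returning `None` instead of guessing lets unparseable outputs be counted separately, which helps distinguish extraction failures from genuinely wrong answers.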

Core Entities

Models

GPT-4, ChatGPT, LLaMA2-70B, LLaMA2-13B, LLaMA2-7B, LLaMA-65B, LLaMA-30B, LLaMA-13B, Falcon-40B, BLOOMZ-7B, Baichuan2-13B, Baichuan-13B, Baichuan-7B, InternLM-20B, InternLM-7B, Xverse-13B, ChatGLM2-6B, ChatGLM-6B, BatGPT-15B, SFT, Chinese-GLM-10B, ZH LLaMA-13B, Bactrian-X (BX LLaMA variants)

Metrics

Accuracy, random baseline (25%)

Datasets

CMMLU

Benchmarks

MMLU, C-Eval, M3KE, AGIEval, CEval