CMMLU: an 11.5k-question Chinese multitask benchmark exposing limits of current LLMs

June 15, 2023 · 7 min read

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

16

Authors

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, Timothy Baldwin

Why It Matters For Business

CMMLU shows that current LLMs still miss large swaths of Chinese factual and reasoning knowledge. If your product serves Chinese users or deals with Chinese policy, evaluate candidate models on Chinese-specific data before deployment.

Summary TLDR

CMMLU is a public Chinese multi-task multiple-choice benchmark with 11,528 questions across 67 subjects (STEM, humanities, social science, and China-specific topics). The authors evaluated GPT-4, ChatGPT and 20+ open models. Most models score well below a 60% pass threshold used in Chinese exams; GPT-4 scores ~71% (best) while many models hover near 40–60%. Chain-of-thought prompts often do not help and can break answer extraction. Negation and multi-part (sub-option) questions cause large drops. The dataset and code are released for evaluation.

Problem Statement

There is no comprehensive, culturally appropriate Chinese benchmark like MMLU to measure LLM knowledge and reasoning in Mandarin. English-centric benchmarks miss China-specific facts and language constructs, so a dedicated Chinese multitask test is needed.

Main Contribution

CMMLU: a public Chinese multi-task multiple-choice benchmark with 11,528 questions over 67 subjects and ≥105 questions per subject.

A large evaluation (GPT-4, ChatGPT, and 20+ open models) showing average performance gaps and topic imbalance (weaker on STEM and China-specific topics).

Empirical analyses of factors that affect performance: chain-of-thought prompts, few-shot examples, model size, negation, and sub-option questions.

A practical evaluation recipe: a comparison of three multiple-choice scoring strategies (next-token, perplexity, free generation), plus regex code for answer extraction.
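The first two scoring strategies can be sketched without any model library, assuming per-token log-probabilities have already been obtained from the model. The function names, input shapes, and numbers below are illustrative assumptions, not the authors' code.

```python
import math

CHOICES = ("A", "B", "C", "D")

def pick_by_next_token(letter_logprobs):
    """Next-token strategy: compare the log-probabilities assigned to each
    answer letter as the next token and return the most likely letter."""
    return max(CHOICES, key=lambda c: letter_logprobs.get(c, float("-inf")))

def perplexity(token_logprobs):
    """Perplexity of a token sequence: exp of the negative mean log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_by_perplexity(option_token_logprobs):
    """Perplexity strategy: score the tokens of each option and return the
    letter whose sequence has the lowest perplexity."""
    return min(option_token_logprobs,
               key=lambda c: perplexity(option_token_logprobs[c]))

# Hypothetical log-probabilities, for illustration only.
next_token = {"A": -2.3, "B": -0.4, "C": -1.9, "D": -3.1}
per_option = {"A": [-1.2, -0.9], "B": [-0.3, -0.2, -0.4],
              "C": [-2.0], "D": [-1.5, -1.1]}
print(pick_by_next_token(next_token))
print(pick_by_perplexity(per_option))
```

Next-token scoring needs one forward pass per question, which is one plausible reason it is faster than scoring every option separately or parsing free-form output.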

Key Findings

Most evaluated LLMs score below a 60% pass mark on CMMLU (Chinese-exam pass = 60%).

Numbers: GPT4 70.95% (5-shot); ChatGPT 55.51%; many models 30–62%

CMMLU is large and broad: 11,528 questions covering 67 subjects with at least 105 questions per subject.

Numbers: 11,528 questions; 67 subjects; ≥105 questions/subject

Chain-of-thought (COT) prompts usually do not improve—and often reduce—multiple-choice accuracy on CMMLU.

Numbers: Baichuan2-13B overall −6.0 pts with COT; Xverse −13.7 pts; ChatGPT −0.4 pts

Few-shot examples help base (pretrained) models but not instruction-finetuned (SFT/RLHF) chat models.

Negation and sub-option question formats significantly reduce accuracy for most models.

Numbers: Sub-options cause ~10–20% drops; GPT4: 69.74% → 51.14% (−18.6 pts); negation gap often ~5–13%, Baichuan2 gap ≈2.4%
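A drop quoted in "%" can mean percentage points or a change relative to the starting accuracy. The short sketch below computes both for the GPT4 sub-option figures; the helper names are illustrative. The quoted −18.6 matches the percentage-point reading.

```python
def drop_in_points(before, after):
    """Absolute change in percentage points (accuracies given in percent)."""
    return after - before

def relative_drop(before, after):
    """Change relative to the starting accuracy, expressed as a percentage."""
    return (after - before) / before * 100.0

# GPT4 figures quoted in the text: 69.74% without sub-options, 51.14% with them.
before, after = 69.74, 51.14
print(f"{drop_in_points(before, after):.1f} points")
print(f"{relative_drop(before, after):.1f}% relative")
```

When comparing models with very different starting accuracies, it may help to state which of the two readings is in use.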

Results

Accuracy
Value: GPT4 70.95%
Baseline: Random 25%

Accuracy
Value: Baichuan2-13B 61.92%
Baseline: Random 25%

Accuracy
Value: ChatGPT 55.51%
Baseline: Random 25%

Dataset size
Value: 11,528 questions across 67 subjects

Effect of chain-of-thought (COT)
Value: Baichuan2-13B overall −6.0 percentage points with COT
Baseline: Baichuan2-13B (zero-shot direct-answer)

Sub-option question impact (GPT4)
Value: 69.74% → 51.14% (−18.6 pts)
Baseline: GPT4 on non-sub-option questions
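How accuracy figures like those above might be assembled from raw predictions can be sketched as follows. This is a minimal sketch, assuming predictions have already been reduced to answer letters; the record layout, the placeholder subject names, and the choice of an unweighted mean over subjects are assumptions for illustration, not the authors' evaluation code.

```python
from collections import defaultdict

RANDOM_BASELINE = 25.0  # random-guess baseline cited in the text
PASS_MARK = 60.0        # pass threshold cited in the text

def per_subject_accuracy(records):
    """records: iterable of (subject, gold_letter, predicted_letter).
    Returns {subject: accuracy in percent}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, gold, pred in records:
        total[subject] += 1
        correct[subject] += int(gold == pred)
    return {s: 100.0 * correct[s] / total[s] for s in total}

def macro_average(by_subject):
    """Unweighted mean over subjects (an assumed aggregation choice)."""
    return sum(by_subject.values()) / len(by_subject)

# Hypothetical predictions; subject names are placeholders.
records = [
    ("subject_a", "A", "A"), ("subject_a", "B", "C"),
    ("subject_b", "D", "D"), ("subject_b", "C", "C"),
]
by_subject = per_subject_accuracy(records)
avg = macro_average(by_subject)
print(by_subject)
print(f"average={avg:.2f} passes={avg >= PASS_MARK} "
      f"above_random={avg - RANDOM_BASELINE:.2f}")
```

Reporting per-subject scores alongside the average makes topic imbalance, such as weaker STEM or China-specific results, visible rather than averaged away.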

Who Should Care

What To Try In 7 Days

Run CMMLU (subset of relevant subjects) on candidate models to get a quick benchmark.

Compare next-token vs free-generation scoring; use next-token for faster, more stable MCQ evaluation.

Test prompt formats: avoid COT for Chinese MCQs unless validated; try one direct-answer and one few-shot prompt per model.
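The direct-answer and few-shot prompt formats mentioned above could be built from one template, as in the sketch below. The instruction wording, field labels, and function names are illustrative assumptions and may differ from the template the authors used.

```python
def format_question(question, options, answer=None):
    """Render one multiple-choice item. The answer letter is appended only
    for few-shot examples; the target item ends with an open answer slot."""
    lines = [f"题目：{question}"]  # "Question:"
    lines += [f"{letter}. {text}" for letter, text in zip("ABCD", options)]
    lines.append("答案是：" + (answer or ""))  # "The answer is:"
    return "\n".join(lines)

def build_prompt(subject, target, examples=()):
    """Zero-shot direct-answer prompt when `examples` is empty, few-shot otherwise."""
    # "The following are single-choice questions about {subject};
    #  give the correct option directly."
    header = f"以下是关于{subject}的单项选择题，请直接给出正确答案的选项。"
    shots = [format_question(q, opts, ans) for q, opts, ans in examples]
    return "\n\n".join([header, *shots, format_question(*target)])

# Hypothetical items, for illustration only.
example = ("1 + 1 = ?", ["1", "2", "3", "4"], "B")
target = ("2 + 2 = ?", ["2", "3", "4", "5"])
zero_shot = build_prompt("数学", target)
one_shot = build_prompt("数学", target, examples=[example])
print(one_shot)
```

Ending the prompt at the answer slot nudges the model to emit a single letter next, which is what next-token scoring relies on.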

Reproducibility

Code Available: yes

Data Available: yes

Open Source Status: yes

Risks & Boundaries

Limitations

  • About 2% estimated label noise from data collection and OCR.
  • Contains 16 China-specific subjects, so scores may not generalize outside Chinese context.
  • Multiple-choice format only; does not test free-form generation or long-form reasoning.
  • Answer extraction for free generation needs robust regex or model-based parsing and can still miss valid answers.

When Not To Use

  • To evaluate free-form or long-answer generation tasks.
  • For non-Chinese language capability assessment.
  • As the sole test for model safety, bias, or hallucination behaviors.

Failure Modes

  • Chain-of-thought prompts can produce long outputs that break regex answer extraction.
  • Negation and sub-option questions cause consistent performance drops.
  • Multilingual models often underperform on China-specific items due to data mismatch.
  • The next-token strategy may fail if the model does not produce a single-letter answer token when no examples are given.
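The extraction failure mode can be made concrete with a small regex sketch for free-generation outputs. The patterns below are illustrative assumptions, not the released extraction code; they return None when no explicit answer phrase is found, which is how a long chain-of-thought output ends up unscored.

```python
import re

# Patterns are illustrative assumptions, tried in order.
_PHRASE_PATTERNS = [
    # "答案是B", "答案：C", "答案选D"
    re.compile(r"答案(?:是|为|选)?\s*[:：]?\s*([ABCD])(?![A-Za-z])"),
    # "Answer: D", "the answer is A"
    re.compile(r"[Aa]nswer(?:\s+is)?\s*[:：]?\s*([ABCD])(?![A-Za-z])"),
]
# A bare option letter at the very start of the output, e.g. "A. 北京".
_LEADING_LETTER = re.compile(r"^\s*([ABCD])(?![A-Za-z])")

def extract_answer(text):
    """Return the answer letter found in a model output, or None.
    For explicit answer phrases the last match wins, since a long
    reasoning trace may discuss other options before concluding."""
    for pattern in _PHRASE_PATTERNS:
        matches = pattern.findall(text)
        if matches:
            return matches[-1]
    m = _LEADING_LETTER.match(text)
    return m.group(1) if m else None

# Hypothetical model outputs, for illustration only.
for output in ["答案是B", "A. 北京", "选项A不正确，所以答案是C", "首先分析各个选项……"]:
    print(repr(output), "->", extract_answer(output))
```

Counting how many outputs come back as None gives a quick check on whether a prompt format is breaking extraction rather than lowering true accuracy.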

Core Entities

Models

  • GPT4
  • ChatGPT
  • LLaMA2-70B
  • LLaMA2-13B
  • LLaMA2-7B
  • LLaMA-65B
  • LLaMA-30B
  • LLaMA-13B
  • Falcon-40B
  • BLOOMZ-7B
  • Baichuan2-13B
  • Baichuan-13B
  • Baichuan-7B
  • InternLM-20B
  • InternLM-7B
  • Xverse-13B
  • ChatGLM2-6B
  • ChatGLM-6B
  • BatGPT-15B
  • SFT
  • Chinese-GLM-10B
  • ZH LLaMA-13B
  • Bactrian-X (BX LLaMA variants)

Metrics

  • Accuracy
  • random baseline (25%)

Datasets

  • CMMLU

Benchmarks

  • MMLU
  • C-Eval
  • M3KE
  • AGIEval