C-EVAL: 13.9k Chinese multiple-choice exam questions across 52 subjects, plus a HARD subset for advanced reasoning

May 15, 2023 · 6 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

90

Authors

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, Junxian He

Links

Abstract / PDF

Why It Matters For Business

C-EVAL exposes where LLMs serving Chinese-language users fail on domain knowledge and hard reasoning; running it before launch reveals the gaps you need to fix.

Summary TLDR

C-EVAL is a large Chinese multiple-choice benchmark (13,948 questions, 52 subjects, four difficulty levels from middle school to professional) designed to test LLM world knowledge and reasoning in Chinese. It includes C-EVAL HARD, an 8-subject subset with difficult math, physics, and chemistry problems. The authors evaluate 11 popular LLMs: only GPT-4 surpasses 60% average accuracy (66.4% zero-shot), and on C-EVAL HARD GPT-4 scores 53.3%. Questions are mostly collected from mock exams (PDF/Word) and validated by humans. The dev exemplars include GPT-4-generated explanations, manually revised, to enable few-shot chain-of-thought (COT) prompting. Test labels are kept private, and a leaderboard is available at cevalbenchmark.com.

Problem Statement

Existing LLM benchmarks are mostly English and miss Chinese cultural, legal, and exam-style knowledge. Developers need a comprehensive Chinese test suite that probes advanced knowledge and reasoning and reduces contamination from widely distributed official exam questions.

Main Contribution

C-EVAL: 13,948 Chinese multiple-choice questions across 52 subjects and four difficulty levels (middle, high, college, professional).

C-EVAL HARD: an 8-subject subset (advanced math, physics, chemistry) targeting hard reasoning problems.

Mitigation steps for data leakage: mostly mock/local exams in PDF/Word, manual parsing and validation.

Dev split includes 5 explanation-annotated exemplars (GPT-4 generated, human-revised) to support few-shot COT.

Comprehensive evaluation of 11 LLMs with public leaderboard and private test split.
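The explanation-annotated dev exemplars can be turned into a few-shot prompt, with or without chain-of-thought. The sketch below is only an illustration: the field names (`question`, `A` to `D`, `explanation`, `answer`) and the English prompt wording are assumptions made for this example, not the template used by the authors.

```python
# Hypothetical item layout: {"question", "A", "B", "C", "D", "explanation", "answer"}.
# The prompt wording is an assumption for illustration, not the paper's template.

def format_question(item, with_answer, use_cot=False):
    """Render one multiple-choice item as text."""
    lines = [item["question"]]
    for label in "ABCD":
        lines.append(f"{label}. {item[label]}")
    if not with_answer:
        # Leave the final slot open for the model to complete.
        lines.append("Explanation:" if use_cot else "Answer:")
    else:
        if use_cot:
            lines.append(f"Explanation: {item['explanation']}")
        lines.append(f"Answer: {item['answer']}")
    return "\n".join(lines)


def build_prompt(exemplars, target, use_cot=False):
    """Concatenate solved exemplars, then the unsolved target question."""
    blocks = [format_question(ex, True, use_cot) for ex in exemplars]
    blocks.append(format_question(target, False, use_cot))
    return "\n\n".join(blocks)
```

Passing the five dev exemplars of a subject as `exemplars` gives a five-shot prompt; setting `use_cot=True` inserts each exemplar's explanation before its answer so the model is nudged to reason first.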

Key Findings

Only GPT-4 exceeds 60% average accuracy on C-EVAL.

Numbers: GPT-4 average accuracy 66.4% (zero-shot AO, Table 3)

C-EVAL HARD remains challenging even for top models.

Numbers: GPT-4 on C-EVAL HARD: 53.3% zero-shot AO (Table 6)

Chinese-oriented models close gaps on culture/politics but lag on STEM reasoning.

Numbers: GLM-130B overall 44.0% vs ChatGPT 51.0% (zero-shot AO, Table 3); gap narrows in humanities/social science

Few-shot and chain-of-thought (COT) help some models but can hurt others.

Numbers: GPT-4 improves from 66.4% to 68.7% with five-shot AO and to 68.3% with COT (Tables 4–5); some instruction-tuned models'

Results

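  • GPT-4, zero-shot average: 66.4% accuracy (baseline: random guessing, 25%)
  • ChatGPT, zero-shot average: 51.0% accuracy (baseline: random guessing, 25%)
  • GPT-4, C-EVAL HARD zero-shot: 53.3% accuracy (baseline: random guessing, 25%)
  • GPT-4, five-shot average: 68.7% accuracy (baseline: GPT-4 zero-shot, 66.4%)
  • GLM-130B, zero-shot average: 44.0% accuracy (baseline: ChatGPT, 51.0%)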

Who Should Care

What To Try In 7 Days

Run your model on the C-EVAL dev/validation splits to get subject-level scores.

Focus improvements on subjects where performance ≈ random, especially advanced STEM.

Try few-shot and chain-of-thought prompts on HARD subjects and log changes per subject.
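A minimal sketch of the subject-level scoring step, assuming you have already collected your model's predictions on a split with known labels. The `(subject, predicted, gold)` record layout, the function names, and the 5-point margin are assumptions made for this example; only the 25% random baseline comes from the four-option format.

```python
from collections import defaultdict

RANDOM_BASELINE = 0.25  # four answer options per question


def subject_accuracy(records):
    """records: iterable of (subject, predicted, gold) tuples.

    Returns a dict mapping each subject to its accuracy in [0, 1].
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        if predicted == gold:
            correct[subject] += 1
    return {s: correct[s] / total[s] for s in total}


def near_random(acc_by_subject, margin=0.05):
    """List subjects whose accuracy is within `margin` of random guessing (or below)."""
    return sorted(
        s for s, acc in acc_by_subject.items() if acc <= RANDOM_BASELINE + margin
    )
```

Running `subject_accuracy` once per prompting setup (zero-shot, few-shot, COT) and comparing the resulting dictionaries gives the per-subject change log suggested above.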

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Multiple-choice format only; not a direct test of open-ended generation or tool use.
  • Data leakage risk reduced but not eliminated; mock exams may still overlap pretraining corpora.
  • HARD subset focuses on STEM; other advanced reasoning types are not emphasized.
  • Test split labels are private, requiring submissions to the website for final scores.

When Not To Use

  • To evaluate model safety, bias, or adversarial robustness.
  • To judge open-ended generation quality or dialogue fluency.
  • As the sole metric for production readiness in interactive systems.

Failure Modes

  • Models may guess near-random on HARD STEM items even if average accuracy looks reasonable.
  • Chain-of-thought prompts can reduce accuracy for models not tuned for few-shot COT.
  • Option-order and sampling biases may still affect model accuracy despite permutation tests.
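To probe the option-order sensitivity mentioned in the last item, one simple check is to shuffle the answer options, remap the gold label, and see whether the model's accuracy changes. The sketch below assumes a hypothetical item layout (`question`, `A` to `D`, `answer`) and is not the authors' permutation procedure.

```python
import random

LABELS = "ABCD"


def permute_options(item, rng):
    """Return a copy of `item` with option texts shuffled and the gold label remapped.

    `item` is assumed to be a dict with keys "question", "A".."D", and "answer".
    """
    texts = [item[label] for label in LABELS]
    order = list(range(len(LABELS)))
    rng.shuffle(order)  # order[new_pos] = old_pos

    permuted = {"question": item["question"]}
    for new_pos, old_pos in enumerate(order):
        permuted[LABELS[new_pos]] = texts[old_pos]

    # Find where the originally correct option ended up.
    gold_old = LABELS.index(item["answer"])
    permuted["answer"] = LABELS[order.index(gold_old)]
    return permuted
```

Scoring the same questions under several seeds (for example `random.Random(0)`, `random.Random(1)`, and so on) and comparing accuracies gives a rough sense of how much a score depends on where the correct option sits.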

Core Entities

Models

  • GPT-4
  • ChatGPT
  • Claude-v1.3
  • Claude-instant-v1.0
  • Bloomz-mt
  • GLM-130B
  • ChatGLM-6B
  • LLaMA-65B
  • MOSS
  • Chinese-Alpaca-13B
  • Chinese-LLaMA-13B

Metrics

  • Accuracy

Datasets

  • C-EVAL
  • C-EVAL HARD

Benchmarks

  • MMLU
  • BIG-bench
  • HELM
  • AGIEval
  • MMCU