C-EVAL: 13.9k Chinese multiple-choice exam questions across 52 subjects, plus a HARD subset for advanced reasoning

May 15, 2023 · 6 min

Overview

Decision Snapshot: Ready For Pilot

The benchmark is ready to use for evaluation and debugging; results reliably show relative model strengths, but do not cover safety, bias, or open-ended generation.

Citations: 90

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, Junxian He

Links

Abstract / PDF / Code / Data

Why It Matters For Business

C-EVAL exposes where LLMs serving Chinese-language users fall short on domain knowledge and hard reasoning; testing with it reveals gaps to fix before product launch.

Who Should Care

Summary TLDR

C-EVAL is a large Chinese multiple-choice benchmark (13,948 questions, 52 subjects, four difficulty levels: middle school, high school, college, and professional) designed to test LLM world knowledge and reasoning in Chinese. It includes C-EVAL HARD, an 8-subject subset of difficult math, physics, and chemistry problems. The authors evaluate 11 popular LLMs: only GPT-4 surpasses 60% average accuracy (66.4% zero-shot, answer-only), and on C-EVAL HARD it scores 53.3%. Data are mostly collected from mock exams (PDF/Word files) and human-validated; the dev exemplars include GPT-4-generated explanations, manually revised, to enable few-shot chain-of-thought (COT) prompting. Test labels are kept private, and a leaderboard is available at cevalbenchmark.com.
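The difference between answer-only and chain-of-thought few-shot prompting can be sketched as follows. This is a minimal illustration, not the authors' prompt template: the `Exemplar` class, its field names, the English "Answer:" wording, and the "Let's think step by step." cue are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """Hypothetical container for one four-choice question."""
    question: str
    choices: dict          # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer: str            # gold label, e.g. "B"
    explanation: str = ""  # reasoning text, used only in chain-of-thought mode


def format_item(ex, with_answer, cot):
    """Render one question; solved exemplars include the answer, the target does not."""
    lines = [ex.question]
    lines += [f"{label}. {text}" for label, text in sorted(ex.choices.items())]
    if not with_answer:
        # Target question: leave the answer slot open for the model.
        lines.append("Answer: Let's think step by step." if cot else "Answer:")
    elif cot:
        lines.append(f"Answer: {ex.explanation} So the answer is {ex.answer}.")
    else:
        lines.append(f"Answer: {ex.answer}")
    return "\n".join(lines)


def build_prompt(exemplars, target, cot=False):
    """Join solved exemplars and the unsolved target; an empty list gives a zero-shot prompt."""
    blocks = [format_item(ex, with_answer=True, cot=cot) for ex in exemplars]
    blocks.append(format_item(target, with_answer=False, cot=cot))
    return "\n\n".join(blocks)
```

Passing an empty exemplar list yields a zero-shot prompt, so one helper covers the zero-shot and few-shot settings mentioned above.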

Problem Statement

Existing LLM benchmarks are mostly English and miss Chinese cultural, legal, and exam-style knowledge. Developers need a comprehensive Chinese test suite that probes advanced knowledge and reasoning and reduces contamination from widely distributed official exam questions.

Main Contribution

C-EVAL: 13,948 Chinese multiple-choice questions across 52 subjects and four difficulty levels (middle, high, college, professional).

C-EVAL HARD: an 8-subject subset (advanced math, physics, chemistry) targeting hard reasoning problems.

Key Findings

Only GPT-4 exceeds 60% average accuracy on C-EVAL.

Numbers: GPT-4 average accuracy 66.4% (zero-shot AO, Table 3)

Practical Use: If you need a Chinese-capable LLM that handles broad exam-style knowledge now, GPT-4 is the best off-the-shelf option; other models lag by roughly 14 points or more.

Evidence Ref: Table 3

C-EVAL HARD remains challenging even for top models.

Numbers: GPT-4 on C-EVAL HARD: 53.3% zero-shot AO (Table 6)

Practical Use: Don't assume high average scores mean strong advanced reasoning; expect substantial errors on hard STEM problems and focus model improvements there.

Evidence Ref: Table 6

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | GPT-4 66.4% | Random 25% | +41.4 pp | C-EVAL test (all subjects) | Table 3 shows GPT-4 66.4% average | Table 3
Accuracy | ChatGPT 51.0% | Random 25% | +26.0 pp | C-EVAL test (all subjects) | Table 3 shows ChatGPT 51.0% average | Table 3
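
The deltas in the table are percentage-point gaps over the 25% random baseline, which is what uniform guessing over four answer options gives. A small sketch of that arithmetic, using only the two accuracies reported above (the function and variable names are illustrative):

```python
NUM_CHOICES = 4                          # assumed from the 25% random baseline
RANDOM_BASELINE = 100.0 / NUM_CHOICES    # 25.0


def delta_pp(accuracy_pct, baseline_pct=RANDOM_BASELINE):
    """Percentage-point gap between a model's accuracy and the baseline."""
    return round(accuracy_pct - baseline_pct, 1)


reported = {"GPT-4": 66.4, "ChatGPT": 51.0}   # zero-shot averages from the table
deltas = {model: delta_pp(acc) for model, acc in reported.items()}
print(deltas)  # {'GPT-4': 41.4, 'ChatGPT': 26.0}
```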

What To Try In 7 Days

Run your model on the C-EVAL dev/validation splits to get subject-level scores.

Focus improvements on subjects where accuracy is close to the 25% random baseline, especially advanced STEM.

Try few-shot and chain-of-thought prompts on HARD subjects and log changes per subject.
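
The three steps above can be sketched in a few helper functions. This is a minimal sketch under stated assumptions: the `(subject, gold, pred)` record layout, the function names, and the 5-point margin are hypothetical choices for illustration, not part of the benchmark's tooling.

```python
from collections import defaultdict


def subject_accuracy(records):
    """Per-subject accuracy from (subject, gold_label, predicted_label) tuples."""
    correct, total = defaultdict(int), defaultdict(int)
    for subject, gold, pred in records:
        total[subject] += 1
        correct[subject] += int(gold == pred)
    return {s: correct[s] / total[s] for s in total}


def near_random(acc_by_subject, chance=0.25, margin=0.05):
    """Subjects whose accuracy is within `margin` of (or below) the chance level."""
    return sorted(s for s, a in acc_by_subject.items() if a <= chance + margin)


def per_subject_change(acc_before, acc_after):
    """Percentage-point change per subject between two runs (e.g. AO vs. COT)."""
    shared = acc_before.keys() & acc_after.keys()
    return {s: round(100 * (acc_after[s] - acc_before[s]), 1) for s in sorted(shared)}


# Toy predictions, invented purely to show the call pattern.
toy = [("physics", "A", "A"), ("physics", "B", "C"),
       ("physics", "C", "D"), ("physics", "D", "A"),
       ("law", "A", "A"), ("law", "B", "B")]
acc = subject_accuracy(toy)
print(near_random(acc))  # ['physics']
```

Logging `per_subject_change` for each prompt variant keeps regressions visible, since a change that helps the average can still hurt individual subjects.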

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Multiple-choice format only; not a direct test of open-ended generation or tool use.

Data leakage risk reduced but not eliminated; mock exams may still overlap pretraining corpora.

When Not To Use

To evaluate model safety, bias, or adversarial robustness.

To judge open-ended generation quality or dialogue fluency.

Failure Modes

Models may guess near-random on HARD STEM items even if average accuracy looks reasonable.

Chain-of-thought prompts can reduce accuracy for models not tuned for few-shot COT.
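
To check whether a subject score is actually distinguishable from guessing, one option is an exact one-sided binomial test against the 25% chance level. The sketch below is an illustrative check rather than anything the authors report; the function name and the example counts are assumptions.

```python
from math import comb


def p_value_above_chance(num_correct, num_questions, chance=0.25):
    """P(X >= num_correct) for X ~ Binomial(num_questions, chance).

    A large value means the observed score is plausible under pure guessing.
    """
    return sum(
        comb(num_questions, k) * chance**k * (1 - chance) ** (num_questions - k)
        for k in range(num_correct, num_questions + 1)
    )


# Hypothetical counts: 30/100 correct is hard to tell apart from guessing,
# while 40/100 correct is very unlikely to come from guessing alone.
print(p_value_above_chance(30, 100))
print(p_value_above_chance(40, 100))
```

On small per-subject samples, a score a few points above 25% may not be meaningful, so it is worth running this check before treating a subject as "learned".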

Core Entities

Models

GPT-4, ChatGPT, Claude-v1.3, Claude-instant-v1.0, Bloomz-mt, GLM-130B, ChatGLM-6B, LLaMA-65B, MOSS, Chinese-Alpaca-13B, Chinese-LLaMA-13B

Metrics

Accuracy

Datasets

C-EVAL, C-EVAL HARD

Benchmarks

MMLU, BIG-bench, HELM, AGIEval, MMCU