C-EVAL: 13.9k Chinese multiple-choice exam questions across 52 subjects, plus a HARD subset for advanced reasoning

Overview

Decision Snapshot: Ready For Pilot

The benchmark is ready to use for evaluation and debugging; results reliably show relative model strengths, but do not cover safety, bias, or open-ended generation.

Citations: 90

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, Junxian He

Links

Abstract / PDF / Code / Data

Why It Matters For Business

C-EVAL exposes where LLMs serving Chinese-language users fail on domain knowledge and hard reasoning; testing with it reveals gaps you need to fix before product launch.

Who Should Care

Product Manager, ML Engineer, CTO, Founder

Summary TLDR

C-EVAL is a large Chinese multiple-choice benchmark (13,948 questions, 52 subjects, four difficulty levels from middle school to professional) designed to test LLM world knowledge and reasoning in Chinese. It includes C-EVAL HARD, an 8-subject subset of difficult math, physics, and chemistry problems. The authors evaluate 11 popular LLMs: only GPT-4 surpasses 60% average accuracy (66.4% zero-shot), and on C-EVAL HARD it scores 53.3%. Data are mostly collected from mock exams (PDF/Word files) and validated by humans; the dev exemplars include GPT-4-generated explanations, manually revised, to enable few-shot chain-of-thought (CoT) prompting. Test labels are kept private, and a leaderboard is available at cevalbenchmark.com.

Problem Statement

Existing LLM benchmarks are mostly English and miss Chinese cultural, legal, and exam-style knowledge. Developers need a comprehensive Chinese test suite that probes advanced knowledge and reasoning and reduces contamination from widely distributed official exam questions.

Main Contribution

C-EVAL: 13,948 Chinese multiple-choice questions across 52 subjects and four difficulty levels (middle, high, college, professional).

C-EVAL HARD: an 8-subject subset (advanced math, physics, chemistry) targeting hard reasoning problems.

Key Findings

Only GPT-4 exceeds 60% average accuracy on C-EVAL.

Numbers: GPT-4 average accuracy 66.4% (zero-shot AO, Table 3)

Practical Use: If you need a Chinese-capable LLM that handles broad exam-style knowledge now, GPT-4 is the best off-the-shelf option; other models lag by ~14+ points.

Evidence Ref: Table 3

C-EVAL HARD remains challenging even for top models.

Numbers: GPT-4 on C-EVAL HARD: 53.3% zero-shot AO (Table 6)

Practical Use: Don't assume high average scores mean strong advanced reasoning; expect substantial errors on hard STEM problems and focus model improvements there.

Evidence Ref: Table 6

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-4 66.4%	Random 25%	+41.4 pp	C-EVAL test (all subjects)	Table 3 shows GPT-4 66.4% average	Table 3
Accuracy	ChatGPT 51.0%	Random 25%	+26.0 pp	C-EVAL test (all subjects)	Table 3 shows ChatGPT 51.0% average	Table 3
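
The Delta column is the reported accuracy minus the random-guess baseline, in percentage points. A minimal sketch of that arithmetic, assuming four answer options per question (which is what a 25% random baseline implies); the function and variable names are illustrative and not taken from the paper:

```python
# Random-guess baseline for four-option multiple choice (assumed: options A-D).
RANDOM_BASELINE = 100.0 / 4  # 25.0%

# Accuracies (%) copied from the Results table.
reported = {"GPT-4": 66.4, "ChatGPT": 51.0}


def delta_pp(accuracy, baseline=RANDOM_BASELINE):
    """Gain over the baseline in percentage points, rounded to one decimal."""
    return round(accuracy - baseline, 1)


deltas = {model: delta_pp(acc) for model, acc in reported.items()}
for model, gain in deltas.items():
    print(f"{model}: +{gain} pp over random")
```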

What To Try In 7 Days

Run your model on the C-EVAL dev/validation splits to get subject-level scores.

Focus improvements on subjects where performance ≈ random, especially advanced STEM.

Try few-shot and chain-of-thought prompts on HARD subjects and log changes per subject.
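
The first two steps above can be sketched as a small scoring helper. This is an illustration under stated assumptions, not the benchmark's official evaluation code: the record layout, the function names, the 5-point margin, and the toy subject names are all hypothetical. Because test labels are private, it only applies to splits where gold answers are available.

```python
from collections import defaultdict


def subject_accuracy(records):
    """records: iterable of (subject, predicted_letter, gold_letter) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        correct[subject] += int(predicted == gold)
    return {s: correct[s] / total[s] for s in total}


def near_random(scores, baseline=0.25, margin=0.05):
    """Subjects scoring at or below the 4-option guess rate plus `margin`."""
    return sorted(s for s, acc in scores.items() if acc <= baseline + margin)


# Toy records, invented for illustration only.
records = [
    ("advanced_mathematics", "A", "B"),
    ("advanced_mathematics", "C", "C"),
    ("advanced_mathematics", "D", "A"),
    ("advanced_mathematics", "B", "D"),
    ("chinese_language_and_literature", "A", "A"),
    ("chinese_language_and_literature", "B", "B"),
]
scores = subject_accuracy(records)
print(scores)
print(near_random(scores))
```

Logging `scores` once per prompt style (zero-shot, few-shot, chain-of-thought) gives the per-subject comparison the third step asks for.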

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/hkust-nlp/ceval

Data URLs

https://github.com/hkust-nlp/ceval

https://cevalbenchmark.com

Risks & Boundaries

Limitations

Multiple-choice format only; not a direct test of open-ended generation or tool use.

Data leakage risk reduced but not eliminated; mock exams may still overlap pretraining corpora.

When Not To Use

To evaluate model safety, bias, or adversarial robustness.

To judge open-ended generation quality or dialogue fluency.

Failure Modes

Models may guess near-random on HARD STEM items even if average accuracy looks reasonable.

Chain-of-thought prompts can reduce accuracy for models not tuned for few-shot CoT.
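
One way to check whether chain-of-thought helps or hurts a given model is to build answer-only and chain-of-thought prompts from the same exemplars and compare accuracy. The sketch below is hypothetical: the English template wording, the field names (`question`, `A` to `D`, `answer`, `explanation`), and the toy items are assumptions for illustration, and the benchmark's own prompts are written in Chinese.

```python
CHOICES = ("A", "B", "C", "D")


def format_item(item, with_answer, use_cot):
    """Render one question; exemplars carry the answer, the target does not."""
    lines = [item["question"]]
    lines += [f"{c}. {item[c]}" for c in CHOICES]
    if not with_answer:
        lines.append("Answer: Let's think step by step." if use_cot else "Answer:")
    elif use_cot:
        lines.append(f"Answer: {item['explanation']} So the answer is {item['answer']}.")
    else:
        lines.append(f"Answer: {item['answer']}")
    return "\n".join(lines)


def build_prompt(exemplars, target, use_cot=False):
    """Few-shot prompt: solved exemplars followed by the unsolved target."""
    blocks = [format_item(ex, True, use_cot) for ex in exemplars]
    blocks.append(format_item(target, False, use_cot))
    return "\n\n".join(blocks)


# Toy items, invented for illustration only.
exemplar = {
    "question": "2 + 3 = ?",
    "A": "4", "B": "5", "C": "6", "D": "7",
    "answer": "B",
    "explanation": "2 plus 3 equals 5.",
}
target = {"question": "10 / 2 = ?", "A": "2", "B": "4", "C": "5", "D": "8"}

answer_only_prompt = build_prompt([exemplar], target)
cot_prompt = build_prompt([exemplar], target, use_cot=True)
print(cot_prompt)
```

Keeping the exemplars identical across both modes means any accuracy difference comes from the reasoning text rather than from a different choice of examples.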

Core Entities

Models

GPT-4, ChatGPT, Claude-v1.3, Claude-instant-v1.0, Bloomz-mt, GLM-130B, ChatGLM-6B, LLaMA-65B, MOSS, Chinese-Alpaca-13B, Chinese-LLaMA-13B

Metrics

Accuracy

Datasets

C-EVAL, C-EVAL HARD

Benchmarks

MMLU, BIG-bench, HELM, AGIEval, MMCU

