Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
90
Why It Matters For Business
C-EVAL exposes where LLMs serving Chinese-language users fail on domain knowledge and hard reasoning; testing with it reveals gaps you need to fix before product launch.
Summary TLDR
C-EVAL is a large Chinese multiple-choice benchmark (13,948 questions, 52 subjects, four difficulty levels from middle school to professional) designed to test LLM world knowledge and reasoning in Chinese. It includes C-EVAL HARD, an 8-subject subset with difficult math, physics, and chemistry problems. The authors evaluate 11 popular LLMs: only GPT-4 surpasses 60% average accuracy (66.4% zero-shot), and on C-EVAL HARD GPT-4 scores about 53%. Questions come mostly from mock exams (PDF/Word files) and are human-validated; the dev exemplars carry GPT-4-generated explanations that were manually revised, which enables few-shot chain-of-thought (COT) prompting. Test labels are kept private and a leaderboard is available at cevalbenchmark.com.
Problem Statement
Existing LLM benchmarks are mostly English and miss Chinese cultural, legal, and exam-style knowledge. Developers need a comprehensive Chinese test suite that probes advanced knowledge and reasoning and reduces contamination from widely distributed official exam questions.
Main Contribution
C-EVAL: 13,948 Chinese multiple-choice questions across 52 subjects and four difficulty levels (middle, high, college, professional).
C-EVAL HARD: an 8-subject subset (advanced math, physics, chemistry) targeting hard reasoning problems.
Data-leakage mitigation: questions come mostly from mock and local exams distributed as PDF/Word files, which were manually parsed and validated.
Dev split includes 5 explanation-annotated exemplars per subject (GPT-4 generated, human-revised) to support few-shot COT.
Comprehensive evaluation of 11 LLMs, with a public leaderboard and a private test split.
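As an illustration of how explanation-annotated dev exemplars can feed few-shot prompting, here is a minimal sketch that assembles either an answer-only or a COT prompt. The field names (`question`, `A` to `D`, `answer`, `explanation`) and the English cue words are assumptions made for this example; this is not the authors' prompt template.

```python
# Hypothetical record layout: {"question", "A", "B", "C", "D", "answer", "explanation"}.
# The "Explanation:" / "Answer:" cues are illustrative, not the benchmark's official wording.

def format_example(ex, cot=False, answered=True):
    """Render one multiple-choice item as prompt text."""
    lines = [ex["question"]]
    lines += [f"{letter}. {ex[letter]}" for letter in "ABCD"]
    if answered:
        # Solved exemplar: optionally show the reasoning before the answer.
        if cot:
            lines.append(f"Explanation: {ex['explanation']}")
        lines.append(f"Answer: {ex['answer']}")
    else:
        # Unsolved target question: end on the cue the model should continue from.
        lines.append("Explanation:" if cot else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(dev_examples, test_example, cot=False):
    """Concatenate solved dev exemplars followed by the unsolved test item."""
    shots = [format_example(ex, cot=cot) for ex in dev_examples]
    shots.append(format_example(test_example, cot=cot, answered=False))
    return "\n\n".join(shots)
```

Passing `cot=False` produces an answer-only prompt from the same exemplars, which makes it easy to compare both settings per subject.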
Key Findings
Only GPT-4 exceeds 60% average accuracy on C-EVAL.
C-EVAL HARD remains challenging even for top models.
Chinese-oriented models close gaps on culture/politics but lag on STEM reasoning.
Few-shot and chain-of-thought (COT) help some models but can hurt others.
Results
Accuracy
Who Should Care
What To Try In 7 Days
Run your model on the C-EVAL dev/validation splits to get subject-level scores.
Focus improvements on subjects where accuracy is close to random guessing (25% for four-option questions), especially advanced STEM.
Try few-shot and chain-of-thought prompts on HARD subjects and log changes per subject.
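The steps above can be sketched as a small scoring helper that computes per-subject accuracy and flags subjects near the random baseline. The `(subject, predicted, gold)` record layout and the 5-point margin are assumptions made for this example; adapt them to however your evaluation harness stores results.

```python
from collections import defaultdict

# Assumes four answer options (A-D) per question, so random guessing scores 25%.
RANDOM_BASELINE = 0.25


def subject_accuracy(records):
    """records: iterable of (subject, predicted_letter, gold_letter) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        if predicted == gold:
            correct[subject] += 1
    return {subject: correct[subject] / total[subject] for subject in total}


def near_random(scores, margin=0.05):
    """Subjects whose accuracy is at or below the random baseline plus `margin`."""
    return sorted(s for s, acc in scores.items() if acc <= RANDOM_BASELINE + margin)
```

Running `subject_accuracy` once per prompting setup (zero-shot, few-shot, COT) and diffing the resulting dictionaries gives the per-subject change log suggested above.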
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Multiple-choice format only; not a direct test of open-ended generation or tool use.
- Data leakage risk reduced but not eliminated; mock exams may still overlap pretraining corpora.
- HARD subset focuses on STEM; other advanced reasoning types are not emphasized.
- Test split labels are private, requiring submissions to the website for final scores.
When Not To Use
- To evaluate model safety, bias, or adversarial robustness.
- To judge open-ended generation quality or dialogue fluency.
- As the sole metric for production readiness in interactive systems.
Failure Modes
- Models may score near random chance on HARD STEM items even if average accuracy looks reasonable.
- Chain-of-thought prompts can reduce accuracy for models not tuned for few-shot COT.
- Option-order and sampling biases may still affect model accuracy despite permutation tests.
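One way to probe the option-order failure mode locally is to shuffle the answer options, remap the gold letter, and compare accuracy before and after. The sketch below assumes a hypothetical `predict(item)` callable that returns a letter and items stored as dicts with `question`, `A` to `D`, and `answer` keys; it is an illustrative check, not the procedure used by the benchmark authors.

```python
import random

LETTERS = "ABCD"


def permute_options(item, rng):
    """Return a copy of `item` with options shuffled and the gold letter remapped."""
    order = list(LETTERS)
    rng.shuffle(order)
    # order[i] is the original letter whose text now sits at position LETTERS[i].
    shuffled = {"question": item["question"]}
    for new_letter, old_letter in zip(LETTERS, order):
        shuffled[new_letter] = item[old_letter]
    shuffled["answer"] = LETTERS[order.index(item["answer"])]
    return shuffled


def accuracy(items, predict):
    """Fraction of items where predict(item) matches the gold letter."""
    return sum(predict(item) == item["answer"] for item in items) / len(items)


def order_sensitivity(items, predict, seed=0):
    """Return (accuracy on original order, accuracy on shuffled order)."""
    rng = random.Random(seed)
    permuted = [permute_options(item, rng) for item in items]
    return accuracy(items, predict), accuracy(permuted, predict)
```

A large gap between the two numbers suggests the model is keying on option position rather than content; repeating the check with several seeds gives a steadier estimate.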
Core Entities
Models
- GPT-4
- ChatGPT
- Claude-v1.3
- Claude-instant-v1.0
- Bloomz-mt
- GLM-130B
- ChatGLM-6B
- LLaMA-65B
- MOSS
- Chinese-Alpaca-13B
- Chinese-LLaMA-13B
Metrics
- Accuracy
Datasets
- C-EVAL
- C-EVAL HARD
Benchmarks
- MMLU
- BIG-bench
- HELM
- AGIEval
- MMCU

