Overview
The benchmark is ready to use for evaluation and debugging. Its results reliably show relative model strengths, but they do not cover safety, bias, or open-ended generation.
Citations: 90
Evidence Strength: 0.90
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
C-EVAL exposes where LLMs that serve Chinese-language users fail on domain knowledge and hard reasoning. Testing with it reveals gaps you need to fix before product launch.
Summary TLDR
C-EVAL is a large Chinese multiple-choice benchmark (13,948 questions, 52 subjects, four school/professional difficulty levels) designed to test LLM world knowledge and reasoning in Chinese. It includes C-EVAL HARD, an 8-subject subset of difficult math, physics, and chemistry problems. The authors evaluate 11 popular LLMs: only GPT-4 surpasses 60% average accuracy (66.4% zero-shot), and on C-EVAL HARD GPT-4 scores about 53%. The data are mostly collected from mock exams (PDF/Word files) and validated by humans. The dev exemplars include GPT-4-generated explanations, manually revised, to enable few-shot chain-of-thought (CoT) prompting. Test labels are kept private, and a leaderboard is available at cevalbenchmark.com.
Problem Statement
Existing LLM benchmarks are mostly in English and miss Chinese cultural, legal, and exam-style knowledge. Developers need a comprehensive Chinese test suite that probes advanced knowledge and reasoning, and that limits contamination by avoiding widely distributed official exam questions.
Main Contribution
C-EVAL: 13,948 Chinese multiple-choice questions across 52 subjects and four difficulty levels (middle, high, college, professional).
C-EVAL HARD: an 8-subject subset (advanced math, physics, chemistry) targeting hard reasoning problems.
Key Findings
Only GPT-4 exceeds 60% average accuracy on C-EVAL.
C-EVAL HARD remains challenging even for top models.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | GPT-4 66.4% | Random 25% | +41.4 pp | C-EVAL test (all subjects) | Table 3 shows GPT-4 66.4% average | Table 3 |
| Accuracy | ChatGPT 51.0% | Random 25% | +26.0 pp | C-EVAL test (all subjects) | Table 3 shows ChatGPT 51.0% average | Table 3 |
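The Delta column is the gap, in percentage points, between each model's accuracy and the 25% random-guess baseline for four-option questions. A minimal sketch of that arithmetic follows; the function name `delta_pp` is illustrative and not part of any C-EVAL tooling.

```python
def delta_pp(accuracy_pct: float, baseline_pct: float = 25.0) -> float:
    """Percentage-point gap between an accuracy and a baseline (both in %)."""
    # Rounding hides floating-point noise such as 41.400000000000006.
    return round(accuracy_pct - baseline_pct, 1)

# Accuracies copied from the Results table; 25% is random guessing.
reported = {"GPT-4": 66.4, "ChatGPT": 51.0}
deltas = {model: delta_pp(acc) for model, acc in reported.items()}
```

Reporting the gap in percentage points rather than as a ratio keeps it comparable across models that share the same baseline.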
What To Try In 7 Days
Run your model on the C-EVAL dev/validation splits to get subject-level scores.
Focus improvements on subjects where accuracy is close to the 25% random baseline, especially advanced STEM.
Try few-shot and chain-of-thought prompts on HARD subjects and log changes per subject.
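The first two steps can be sketched as a per-subject accuracy tally followed by a near-random filter. This is a sketch under stated assumptions: the record layout `(subject, predicted, gold)`, the subject names, the 5-point margin, and the toy predictions are all hypothetical, not the official C-EVAL data format or results.

```python
from collections import defaultdict

def subject_accuracy(records):
    """Accuracy per subject from (subject, predicted, gold) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        if predicted == gold:
            correct[subject] += 1
    return {s: correct[s] / total[s] for s in total}

def near_random(acc_by_subject, chance=0.25, margin=0.05):
    """Subjects whose accuracy is at most `margin` above chance."""
    return sorted(s for s, a in acc_by_subject.items() if a <= chance + margin)

# Toy predictions (made up for illustration only).
records = [
    ("advanced_mathematics", "A", "B"),
    ("advanced_mathematics", "C", "C"),
    ("advanced_mathematics", "D", "A"),
    ("advanced_mathematics", "B", "D"),
    ("chinese_language", "A", "A"),
    ("chinese_language", "B", "B"),
    ("chinese_language", "C", "D"),
    ("chinese_language", "D", "D"),
]
acc = subject_accuracy(records)
flagged = near_random(acc)
```

With only a handful of questions per subject the estimate is noisy, so a wider margin or a minimum question count may be sensible before acting on a flag.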
Risks & Boundaries
Limitations
Multiple-choice format only; not a direct test of open-ended generation or tool use.
Data leakage risk reduced but not eliminated; mock exams may still overlap pretraining corpora.
When Not To Use
To evaluate model safety, bias, or adversarial robustness.
To judge open-ended generation quality or dialogue fluency.
Failure Modes
Models may guess near-random on HARD STEM items even if average accuracy looks reasonable.
Chain-of-thought prompts can reduce accuracy for models not tuned for few-shot COT.
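One way to catch the second failure mode is to compare per-subject accuracy between answer-only and chain-of-thought runs and list the subjects that got worse. The sketch below assumes two dictionaries mapping subject to accuracy; the function name, the tolerance parameter, and the numbers are illustrative, not reported results.

```python
def cot_regressions(answer_only, cot, tolerance=0.0):
    """Subjects where CoT accuracy falls below answer-only accuracy.

    Returns {subject: change}, largest drop first; change is negative.
    """
    drops = {}
    # Only compare subjects that were evaluated in both runs.
    for subject in answer_only.keys() & cot.keys():
        change = cot[subject] - answer_only[subject]
        if change < -tolerance:
            drops[subject] = round(change, 3)
    return dict(sorted(drops.items(), key=lambda item: item[1]))

# Made-up accuracies for illustration only.
answer_only = {"college_physics": 0.40, "high_school_chemistry": 0.30}
cot = {"college_physics": 0.35, "high_school_chemistry": 0.36}
regressions = cot_regressions(answer_only, cot)
```

Setting `tolerance` above zero ignores small drops that may be sampling noise rather than a real prompting effect.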

