Overview
The benchmark and experiments are practical and reproducible; human scoring gives solid evidence, but broad adoption needs more error analysis and more models graded by humans.
Citations: 18
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
GAOKAO-Bench exposes realistic task gaps: LLMs are good at knowledge and language tasks but weaker at multi-step math and physics. Use this to choose models, design human-in-the-loop checks, and pilot automated grading.
Who Should Care
Summary TLDR
The authors build GAOKAO-Bench, a dataset of Chinese GAOKAO exam questions (2010–2022) that mixes objective and subjective items. They evaluate several LLMs (GPT-4, GPT-3.5, ERNIE-Bot, Baichuan, LLaMA, ChatGLM) in zero-shot mode and use human scoring for subjective items. Main findings: GPT-4 scores well (converted totals above 400); models do better in humanities than in sciences; subject gaps are large, with math and physics the weakest; and GPT-4-turbo, when given marking criteria, grades subjective answers with high correlation to teacher scores.
Problem Statement
Existing LLM benchmarks often use only objective questions or synthetic tasks and miss real-world exam-style subjective items. The field needs a human-aligned, exam-style test suite that measures generative answers and grading ability, and that can expose subject-specific strengths and weaknesses.
Main Contribution
GAOKAO-Bench dataset: national GAOKAO questions (2010–2022), 9 subjects, 2811 questions (1781 objective, 1030 subjective).
Zero-shot evaluation protocol and human scoring for subjective questions; public prompting examples and marking criteria.
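To make the zero-shot protocol concrete for objective items, here is a minimal sketch: the prompt contains only the instruction and the question, with no worked examples, and the final choice is parsed from the reply. The prompt wording, the `Answer: <letter>` convention, and the function names `build_zero_shot_prompt` and `extract_choice` are illustrative assumptions, not the authors' published prompts.

```python
import re

def build_zero_shot_prompt(subject, question, options):
    """Build a prompt with no in-context examples (zero-shot).

    `options` is a list of (letter, text) pairs. The wording below is a
    hypothetical template, not the benchmark's official prompt.
    """
    lines = [
        f"Subject: {subject}",
        "Answer the following multiple-choice question.",
        "End your reply with 'Answer: <letter>'.",
        "",
        question,
    ]
    lines += [f"{letter}. {text}" for letter, text in options]
    return "\n".join(lines)

# Assumed reply convention: the model ends with "Answer: X", X in A-D.
_ANSWER = re.compile(r"Answer:\s*([A-D])\b")

def extract_choice(reply):
    """Return the last declared choice letter, or None if none is found."""
    matches = _ANSWER.findall(reply)
    return matches[-1] if matches else None
```

Taking the last match rather than the first is a design choice: a model that reasons aloud may mention candidate answers before committing to one.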
Key Findings
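GPT-4 attains strong exam performance (converted totals above 400) but falls short of full marks.
Objective and subjective scoring rates differ (71.6% versus 50.8% overall for GPT-4-0613) and vary by subject, with sciences trailing humanities.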
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Objective scoring rate (GPT-4-0613) | 71.6% overall | — | — | GAOKAO-Bench objective (Table 1) | Table 1: GPT-4-0613 objective overall 71.6% | Table 1 |
| Subjective scoring rate (GPT-4-0613, human-scored) | 50.8% overall | — | — | GAOKAO-Bench subjective (Table 2) | Table 2: GPT-4-0613 subjective overall 50.8% | Table 2 |
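The two overall figures in the table differ by 20.8 percentage points. As a rough sketch of how such rates can be computed and broken down by subject, the code below assumes a scoring rate means points earned divided by points available. The record layout and the function names `scoring_rates` and `overall_rate` are hypothetical, and the toy numbers are made up for illustration, not taken from the paper.

```python
from collections import defaultdict

def scoring_rates(records):
    """Return {subject: percent of available points earned}.

    Each record is a (subject, earned_points, max_points) tuple. This
    layout is an assumed format, not the benchmark's file schema.
    """
    earned = defaultdict(float)
    available = defaultdict(float)
    for subject, got, max_pts in records:
        earned[subject] += got
        available[subject] += max_pts
    return {
        s: 100.0 * earned[s] / available[s]
        for s in available
        if available[s] > 0
    }

def overall_rate(records):
    """Percent of all available points earned, pooled across subjects."""
    total_earned = sum(r[1] for r in records)
    total_max = sum(r[2] for r in records)
    return 100.0 * total_earned / total_max

# Made-up toy records, for illustration only.
toy = [("history", 8, 10), ("history", 6, 10), ("math", 3, 12), ("math", 3, 12)]
rates = scoring_rates(toy)
```

Pooling points before dividing weights each subject by its total marks; averaging per-subject rates instead would weight every subject equally, so the two can give different overall numbers.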
What To Try In 7 Days
Run your model on GAOKAO-Bench zero-shot to see subject gaps.
Test GPT-4-turbo as an automated grader using provided marking criteria and compare to a small set of teacher scores.
Prioritize fine-tuning or tool-use for math/physics before deploying for calculation-heavy tasks.
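For the automated-grader pilot, one simple check is to compare model-assigned scores with teacher scores on the same answers. The sketch below computes a Pearson correlation and a mean absolute difference in plain Python. The choice of these two statistics and the sample scores are assumptions for illustration; the source reports high correlation with teachers but this is not the authors' evaluation code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)

def mean_abs_diff(xs, ys):
    """Average absolute gap between paired scores, in score points."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Made-up scores for five answers, for illustration only.
teacher = [10, 7, 4, 9, 2]
model = [9, 7, 5, 9, 3]
agreement = pearson(teacher, model)
gap = mean_abs_diff(teacher, model)
```

Reporting both numbers is useful because correlation alone can hide a grader that ranks answers correctly but is consistently too generous or too strict.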
Risks & Boundaries
Limitations
No deep error analysis of hallucinations or reasoning errors.
Human scoring was costly; not all models were evaluated with human grading.
When Not To Use
As the only benchmark for math or physics reasoning without additional tool support.
To fully replace human graders in humanities tasks without spot checks.
Failure Modes
Strong subject bias: good at language/knowledge but weak at multi-step math.
Automated judge may over- or under-score humanities without fine-grained rubrics.

