Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
18
Why It Matters For Business
GAOKAO-Bench exposes realistic task gaps: LLMs handle knowledge and language tasks well but are weaker at multi-step math and physics. Use it to choose models, design human-in-the-loop checks, and pilot automated grading.
Summary TLDR
The authors build GAOKAO-Bench, a dataset of Chinese GAOKAO exam questions (2010–2022) that mixes objective and subjective items. They evaluate many LLMs (GPT-4, GPT-3.5, ERNIE-Bot, Baichuan, LLaMA, ChatGLM) in zero-shot mode and use human scoring for subjective items. Main findings: GPT-4 scores well (converted totals >400); models do better in humanities than in sciences; subject gaps are large, with math and physics weakest; and GPT-4-turbo can grade subjective answers with high correlation to teachers when given marking criteria.
Problem Statement
Existing LLM benchmarks often use only objective questions or synthetic tasks and miss real-world exam-style subjective items. The field needs a human-aligned, exam-style test suite that measures generative answers and grading ability, and that can expose subject-specific strengths and weaknesses.
Main Contribution
GAOKAO-Bench dataset: national GAOKAO questions (2010–2022), 9 subjects, 2811 questions (1781 objective, 1030 subjective).
Zero-shot evaluation protocol and human scoring for subjective questions; public prompting examples and marking criteria.
LLM-as-a-Judge study: using GPT-4-turbo with teacher marking criteria to grade subjective answers and measuring correlation with human graders.
Released resources and a 2023 supplement (GAOKAO-Bench-2023) to reduce dataset leakage.
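The LLM-as-a-Judge setup described above can be sketched as a prompt builder plus a score parser. This is a minimal illustration only: the prompt wording, the `Score: <number>` reply format, and the function names `build_judge_prompt` and `parse_score` are all hypothetical and are not the authors' released prompts. No model API is called here; the reply string is supplied by hand.

```python
import re


def build_judge_prompt(question, marking_criteria, answer, max_score):
    """Assemble a grading prompt (hypothetical wording, not the paper's)."""
    return (
        "You are grading an exam answer.\n"
        f"Question:\n{question}\n\n"
        f"Marking criteria:\n{marking_criteria}\n\n"
        f"Student answer:\n{answer}\n\n"
        f"Give a score from 0 to {max_score} in the form 'Score: <number>'."
    )


def parse_score(reply, max_score):
    """Extract the numeric score from a judge reply, capped at max_score.

    Returns None when the reply does not contain a 'Score: <number>' pattern,
    so the caller can route that item to a human grader.
    """
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        return None
    return min(float(match.group(1)), float(max_score))


prompt = build_judge_prompt(
    question="Explain why the reaction rate rises with temperature.",
    marking_criteria="2 points: collision frequency. 2 points: activation energy.",
    answer="Molecules move faster, so more collisions exceed the activation energy.",
    max_score=4,
)
score = parse_score("Both points are covered. Score: 4", max_score=4)
```

Returning `None` for unparseable replies, rather than defaulting to zero, keeps judge failures visible instead of silently lowering a model's score.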
Key Findings
GPT-4 attains strong exam performance but falls short of full marks.
Objective and subjective scoring rates differ and vary by subject.
Large subject gaps: strong in language/biology/geography, weak in math/physics.
LLM-as-a-Judge aligns well with human graders when given marking criteria.
Automated grading is closer to human scores on sciences than humanities.
Results
Objective scoring rate (GPT-4-0613)
Subjective scoring rate (GPT-4-0613, human-scored)
Converted total scores (GPT-4-0613, human)
LLM-as-a-Judge correlation (GPT-4-turbo vs human)
Stability across years
Who Should Care
What To Try In 7 Days
Run your model on GAOKAO-Bench zero-shot to see subject gaps.
Test GPT-4-turbo as an automated grader using provided marking criteria and compare to a small set of teacher scores.
Prioritize fine-tuning or tool-use for math/physics before deploying for calculation-heavy tasks.
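To compare an automated grader against a small set of teacher scores, as suggested above, rank correlations can be computed without extra dependencies. The sketch below implements Spearman correlation (Pearson on average ranks) and Kendall tau-b in plain Python. The score lists are made-up illustrative numbers, not results from the paper, and the function names are assumptions chosen for this example.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b, which corrects for ties in either list."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: counted in neither tie term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    denom = ((pairs + ties_x) * (pairs + ties_y)) ** 0.5
    if denom == 0:
        raise ValueError("correlation is undefined for constant input")
    return (concordant - discordant) / denom


# Made-up scores for five answers, for illustration only.
teacher_scores = [8, 5, 9, 3, 6]
judge_scores = [7, 6, 9, 4, 5]
rho = spearman(teacher_scores, judge_scores)
tau = kendall_tau_b(teacher_scores, judge_scores)
```

Rank correlations only measure whether the judge orders answers the way teachers do. A judge that is consistently two points too generous would still score well, so it is worth also checking the mean difference between the two score lists.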
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- No deep error analysis of hallucinations or reasoning errors.
- Human scoring was costly; not all models were evaluated with human grading.
- Possible dataset leakage into model training is acknowledged but not fully eliminated.
When Not To Use
- As the only benchmark for math or physics reasoning without additional tool support.
- To fully replace human graders in humanities tasks without spot checks.
- To evaluate models that may have been trained on GAOKAO data, unless leakage has been checked.
Failure Modes
- Strong subject bias: good at language/knowledge but weak at multi-step math.
- Automated judge may over- or under-score humanities without fine-grained rubrics.
- Performance can be inflated if evaluation samples appear in training data.
Core Entities
Models
- GPT-4-0613
- GPT-4-0314
- GPT-3.5-turbo-0301
- GPT-4-turbo (judge)
- ERNIE-Bot-0615
- ERNIE-Bot-turbo-0725
- LLaMA-7b
- Vicuna-7b
- Baichuan2-7b-Base
- Baichuan2-7b-Chat
- Baichuan2-13b-Chat
- ChatGLM-6b
- ChatGLM2-6b
Metrics
- scoring rate
- converted total score
- Spearman correlation
- Kendall-Tau correlation
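As a rough illustration of the scoring rate metric, the sketch below assumes it means points earned divided by points available, aggregated per subject. That definition is an assumption made for this example, and the records are made-up numbers rather than benchmark results. The converted total score is not reproduced here because the conversion weights are not given in this summary.

```python
from collections import defaultdict


def scoring_rates(records):
    """Return {subject: points earned / points available}.

    Each record is a (subject, points_earned, points_available) tuple.
    Subjects with no available points are skipped to avoid dividing by zero.
    """
    earned = defaultdict(float)
    available = defaultdict(float)
    for subject, got, maximum in records:
        earned[subject] += got
        available[subject] += maximum
    return {s: earned[s] / available[s] for s in available if available[s] > 0}


# Made-up illustrative records, not data from the benchmark.
records = [
    ("math", 2, 5),
    ("math", 0, 5),
    ("history", 4, 4),
    ("history", 2, 4),
]
rates = scoring_rates(records)

# Subjects ordered from weakest to strongest, to surface subject gaps.
weakest_first = sorted(rates, key=rates.get)
```

Summing points before dividing weights each question by its point value, so a high-value subjective item counts for more than a single multiple-choice item.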
Datasets
- GAOKAO-Bench (2010-2022)
- GAOKAO-Bench-2023
Benchmarks
- GAOKAO-Bench

