GAOKAO-Bench: using China’s college exam (2010–2022) to test LLMs on real exam questions

May 21, 2023, 6 min

Overview

Decision Snapshot: Ready For Pilot

The benchmark and experiments are practical and reproducible; human scoring gives solid evidence, but broad adoption needs more error analysis and more models graded by humans.

Citations: 18

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, Xipeng Qiu

Why It Matters For Business

GAOKAO-Bench exposes realistic task gaps: LLMs are good at knowledge and language tasks but weaker at multi-step math and physics. Use this to choose models, design human-in-the-loop checks, and pilot automated grading.

Summary TLDR

The authors build GAOKAO-Bench, a dataset of Chinese GAOKAO exam questions (2010–2022) that mixes objective and subjective items. They evaluate several LLMs (GPT-4, GPT-3.5, ERNIE-Bot, Baichuan, LLaMA, ChatGLM) in zero-shot mode and use human scoring for subjective items. Main findings: GPT-4 scores well (converted totals above 400); models do better in humanities than in sciences, with math and physics the weakest subjects; and GPT-4-turbo can grade subjective answers with high correlation to teacher scores when given marking criteria.

Problem Statement

Existing LLM benchmarks often use only objective questions or synthetic tasks and miss real-world exam-style subjective items. The field needs a human-aligned, exam-style test suite that measures generative answers and grading ability, and that can expose subject-specific strengths and weaknesses.

Main Contribution

GAOKAO-Bench dataset: national GAOKAO questions (2010–2022), 9 subjects, 2811 questions (1781 objective, 1030 subjective).

Zero-shot evaluation protocol and human scoring for subjective questions; public prompting examples and marking criteria.

Key Findings

GPT-4 attains strong exam performance but falls short of full marks.

Numbers: Converted totals of 434 (sciences) and 480 (humanities) for GPT-4-0613.

Practical Use: GPT-4 can pass many exam-style tasks but still leaves gaps; use it for knowledge tasks but validate numerical/reasoning outputs.

Evidence Ref: Table 4
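
As a rough illustration of how a converted total might be computed, the sketch below weights each subject's scoring rate by that subject's full marks and sums the results. The per-subject full marks (three 150-point core subjects plus a 300-point science composite split 110/100/90) are an assumption about the exam layout, not something stated in this summary, and the scoring rates are made-up placeholders rather than results from the paper. The names `SCIENCE_FULL_MARKS` and `converted_total` are hypothetical.

```python
# Assumed full marks per subject for a science-track paper (not from the
# summary): three 150-point core subjects and a 300-point science composite.
SCIENCE_FULL_MARKS = {
    "chinese": 150, "math": 150, "english": 150,
    "physics": 110, "chemistry": 100, "biology": 90,
}


def converted_total(scoring_rates, full_marks):
    """Weight each subject's scoring rate (0..1) by its full marks and sum."""
    missing = set(full_marks) - set(scoring_rates)
    if missing:
        raise ValueError(f"missing scoring rates for: {sorted(missing)}")
    return sum(full_marks[s] * scoring_rates[s] for s in full_marks)


# Made-up scoring rates for illustration only; these are NOT paper results.
example_rates = {
    "chinese": 0.60, "math": 0.40, "english": 0.90,
    "physics": 0.30, "chemistry": 0.50, "biology": 0.70,
}

total = converted_total(example_rates, SCIENCE_FULL_MARKS)
print(f"{total:.1f} out of {sum(SCIENCE_FULL_MARKS.values())}")
```

Weighting by full marks keeps a low-weight subject from moving the total as much as a 150-point subject, which is why a model can post a respectable total while still being weak in math or physics.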

Objective and subjective scoring rates differ and vary by subject.

Numbers: Objective overall 71.6% vs subjective overall 50.8% (GPT-4-0613).

Practical Use: Report both objective and subjective metrics; expect lower raw scores on free-form questions and plan human review for high-stakes cases.

Evidence Ref: Table 1 & Table 2
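
A minimal sketch of reporting the two metrics side by side, assuming the scoring rate is points earned divided by points available within a group of questions. The record layout (`subject`, `type`, `score`, `max_score`) and the function name `scoring_rates` are hypothetical, and the sample records are invented for illustration.

```python
from collections import defaultdict


def scoring_rates(records, key):
    """Return {group: points earned / points available}, grouped by records[key]."""
    earned = defaultdict(float)
    possible = defaultdict(float)
    for r in records:
        earned[r[key]] += r["score"]
        possible[r[key]] += r["max_score"]
    # Skip groups with no available points to avoid dividing by zero.
    return {g: earned[g] / possible[g] for g in possible if possible[g] > 0}


# Hypothetical graded records; field names and values are illustrative only.
records = [
    {"subject": "math", "type": "objective", "score": 5, "max_score": 10},
    {"subject": "math", "type": "subjective", "score": 3, "max_score": 12},
    {"subject": "history", "type": "objective", "score": 8, "max_score": 8},
    {"subject": "history", "type": "subjective", "score": 9, "max_score": 12},
]

by_type = scoring_rates(records, "type")        # objective vs subjective
by_subject = scoring_rates(records, "subject")  # exposes subject gaps

for group, rate in sorted(by_subject.items()):
    print(f"{group}: {rate:.1%}")
```

Grouping by `type` and by `subject` from the same records makes it easy to see whether a weak overall number comes from free-form questions, from a particular subject, or from both.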

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Objective scoring rate (GPT-4-0613) | 71.6% overall | - | - | GAOKAO-Bench objective (Table 1) | Table 1: GPT-4-0613 objective overall 71.6% | Table 1
Subjective scoring rate (GPT-4-0613, human-scored) | 50.8% overall | - | - | GAOKAO-Bench subjective (Table 2) | Table 2: GPT-4-0613 subjective overall 50.8% | Table 2

What To Try In 7 Days

Run your model on GAOKAO-Bench zero-shot to see subject gaps.

Test GPT-4-turbo as an automated grader using provided marking criteria and compare to a small set of teacher scores.

Prioritize fine-tuning or tool-use for math/physics before deploying for calculation-heavy tasks.
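
For the grader comparison in the list above, one way to quantify agreement between automated and teacher scores is with the two rank correlations named under Metrics. The sketch below implements Spearman (Pearson on tie-averaged ranks) and Kendall tau-b in plain Python so it runs without extra dependencies; the score lists are invented, and all function names are hypothetical. It is not the authors' evaluation code.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson computed on tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b: pairwise agreement in ordering, adjusted for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    denom = sqrt((pairs + only_x_tied) * (pairs + only_y_tied))
    if denom == 0:
        raise ValueError("correlation is undefined for constant scores")
    return (concordant - discordant) / denom


# Invented scores for five answers; NOT data from the paper.
teacher_scores = [10, 8, 6, 4, 2]
model_scores = [9, 8, 5, 6, 1]

rho = spearman(teacher_scores, model_scores)
tau = kendall_tau_b(teacher_scores, model_scores)
print(f"Spearman: {rho:.2f}, Kendall tau-b: {tau:.2f}")
```

Both statistics only look at ordering, so a grader that is consistently one point too harsh can still correlate perfectly with teachers; checking the mean score difference alongside the correlations would catch that case.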

Risks & Boundaries

Limitations

No deep error analysis of hallucinations or reasoning errors.

Human scoring was costly; not all models were evaluated with human grading.

When Not To Use

As the only benchmark for math or physics reasoning without additional tool support.

To fully replace human graders in humanities tasks without spot checks.

Failure Modes

Strong subject bias: good at language/knowledge but weak at multi-step math.

Automated judge may over- or under-score humanities without fine-grained rubrics.

Core Entities

Models

GPT-4-0613, GPT-4-0314, GPT-3.5-turbo-0301, GPT-4-turbo (judge), ERNIE-Bot-0615, ERNIE-Bot-turbo-0725, LLaMA-7b, Vicuna-7b, Baichuan2-7b-Base, Baichuan2-7b-Chat, Baichuan2-13b-Chat, ChatGLM-6b, ChatGLM2-6b

Metrics

scoring rate, converted total score, Spearman correlation, Kendall-Tau correlation

Datasets

GAOKAO-Bench (2010-2022), GAOKAO-Bench-2023

Benchmarks

GAOKAO-Bench