GAOKAO-Bench: using China’s college entrance exam (2010–2022) to test LLMs on real exam questions

Overview

Decision Snapshot: Ready For Pilot

The benchmark and experiments are practical and reproducible; human scoring gives solid evidence, but broad adoption needs more error analysis and more models graded by humans.

Citations: 18

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, Xipeng Qiu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

GAOKAO-Bench exposes realistic task gaps: LLMs are good at knowledge and language tasks but weaker at multi-step math and physics. Use this to choose models, design human-in-the-loop checks, and pilot automated grading.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

The authors build GAOKAO-Bench, a dataset of Chinese GAOKAO exam questions (2010–2022) that mixes objective and subjective items. They evaluate a range of LLMs (GPT-4, GPT-3.5, ERNIE-Bot, Baichuan, LLaMA, ChatGLM) in zero-shot mode and use human scoring for subjective items. Main findings: GPT-4 scores well (converted totals above 400); models do better in humanities than in sciences; subject gaps are large, with math and physics the weakest; and GPT-4-turbo, when given marking criteria, can grade subjective answers with high correlation to teachers' scores.

Problem Statement

Existing LLM benchmarks often use only objective questions or synthetic tasks and miss real-world exam-style subjective items. The field needs a human-aligned, exam-style test suite that measures generative answers and grading ability, and that can expose subject-specific strengths and weaknesses.

Main Contribution

GAOKAO-Bench dataset: national GAOKAO questions (2010–2022), 9 subjects, 2811 questions (1781 objective, 1030 subjective).

Zero-shot evaluation protocol and human scoring for subjective questions; public prompting examples and marking criteria.

Key Findings

GPT-4 attains strong exam performance but below full marks.

Numbers: Converted totals: sciences 434, humanities 480 (GPT-4-0613).

Practical Use: GPT-4 can pass many exam-style tasks but still leaves gaps; use it for knowledge tasks but validate numerical/reasoning outputs.

Evidence Ref: Table 4

Objective and subjective scoring rates differ and vary by subject.

Numbers: Objective overall 71.6% vs subjective overall 50.8% (GPT-4-0613).

Practical Use: Report both objective and subjective metrics; expect lower raw scores on free-form questions and plan human review for high-stakes cases.

Evidence Ref: Table 1 & Table 2
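
To report objective and subjective results side by side, per-subject scoring rates can be aggregated from graded records. The sketch below assumes a scoring rate is points earned divided by points available; the record fields (`subject`, `kind`, `score`, `max_score`) and the sample values are hypothetical and are not taken from the GAOKAO-Bench files.

```python
from collections import defaultdict


def scoring_rates(records):
    """Return {(subject, kind): points earned / points available}.

    Assumed definition of scoring rate; field names are hypothetical.
    """
    earned = defaultdict(float)
    available = defaultdict(float)
    for r in records:
        key = (r["subject"], r["kind"])
        earned[key] += r["score"]
        available[key] += r["max_score"]
    # Skip groups with no available points to avoid division by zero.
    return {k: earned[k] / available[k] for k in available if available[k] > 0}


# Made-up records for illustration only; not GAOKAO-Bench data.
records = [
    {"subject": "math", "kind": "objective", "score": 5, "max_score": 5},
    {"subject": "math", "kind": "objective", "score": 0, "max_score": 5},
    {"subject": "history", "kind": "subjective", "score": 9, "max_score": 12},
    {"subject": "history", "kind": "subjective", "score": 6, "max_score": 8},
]

rates = scoring_rates(records)
for (subject, kind), rate in sorted(rates.items()):
    print(f"{subject:8s} {kind:10s} {rate:.1%}")
```

Keeping the question type in the grouping key makes it harder to blend objective and subjective results into one misleading number.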

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Objective scoring rate (GPT-4-0613)	71.6% overall	—	—	GAOKAO-Bench objective (Table 1)	Table 1: GPT-4-0613 objective overall 71.6%	Table 1
Subjective scoring rate (GPT-4-0613, human-scored)	50.8% overall	—	—	GAOKAO-Bench subjective (Table 2)	Table 2: GPT-4-0613 subjective overall 50.8%	Table 2

What To Try In 7 Days

Run your model on GAOKAO-Bench zero-shot to see subject gaps.

Test GPT-4-turbo as an automated grader using provided marking criteria and compare to a small set of teacher scores.

Prioritize fine-tuning or tool-use for math/physics before deploying for calculation-heavy tasks.
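
For the grader comparison in the list above, agreement between an automated judge and teachers can be checked with the two rank correlations this summary lists as metrics. This is a minimal, dependency-free sketch, not the authors' evaluation script: the score lists are made up, and the mean signed difference is an extra check added here to show whether the judge tends to over- or under-score.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b, which adjusts for ties in either list."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from every count
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt(
        (pairs + tied_x_only) * (pairs + tied_y_only)
    )


# Made-up scores for five answers; replace with real teacher and judge scores.
teacher = [10, 7, 4, 9, 2]
judge = [9, 8, 5, 7, 1]

print(f"Spearman:    {spearman(teacher, judge):.2f}")
print(f"Kendall tau: {kendall_tau_b(teacher, judge):.2f}")
# Positive values mean the judge scores higher than teachers on average.
bias = sum(j - t for j, t in zip(judge, teacher)) / len(teacher)
print(f"Mean signed difference (judge - teacher): {bias:+.2f}")
```

Rank correlations show whether the judge orders answers the way teachers do, but not whether its scores are on the same scale, which is why the signed difference is reported alongside them.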

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/OpenLMLab/GAOKAO-Bench
https://github.com/OpenLMLab/GAOKAO-Bench-2023

Data URLs

https://github.com/OpenLMLab/GAOKAO-Bench
https://github.com/OpenLMLab/GAOKAO-Bench-2023

Risks & Boundaries

Limitations

No deep error analysis of hallucinations or reasoning errors.

Human scoring was costly; not all models were evaluated with human grading.

When Not To Use

As the only benchmark for math or physics reasoning without additional tool support.

To fully replace human graders in humanities tasks without spot checks.

Failure Modes

Strong subject bias: good at language/knowledge but weak at multi-step math.

Automated judge may over- or under-score humanities without fine-grained rubrics.

Core Entities

Models

GPT-4-0613, GPT-4-0314, GPT-3.5-turbo-0301, GPT-4-turbo (judge), ERNIE-Bot-0615, ERNIE-Bot-turbo-0725, LLaMA-7b, Vicuna-7b, Baichuan2-7b-Base, Baichuan2-7b-Chat, Baichuan2-13b-Chat, ChatGLM-6b, ChatGLM2-6b

Metrics

scoring rate, converted total score, Spearman correlation, Kendall-Tau correlation

Datasets

GAOKAO-Bench (2010-2022), GAOKAO-Bench-2023

Benchmarks

GAOKAO-Bench
