GAOKAO-Bench: using China’s college exam (2010–2022) to test LLMs on real exam questions

May 21, 2023 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

18

Authors

Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, Xipeng Qiu


Why It Matters For Business

GAOKAO-Bench exposes gaps on realistic tasks: LLMs handle knowledge and language tasks well but are weaker at multi-step math and physics. Use these results to choose models, design human-in-the-loop checks, and pilot automated grading.

Summary TLDR

The authors build GAOKAO-Bench, a dataset of Chinese GAOKAO exam questions (2010–2022) that mixes objective and subjective items. They evaluate many LLMs (GPT-4, GPT-3.5, ERNIE-Bot, Baichuan, LLaMA, ChatGLM) in zero-shot mode and use human scoring for subjective items. Main findings: GPT-4 scores well (converted totals above 400); models do better in humanities than in sciences; subject gaps are large, with math and physics the weakest; and GPT-4-turbo, when given marking criteria, grades subjective answers with high correlation to teachers.

Problem Statement

Existing LLM benchmarks often use only objective questions or synthetic tasks and miss real-world exam-style subjective items. The field needs a human-aligned, exam-style test suite that measures generative answers and grading ability, and that can expose subject-specific strengths and weaknesses.

Main Contribution

GAOKAO-Bench dataset: national GAOKAO questions (2010–2022), 9 subjects, 2811 questions (1781 objective, 1030 subjective).

Zero-shot evaluation protocol and human scoring for subjective questions; public prompting examples and marking criteria.

LLM-as-a-Judge study: using GPT-4-turbo with teacher marking criteria to grade subjective answers and measuring correlation with human graders.

Released resources and a 2023 supplement (GAOKAO-Bench-2023) to reduce dataset leakage.

Key Findings

GPT-4 attains strong exam performance but below full marks.

Numbers: Converted totals: sciences 434, humanities 480 (GPT-4-0613).

Objective and subjective scoring rates differ and vary by subject.

Numbers: Objective overall 71.6% vs subjective overall 50.8% (GPT-4-0613).

Large subject gaps: strong in language/biology/geography, weak in math/physics.

Numbers: Subjective and objective scoring rates above 70% in English, biology, and geography; below 40% in math and physics (GPT-4).

LLM-as-a-Judge aligns well with human graders when given marking criteria.

Numbers: Question-level Spearman ρ ≈ 0.85, Kendall τ ≈ 0.71 (model vs human).

Automated grading is closer to human scores on sciences than humanities.

Numbers: Judge deviation <2% (sciences) vs ≈5% (humanities) of total score.
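The judge deviation above is stated as a share of the total score. A minimal sketch of that arithmetic, assuming the deviation is the absolute gap between the judge's total and the human total divided by the full-mark total. The function name, the sample totals, and the full-mark value of 750 are illustrative assumptions, not figures from the paper.

```python
def judge_deviation_pct(human_total, judge_total, full_marks):
    """Absolute gap between judge and human totals, as a percent of full marks."""
    return abs(judge_total - human_total) / full_marks * 100

# Made-up totals for illustration only; full_marks=750 is an assumption of this sketch.
deviation = judge_deviation_pct(human_total=430.0, judge_total=445.0, full_marks=750.0)
print(f"Judge deviation: {deviation:.1f}% of full marks")
```

A signed version (dropping `abs`) would also show whether the judge tends to over-score or under-score, which may matter more for humanities.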

Results

Objective scoring rate (GPT-4-0613)

Value: 71.6% overall

Subjective scoring rate (GPT-4-0613, human-scored)

Value: 50.8% overall
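The two scoring rates above can be read as points earned divided by points available, computed separately for objective and subjective items. A minimal sketch under that assumption; the record fields `kind`, `earned`, and `full` are hypothetical, and the sample data is made up rather than taken from the paper.

```python
from collections import defaultdict

def scoring_rates(items):
    """Return {kind: earned points / available points} over graded items."""
    earned = defaultdict(float)
    full = defaultdict(float)
    for item in items:
        earned[item["kind"]] += item["earned"]
        full[item["kind"]] += item["full"]
    return {kind: earned[kind] / full[kind] for kind in full}

# Made-up illustrative records, not results from the paper.
graded = [
    {"kind": "objective", "earned": 5, "full": 5},
    {"kind": "objective", "earned": 0, "full": 5},
    {"kind": "subjective", "earned": 6, "full": 12},
]
rates = scoring_rates(graded)
for kind, rate in rates.items():
    print(f"{kind}: {rate:.1%}")
```

Grouping by a subject field instead of `kind` would give the per-subject breakdown used to spot gaps such as math and physics.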

Converted total scores (GPT-4-0613, human)

Value: Sciences 434, Humanities 480

LLM-as-a-Judge correlation (GPT-4-turbo vs human)

Value: Spearman ρ = 0.854, Kendall τ = 0.71
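For readers who want to run the same kind of agreement check on their own grading data, here is a dependency-free sketch of the two rank correlations. It uses the standard definitions (Spearman as Pearson correlation on average ranks, Kendall in its tau-b form); whether the paper used exactly these variants is an assumption, and the score lists are made up.

```python
from itertools import combinations
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall tau-b: concordant minus discordant pairs, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + ties_x) * (pairs + ties_y))

# Made-up per-question scores for illustration only.
human_scores = [10, 8, 6, 4, 2]
judge_scores = [9, 8, 5, 6, 1]
rho = spearman_rho(human_scores, judge_scores)
tau = kendall_tau_b(human_scores, judge_scores)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```

Both functions fail with a division by zero if either list is constant, so a real pipeline should guard against questions where every answer received the same score.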

Stability across years

Value: Small yearly changes; GPT-4 Δ ≈ -0.6% on 2023 set

Who Should Care

What To Try In 7 Days

Run your model on GAOKAO-Bench zero-shot to see subject gaps.

Test GPT-4-turbo as an automated grader using provided marking criteria and compare to a small set of teacher scores.

Prioritize fine-tuning or tool-use for math/physics before deploying for calculation-heavy tasks.
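The automated-grader trial in the list above could start from something like the following sketch. The prompt wording, the `Score:` reply format, and the `call_model` callable are all hypothetical stand-ins; they are not the paper's prompts or any real client API. A stub model is used so the sketch runs offline.

```python
import re

def build_grading_prompt(question, criteria, answer, full_marks):
    """Assemble a grading prompt that includes the teacher's marking criteria."""
    return (
        "You are grading an exam answer. Follow the marking criteria strictly.\n"
        f"Question:\n{question}\n\n"
        f"Marking criteria:\n{criteria}\n\n"
        f"Student answer:\n{answer}\n\n"
        f"Reply with a line of the form 'Score: <number>' between 0 and {full_marks}."
    )

def parse_score(reply, full_marks):
    """Extract the score from the reply; return None if no score line is found."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        return None
    # Cap at full marks in case the model overshoots.
    return min(float(match.group(1)), float(full_marks))

def grade(call_model, question, criteria, answer, full_marks):
    """call_model is a hypothetical callable: prompt string in, reply string out."""
    reply = call_model(build_grading_prompt(question, criteria, answer, full_marks))
    return parse_score(reply, full_marks)

def fake_model(prompt):
    """Stub standing in for a real model client."""
    return "The answer covers two of the three required points.\nScore: 8"

score = grade(
    fake_model,
    question="Placeholder question text",
    criteria="Three points, 4 marks each",
    answer="Placeholder student answer",
    full_marks=12,
)
print(score)
```

Returning `None` for unparseable replies keeps failed gradings visible, so they can be routed to a teacher instead of being silently counted as zero.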

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • No deep error analysis of hallucinations or reasoning errors.
  • Human scoring was costly; not all models were evaluated with human grading.
  • Possible dataset leakage into model training is acknowledged but not fully eliminated.

When Not To Use

  • As the only benchmark for math or physics reasoning without additional tool support.
  • To fully replace human graders in humanities tasks without spot checks.
  • To compare models that may have been trained on leaked GAOKAO data, unless leakage has been checked.

Failure Modes

  • Strong subject bias: good at language/knowledge but weak at multi-step math.
  • Automated judge may over- or under-score humanities without fine-grained rubrics.
  • Performance can be inflated if evaluation samples appear in training data.
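One rough way to probe the leakage risk noted above, when a sample of the training corpus is available, is to measure how many character n-grams of a question also appear in that corpus. This is a generic heuristic sketched under that assumption, not the authors' method; the function names, the default `n=8`, and the sample strings are illustrative.

```python
def char_ngrams(text, n):
    """Set of character n-grams after removing all whitespace."""
    text = "".join(text.split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_ratio(question, corpus_text, n=8):
    """Fraction of the question's n-grams that also occur in the corpus text."""
    question_grams = char_ngrams(question, n)
    if not question_grams:
        return 0.0  # question shorter than n characters
    shared = question_grams & char_ngrams(corpus_text, n)
    return len(shared) / len(question_grams)

# Made-up strings for illustration only.
leaked = overlap_ratio("abcdefghij", "xx abcdefghij xx")
clean = overlap_ratio("abcdefghij", "zzzzzzzzzzzz")
print(leaked, clean)
```

Character n-grams are used rather than word n-grams because Chinese text is not whitespace-delimited. A high ratio flags a question for manual review; a low ratio does not prove the question is unseen.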

Core Entities

Models

  • GPT-4-0613
  • GPT-4-0314
  • GPT-3.5-turbo-0301
  • GPT-4-turbo (judge)
  • ERNIE-Bot-0615
  • ERNIE-Bot-turbo-0725
  • LLaMA-7b
  • Vicuna-7b
  • Baichuan2-7b-Base
  • Baichuan2-7b-Chat
  • Baichuan2-13b-Chat
  • ChatGLM-6b
  • ChatGLM2-6b

Metrics

  • scoring rate
  • converted total score
  • Spearman correlation
  • Kendall-Tau correlation

Datasets

  • GAOKAO-Bench (2010-2022)
  • GAOKAO-Bench-2023

Benchmarks

  • GAOKAO-Bench