Xiezhi: 249k-question, auto-updating benchmark across 516 disciplines with a 50-option evaluation protocol

June 9, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.3

Citation Count

9

Authors

Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang, Jianchen Wang, Yixin Zhu, Sihang Jiang, Zhuozhi Xiong, Zihan Li, Weijie Wu, Qianyu He, Rui Xu, Wenhao Huang, Jingping Liu, Zili Wang, Shusen Wang, Weiguo Zheng, Hongwei Feng, Yanghua Xiao

Why It Matters For Business

Xiezhi gives a broad, hard-to-game way to measure domain knowledge across many fields; it helps product and engineering teams spot domain blind spots and track small improvements in LLMs over time.

Summary TLDR

Xiezhi is a large, auto-updating benchmark for domain knowledge in LLMs. It contains 249,587 multiple-choice questions across 516 disciplines (13 top-level categories). The authors provide two curated subsets (Xiezhi-Specialty: 14,041 questions; Xiezhi-Interdiscipline: 10,746 questions) and an annotated meta set (Xiezhi-Meta: 20,124 questions). The evaluation protocol is new in two respects: each question has 50 answer options, and options are ranked by generative probability and scored with Mean Reciprocal Rank (MRR). The paper evaluates 47 LLMs and finds that GPT-4 tops Xiezhi (MRR ≈ 0.43), while LLMs beat average human performance in some STEM and art domains but lag in economics, law, pedagogy, literature, history, and management. The repo and code are public.

Problem Statement

Existing knowledge benchmarks are too small, too narrow, or quickly become part of LLM training data. They also use short multiple-choice formats (4 options) and answer-extraction prompts that favor models trained on MCQ-style data. We need a broader, fresher, and fairer way to measure domain knowledge and to reveal fine-grained differences among LLMs.

Main Contribution

Xiezhi-All: 249,587 multiple-choice questions spanning 516 disciplines (13 categories).

Auto-updating pipeline: human-verified Xiezhi-Meta (20,124 Qs) trains an annotator to label ~170k exam Qs and ~80k generated Qs.

New evaluation protocol: 50 options per question, with options ranked by generative probability and MRR as the primary metric.

Two curated English/Chinese subsets: Xiezhi-Specialty (14,041 Qs) and Xiezhi-Interdiscipline (10,746 Qs).

Benchmarking study: evaluation of 47 public and API LLMs; code and data released on GitHub.
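
To make the ranking-based protocol concrete, here is a minimal sketch of how MRR can be computed once a model has assigned a score to every option. It assumes each option already has a log-probability from the model; the function names, the tie-breaking rule, and the toy scores are illustrative assumptions, not taken from the Xiezhi code.

```python
from typing import Sequence


def rank_of_correct(option_scores: Sequence[float], correct_index: int) -> int:
    """Return the 1-based rank of the correct option (higher score = better).

    Ties are broken pessimistically: options with an equal score are counted
    as ranked ahead of the correct one. This rule is an assumption.
    """
    target = option_scores[correct_index]
    better_or_tied = sum(
        1 for i, s in enumerate(option_scores)
        if i != correct_index and s >= target
    )
    return better_or_tied + 1


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all questions."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy data: hypothetical log-probabilities for 5 options (the protocol uses 50).
scores_q1 = [-4.2, -1.3, -3.0, -2.8, -5.1]  # correct option is index 1
scores_q2 = [-2.0, -3.5, -1.1, -2.4, -4.0]  # correct option is index 3

ranks = [rank_of_correct(scores_q1, 1), rank_of_correct(scores_q2, 3)]
mrr = mean_reciprocal_rank(ranks)  # (1/1 + 1/3) / 2
```

Because the score depends only on where the correct option lands in the ranking, no answer-extraction prompt is needed, which is the property the protocol relies on.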

Key Findings

Xiezhi is very large and multi-disciplinary.

Numbers: 249,587 questions; 516 disciplines; 13 categories

Moving from 4 to 50 options lowers the random-guess baseline and exposes model gaps.

Numbers: Random-guess MRR = 0.089 with 50 options

Ranking options by generative probability gives a more reliable score for generative models.

Numbers: Primary metric used: MRR; GPT-4 overall MRR ≈ 0.431 on Xiezhi

LLMs outperform average human practitioners in some domains but underperform in others.

Numbers: Authors report LLMs exceed average humans in science, engineering, agronomy, medicine, and art; they lag in economics, law, pedagogy, literature, history, and management

Auto-annotation reduces human effort but misses fine-grained labels.

Numbers: Annotation model shows a low Wrong Rate but a higher Missing Rate for fine-grained labels (see Auto Annotator results)

Results

Dataset size

Value: 249,587 questions; 516 disciplines

Random-guess MRR (50 options)

Value: 0.089

Baseline: random guessing over 4 options, which gives a higher chance-level score

Top model overall MRR

Value: GPT-4 MRR ≈ 0.431

Baseline: Random-guess MRR 0.089

Human benchmarks (example: Science)

Value: Human top = 0.926; Human average = 0.394
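
The random-guess baseline can be checked with a short calculation. If the correct option is equally likely to land at any rank from 1 to N, the expected MRR is the N-th harmonic number divided by N. The sketch below assumes that uniform-rank model; the function name is illustrative. For 50 options it gives roughly 0.090, in line with the reported 0.089 (the small gap is presumably rounding), and for 4 options roughly 0.52.

```python
def random_guess_mrr(num_options: int) -> float:
    """Expected MRR when the correct option's rank is uniform on 1..num_options."""
    harmonic = sum(1.0 / k for k in range(1, num_options + 1))
    return harmonic / num_options


mrr_4 = random_guess_mrr(4)    # chance level with 4 options
mrr_50 = random_guess_mrr(50)  # chance level with 50 options
```

Under this model the chance-level score drops by a factor of nearly six when moving from 4 to 50 options, which leaves far more headroom between guessing and real knowledge.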

Who Should Care

What To Try In 7 Days

Run Xiezhi-Specialty on your model to profile domain strengths and weaknesses.

Switch MCQ evaluation to probability-ranking (MRR) for generative models.

Add more distractors to key MCQ tests to reduce luck-driven gains (try 20–50 options).
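
One way to try the third suggestion is to pad each question with extra distractors drawn from a pool of plausible wrong answers. The sketch below is one possible approach under stated assumptions: the pool is hypothetical (here, answers to other questions in the same test set), and the helper name `expand_options` is illustrative rather than part of the Xiezhi code.

```python
import random
from typing import List, Tuple


def expand_options(
    correct: str,
    distractor_pool: List[str],
    num_options: int,
    rng: random.Random,
) -> Tuple[List[str], int]:
    """Build a shuffled option list of size num_options that contains `correct`.

    Returns the options and the index of the correct answer.
    """
    # Deduplicate the pool and exclude the correct answer from the candidates.
    candidates = sorted(set(distractor_pool) - {correct})
    if len(candidates) < num_options - 1:
        raise ValueError("not enough distinct distractors in the pool")
    options = rng.sample(candidates, num_options - 1) + [correct]
    rng.shuffle(options)
    return options, options.index(correct)


# Hypothetical pool: answers to other questions in the same test set.
pool = [f"answer-{i}" for i in range(200)]
options, correct_index = expand_options("answer-7", pool, 50, random.Random(0))
```

Passing a seeded `random.Random` keeps the option order reproducible across runs, so score changes reflect the model rather than a reshuffle.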

Reproducibility

License

  • CC BY-SA 4.0

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Strong Chinese-source bias: many questions originate from Chinese exams and surveys.
  • English translation was done with Google Translate plus manual edits; domain terms may be translated imperfectly.
  • Auto-annotated portions (Xiezhi-All) contain label noise and missing fine-grained labels.
  • Human baselines are derived from public exam averages and are noisy proxies for real practitioner ability.
  • Sensitive topics and deeply Chinese cultural items were removed, reducing some domain coverage.

When Not To Use

  • Do not use Xiezhi as the sole metric for conversational or instruction-following ability.
  • Avoid using Xiezhi-All (auto-annotated) as definitive ground truth without manual checks.
  • Not ideal for culture-specific knowledge evaluation outside Chinese/English without further localization.

Failure Modes

  • Auto-annotation misses secondary labels, producing incomplete discipline tags.
  • Generative probability ranking can be expensive to compute at scale for large option sets.
  • High-option MCQs may over-penalize models that are good at open-ended reasoning but poor at ranking many distractors.
  • Translation artifacts can cause English-evaluated models to misinterpret questions.
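
To put the cost of probability ranking in perspective, the following back-of-the-envelope sketch counts how many (question, option) pairs a model must score per dataset. The question counts come from the summary above. The assumption that each option needs its own scored sequence (no prefix caching or batching tricks) is a simplification, not a statement about the authors' implementation.

```python
# Question counts are taken from the summary; 50 options is the protocol default.
# One scored sequence per (question, option) pair is a simplifying assumption.
DATASET_SIZES = {
    "Xiezhi-Specialty": 14_041,
    "Xiezhi-Interdiscipline": 10_746,
    "Xiezhi-All": 249_587,
}


def scored_sequences(num_questions: int, num_options: int = 50) -> int:
    """Number of (question, option) pairs a model has to score."""
    return num_questions * num_options


costs = {name: scored_sequences(n) for name, n in DATASET_SIZES.items()}
for name, total in costs.items():
    print(f"{name}: {total:,} scored sequences")
```

Under this assumption the curated subsets need several hundred thousand scoring calls each, while the full set needs over twelve million, which is one reason to start with Xiezhi-Specialty or Xiezhi-Interdiscipline.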

Core Entities

Models

  • GPT-4
  • ChatGPT
  • LLaMA
  • BLOOM
  • BLOOMZ
  • GPT-NeoX
  • Pythia
  • Vicuna
  • Baize
  • BELLE
  • DoctorGLM

Metrics

  • MRR
  • Hit@1
  • Hit@4
  • Accuracy
  • Mean Rank

Datasets

  • Xiezhi-All
  • Xiezhi-Meta
  • Xiezhi-Train
  • Xiezhi-Specialty
  • Xiezhi-Interdiscipline
  • MMLU
  • C-Eval
  • M3KE

Benchmarks

  • Xiezhi
  • MMLU
  • C-Eval
  • M3KE
  • BIG-bench

Context Entities

Models

  • GPT-3.5
  • Falcon
  • StableLM
  • MOSS
  • H2O-GPT

Datasets

  • Chinese Graduate Entrance Exams
  • Open academic surveys/reviews (question generation source)