Xiezhi: 249k-question, auto-updating benchmark across 516 disciplines with a 50-option evaluation protocol

June 9, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

Xiezhi is ready for benchmarking research and internal model comparisons; teams should validate curated subsets for critical applications because auto-annotated parts can contain label noise.

Citations: 9

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Yes

License: CC BY-SA 4.0

At A Glance

Cost impact: 30%

Production readiness: 70%

Novelty: 60%

Authors

Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang, Jianchen Wang, Yixin Zhu, Sihang Jiang, Zhuozhi Xiong, Zihan Li, Weijie Wu, Qianyu He, Rui Xu, Wenhao Huang, Jingping Liu, Zili Wang, Shusen Wang, Weiguo Zheng, Hongwei Feng, Yanghua Xiao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Xiezhi gives a broad, hard-to-game way to measure domain knowledge across many fields; it helps product and engineering teams spot domain blind spots and track small improvements in LLMs over time.

Summary TLDR

Xiezhi is a large, auto-updating benchmark for domain knowledge in LLMs. It contains 249,587 multiple-choice questions across 516 disciplines (13 top-level categories). The authors provide two curated subsets (Xiezhi-Specialty: 14,041 questions; Xiezhi-Interdiscipline: 10,746 questions) and an annotated meta set (20,124 questions). The evaluation protocol differs from earlier benchmarks in two ways: each question has 50 answer options, and models are scored by ranking those options by generative probability, reported as Mean Reciprocal Rank (MRR). The repo and code are public. The paper evaluates 47 LLMs and finds that GPT-4 ranks first on Xiezhi (MRR≈0.43), and that LLMs beat average human performance in some STEM and art domains but lag in economics, law, pedagogy, literature, history, and management. The dataset,

Problem Statement

Existing knowledge benchmarks are too small, too narrow, or quickly become part of LLM training data. They also use short multiple-choice formats (4 options) and answer-extraction prompts that favor models trained on MCQ-style data. We need a broader, fresher, and fairer way to measure domain knowledge and to reveal fine-grained differences among LLMs.

Main Contribution

Xiezhi-All: 249,587 multi-choice questions spanning 516 disciplines (13 categories).

Auto-updating pipeline: human-verified Xiezhi-Meta (20,124 Qs) trains an annotator to label ~170k exam Qs and ~80k generated Qs.

Key Findings

Xiezhi is very large and multi-disciplinary.

Numbers: 249,587 questions; 516 disciplines; 13 categories

Practical Use: Use Xiezhi when you need broad domain coverage and many test items to detect small capability changes.

Evidence Ref: Abstract; Dataset Construction

Moving from 4 to 50 options lowers the random-guess baseline and exposes model gaps.

Numbers: Random-guess MRR = 0.089 with 50 options

Practical Use: Evaluate models with many distractors to avoid overestimating skill from easy multiple-choice formats.

Evidence Ref: Table 1 (Random-Guess rows)
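
The effect of the option count on the random-guess baseline can be checked with a short calculation. The sketch below is not from the paper: it assumes a random guesser places the correct option at a uniformly random rank, so the expected reciprocal rank is the N-th harmonic number divided by N. The function name `expected_random_mrr` is hypothetical.

```python
from fractions import Fraction


def expected_random_mrr(num_options: int) -> float:
    """Expected reciprocal rank when the correct option sits at a uniformly random rank."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    # E[1/rank] = (1/N) * sum_{r=1..N} 1/r  (N-th harmonic number divided by N).
    harmonic = sum(Fraction(1, r) for r in range(1, num_options + 1))
    return float(harmonic / num_options)


for n in (4, 50):
    print(f"{n} options: expected random-guess MRR = {expected_random_mrr(n):.3f}")
```

Under this assumption the baseline is about 0.521 with 4 options and about 0.090 with 50, which is in line with the 0.089 random-guess figure reported above.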

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Dataset size | 249,587 questions; 516 disciplines | - | - | Xiezhi-All | Abstract; Dataset Construction | Abstract; Dataset Construction
Random-guess MRR (50 options) | 0.089 | 4-option random guessing (higher effective baseline) | - | Xiezhi evaluation setting | Table 1 (Random-Guess rows) | Table 1

What To Try In 7 Days

Run Xiezhi-Specialty on your model to profile domain strengths and weaknesses.

Switch MCQ evaluation to probability-ranking (MRR) for generative models.

Add more distractors to key MCQ tests to reduce luck-driven gains (try 20–50 options).
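
The last two steps can be prototyped without any model-specific code. The sketch below is illustrative and not the authors' implementation: `pad_options`, `gold_rank`, and `mrr_and_hit_at_k` are hypothetical helpers, the distractor pool is assumed to be answers borrowed from other questions, and the per-option scores stand in for whatever log-probabilities your model produces. Ties are broken pessimistically, which is an assumption, not a rule stated in the source.

```python
import random
from typing import Sequence


def pad_options(gold: str, distractor_pool: Sequence[str], num_options: int,
                rng: random.Random) -> tuple[list[str], int]:
    """Build a shuffled option list of size num_options; return it with the gold index."""
    candidates = sorted({d for d in distractor_pool if d != gold})
    if len(candidates) < num_options - 1:
        raise ValueError("not enough unique distractors in the pool")
    options = rng.sample(candidates, num_options - 1) + [gold]
    rng.shuffle(options)
    return options, options.index(gold)


def gold_rank(scores: Sequence[float], gold_index: int) -> int:
    """Rank of the gold option (1 = best). Options tied with gold count as ranked above it."""
    gold_score = scores[gold_index]
    return 1 + sum(1 for i, s in enumerate(scores)
                   if i != gold_index and s >= gold_score)


def mrr_and_hit_at_k(items: Sequence[tuple[Sequence[float], int]],
                     k: int = 1) -> tuple[float, float]:
    """Mean Reciprocal Rank and Hit@k over (scores, gold_index) pairs."""
    ranks = [gold_rank(scores, gold) for scores, gold in items]
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hit = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hit


# Toy scores (made-up log-probabilities): gold ranks are 1, 2, and 3.
toy_items = [
    ([-1.2, -0.3, -2.5], 1),
    ([-0.9, -1.5, -0.4], 0),
    ([-3.0, -2.0, -1.0], 0),
]
mrr, hit1 = mrr_and_hit_at_k(toy_items, k=1)
print(f"MRR = {mrr:.3f}, Hit@1 = {hit1:.3f}")
```

On the toy data the gold options land at ranks 1, 2, and 3, giving MRR = (1 + 1/2 + 1/3) / 3 ≈ 0.611 and Hit@1 ≈ 0.333. Passing a seeded `random.Random` to `pad_options` keeps the padded option sets reproducible across runs.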

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: CC BY-SA 4.0

Risks & Boundaries

Limitations

Strong Chinese-source bias: many questions originate from Chinese exams and surveys.

The English translation was produced with Google Translate plus manual edits, so domain terms may be rendered imperfectly.

When Not To Use

Do not use Xiezhi as the sole metric for conversational or instruction-following ability.

Avoid using Xiezhi-All (auto-annotated) as definitive ground truth without manual checks.

Failure Modes

Auto-annotation misses secondary labels, producing incomplete discipline tags.

Generative probability ranking can be expensive to compute at scale for large option sets.
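
To get a rough sense of that cost, the arithmetic below counts option-scoring passes for the dataset sizes quoted in this summary. It is a back-of-the-envelope sketch that assumes one scoring pass per (question, option) pair with no batching or prefix caching; real pipelines may do better. `scoring_calls` is a hypothetical helper.

```python
# Question counts as quoted in this summary.
SUBSET_SIZES = {
    "Xiezhi-All": 249_587,
    "Xiezhi-Specialty": 14_041,
    "Xiezhi-Interdiscipline": 10_746,
}


def scoring_calls(num_questions: int, num_options: int) -> int:
    """Scoring passes needed if every option of every question is scored separately."""
    return num_questions * num_options


for name, size in SUBSET_SIZES.items():
    print(f"{name}: {scoring_calls(size, 4):,} passes at 4 options, "
          f"{scoring_calls(size, 50):,} at 50 options")
```

Under this assumption, Xiezhi-All needs 12,479,350 scoring passes at 50 options versus 998,348 at 4, while Xiezhi-Specialty needs 702,050, which is one reason to start with the curated subsets.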

Core Entities

Models

GPT-4, ChatGPT, LLaMA, BLOOM, BLOOMZ, GPT-NeoX, Pythia, Vicuna, Baize, BELLE, DoctorGLM

Metrics

MRR, Hit@1, Hit@4, Accuracy, Mean Rank

Datasets

Xiezhi-All, Xiezhi-Meta, Xiezhi-Train, Xiezhi-Specialty, Xiezhi-Interdiscipline, MMLU, C-Eval, M3KE

Benchmarks

Xiezhi, MMLU, C-Eval, M3KE, BIG-bench

Context Entities

Models

GPT-3.5, Falcon, StableLM, MOSS, H2O-GPT

Datasets

Chinese Graduate Entrance Exams, Open academic surveys/reviews (question generation source)