Xiezhi: 249k-question, auto-updating benchmark across 516 disciplines with a 50-option evaluation protocol

Overview

Decision Snapshot: Needs Validation

Xiezhi is ready for benchmarking research and internal model comparisons; teams should validate curated subsets for critical applications because auto-annotated parts can contain label noise.

Citations: 9

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Yes

License: CC BY-SA 4.0

At A Glance

Cost impact: 30%

Production readiness: 70%

Novelty: 60%

Authors

Zhouhong Gu, Xiaoxuan Zhu, Haoning Ye, Lin Zhang, Jianchen Wang, Yixin Zhu, Sihang Jiang, Zhuozhi Xiong, Zihan Li, Weijie Wu, Qianyu He, Rui Xu, Wenhao Huang, Jingping Liu, Zili Wang, Shusen Wang, Weiguo Zheng, Hongwei Feng, Yanghua Xiao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Xiezhi gives a broad, hard-to-game way to measure domain knowledge across many fields; it helps product and engineering teams spot domain blind spots and track small improvements in LLMs over time.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist, Engineering Lead

Summary TLDR

Xiezhi is a large, auto-updating benchmark for domain knowledge in LLMs. It contains 249,587 multiple-choice questions across 516 disciplines in 13 top-level categories. The authors provide two curated subsets (Xiezhi-Specialty: 14,041 questions; Xiezhi-Interdiscipline: 10,746 questions) and a human-verified meta set (Xiezhi-Meta: 20,124 questions). The evaluation protocol differs from earlier benchmarks in two ways: each question has 50 answer options, and the options are ranked by the model's generative probability and scored with Mean Reciprocal Rank (MRR). The code and data are public. The paper evaluates 47 LLMs and finds that GPT-4 tops Xiezhi (MRR≈0.43); LLMs beat average human performance in some STEM and art domains but lag in economics, law, pedagogy, literature, history, and management.

Problem Statement

Existing knowledge benchmarks are too small, too narrow, or quickly become part of LLM training data. They also use short multiple-choice formats (4 options) and answer-extraction prompts that favor models trained on MCQ-style data. We need a broader, fresher, and fairer way to measure domain knowledge and to reveal fine-grained differences among LLMs.

Main Contribution

Xiezhi-All: 249,587 multiple-choice questions spanning 516 disciplines (13 categories).

Auto-updating pipeline: human-verified Xiezhi-Meta (20,124 Qs) trains an annotator to label ~170k exam Qs and ~80k generated Qs.

Key Findings

Xiezhi is very large and multi-disciplinary.

Numbers: 249,587 questions; 516 disciplines; 13 categories

Practical Use: Use Xiezhi when you need broad domain coverage and many test items to detect small capability changes.

Evidence Ref: Abstract; Dataset Construction

Moving from 4 to 50 options lowers the random-guess baseline and exposes model gaps.

Numbers: Random-guess MRR = 0.089 with 50 options

Practical Use: Evaluate models with many distractors to avoid overestimating skill from easy multiple-choice formats.

Evidence Ref: Table 1 (Random-Guess rows)
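
The random-guess figure can be sanity-checked analytically. The sketch below assumes a guesser that places the correct option at a uniformly random rank among n options, so the expected reciprocal rank is the n-th harmonic number divided by n. The function name is illustrative and is not taken from the Xiezhi code.

```python
from fractions import Fraction


def expected_random_mrr(num_options: int) -> float:
    """Expected reciprocal rank when the correct option lands at a uniformly random rank.

    Assumption: every rank from 1 to num_options is equally likely.
    """
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    # Sum of 1/rank over all possible ranks is the harmonic number H_n.
    harmonic = sum(Fraction(1, rank) for rank in range(1, num_options + 1))
    return float(harmonic / num_options)


for n in (4, 50):
    print(f"{n} options: expected random-guess MRR = {expected_random_mrr(n):.3f}")
```

Under this assumption the value is about 0.521 for 4 options and about 0.090 for 50 options, which is in line with the 0.089 reported in Table 1.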

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size	249,587 questions; 516 disciplines	—	—	Xiezhi-All	Abstract; Dataset Construction	Abstract; Dataset Construction
Random-guess MRR (50 options)	0.089	4-option random guessing (higher effective baseline)	—	Xiezhi evaluation setting	Table 1 (Random-Guess rows)	Table 1

What To Try In 7 Days

Run Xiezhi-Specialty on your model to profile domain strengths and weaknesses.

Switch MCQ evaluation to probability-ranking (MRR) for generative models.

Add more distractors to key MCQ tests to reduce luck-driven gains (try 20–50 options).
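
The last two steps can be prototyped without any model-specific code. The sketch below is a minimal illustration, not the authors' implementation: `pad_options`, `rank_of_correct`, and `ranking_metrics` are hypothetical helper names, and the per-option scores are assumed to be log-probabilities that your own model assigns to each option (how you obtain them is out of scope here).

```python
import random
from typing import Dict, List, Sequence


def pad_options(correct: str, distractors: Sequence[str], pool: Sequence[str],
                total: int, rng: random.Random) -> List[str]:
    """Return `total` shuffled options: the correct answer, the original
    distractors, and extra distractors sampled from `pool`."""
    options = [correct, *distractors]
    seen = set(options)
    # De-duplicate the pool and drop anything already among the options.
    candidates = [p for p in dict.fromkeys(pool) if p not in seen]
    needed = total - len(options)
    if needed < 0 or needed > len(candidates):
        raise ValueError("cannot build the requested number of options")
    options += rng.sample(candidates, needed)
    rng.shuffle(options)
    return options


def rank_of_correct(scores: Sequence[float], correct_index: int) -> int:
    """1-based rank of the correct option under descending score.
    Ties are counted against the model (pessimistic rank)."""
    target = scores[correct_index]
    return 1 + sum(1 for i, s in enumerate(scores)
                   if i != correct_index and s >= target)


def ranking_metrics(ranks: Sequence[int], ks: Sequence[int] = (1, 4)) -> Dict[str, float]:
    """Mean Reciprocal Rank and Hit@k over the ranks of the correct options."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    metrics = {"MRR": sum(1.0 / r for r in ranks) / len(ranks)}
    for k in ks:
        metrics[f"Hit@{k}"] = sum(r <= k for r in ranks) / len(ranks)
    return metrics


# Toy scores standing in for model log-probabilities; index 0 is the correct option.
toy_ranks = [
    rank_of_correct([-0.2, -1.5, -3.0], correct_index=0),
    rank_of_correct([-2.0, -0.1, -0.5], correct_index=0),
]
print(ranking_metrics(toy_ranks))
```

Counting ties against the model is a conservative design choice; if your scoring produces many exact ties, averaging the tied ranks may be a fairer alternative.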

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: CC BY-SA 4.0

Code URLs

https://github.com/MikeGu721/XiezhiBenchmark

Data URLs

https://github.com/MikeGu721/XiezhiBenchmark

Risks & Boundaries

Limitations

Strong Chinese-source bias: many questions originate from Chinese exams and surveys.

English translation done by Google Translate plus manual edits; domain terms may be imperfect.

When Not To Use

Do not use Xiezhi as the sole metric for conversational or instruction-following ability.

Avoid using Xiezhi-All (auto-annotated) as definitive ground truth without manual checks.

Failure Modes

Auto-annotation misses secondary labels, producing incomplete discipline tags.

Generative probability ranking can be expensive to compute at scale for large option sets.
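
To get a rough sense of the ranking cost, the sketch below counts option-level scoring passes for the dataset sizes quoted in this summary. It assumes one scoring pass per (question, option) pair, which is a simplification: batching or prefix caching in a real inference stack would change the actual cost. The function name is illustrative.

```python
# Dataset sizes as quoted in this summary.
DATASET_SIZES = {
    "Xiezhi-All": 249_587,
    "Xiezhi-Specialty": 14_041,
    "Xiezhi-Interdiscipline": 10_746,
}


def scoring_passes(num_questions: int, num_options: int) -> int:
    """Scoring passes needed, assuming one pass per (question, option) pair."""
    return num_questions * num_options


for name, size in DATASET_SIZES.items():
    few = scoring_passes(size, 4)
    many = scoring_passes(size, 50)
    print(f"{name}: {few:,} passes at 4 options, {many:,} at 50 options")
```

Under this assumption, Xiezhi-All needs about 12.5 million scoring passes at 50 options versus roughly 1 million at 4, which is why the curated subsets are the more practical starting point.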

Core Entities

Models

GPT-4, ChatGPT, LLaMA, BLOOM, BLOOMZ, GPT-NeoX, Pythia, Vicuna, Baize, BELLE, DoctorGLM

Metrics

MRR, Hit@1, Hit@4, Accuracy, Mean Rank

Datasets

Xiezhi-All, Xiezhi-Meta, Xiezhi-Train, Xiezhi-Specialty, Xiezhi-Interdiscipline, MMLU, C-Eval, M3KE

Benchmarks

Xiezhi, MMLU, C-Eval, M3KE, BIG-bench

Context Entities

Models

GPT-3.5, Falcon, StableLM, MOSS, H2O-GPT

Datasets

Chinese Graduate Entrance Exams, Open academic surveys/reviews (question generation source)
