Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.3
Citation Count
9
Why It Matters For Business
Xiezhi gives a broad, hard-to-game way to measure domain knowledge across many fields; it helps product and engineering teams spot domain blind spots and track small improvements in LLMs over time.
Summary TLDR
Xiezhi is a large, auto-updating benchmark for domain knowledge in LLMs. It contains 249,587 multiple-choice questions across 516 disciplines (13 top-level categories). The authors provide two curated subsets (Xiezhi-Specialty: 14,041 Qs; Xiezhi-Interdiscipline: 10,746 Qs) and an annotated meta set (20,124 Qs). The evaluation protocol is new: each question has 50 answer options, and options are ranked by generative probability and scored with Mean Reciprocal Rank (MRR). The repo and code are public. The paper evaluates 47 LLMs and finds that GPT-4 tops Xiezhi (MRR≈0.43), and that LLMs beat average human performance in some STEM and art domains but lag in economics, law, pedagogy, literature, history, and management.
Problem Statement
Existing knowledge benchmarks are too small, too narrow, or quickly become part of LLM training data. They also use short multiple-choice formats (4 options) and answer-extraction prompts that favor models trained on MCQ-style data. We need a broader, fresher, and fairer way to measure domain knowledge and to reveal fine-grained differences among LLMs.
Main Contribution
Xiezhi-All: 249,587 multi-choice questions spanning 516 disciplines (13 categories).
Auto-updating pipeline: the human-verified Xiezhi-Meta set (20,124 Qs) is used to train an annotator that labels ~170k exam Qs and ~80k generated Qs.
New evaluation protocol: 50 options per question, with options ranked by generative probability and MRR as the primary metric.
Two curated English/Chinese subsets: Xiezhi-Specialty (14,041 Qs) and Xiezhi-Interdiscipline (10,746 Qs).
Benchmarking study: evaluation of 47 public and API LLMs; code and data released on GitHub.
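The ranking step of the evaluation protocol can be illustrated with a minimal sketch. It assumes each option already has a score, such as a log-probability from the model under test; the function names, the pessimistic tie-breaking rule, and the example scores are illustrative assumptions, not the authors' code.

```python
from typing import Sequence


def rank_of_correct(scores: Sequence[float], correct_index: int) -> int:
    """Return the 1-based rank of the correct option, highest score first.

    Ties are broken pessimistically (an assumption of this sketch): any other
    option scoring at least as high as the correct one counts as ranked ahead.
    """
    target = scores[correct_index]
    ahead = sum(
        1 for i, s in enumerate(scores) if i != correct_index and s >= target
    )
    return ahead + 1


def reciprocal_rank(scores: Sequence[float], correct_index: int) -> float:
    """Reciprocal rank for one question: 1.0 if ranked first, 0.5 if second, ..."""
    return 1.0 / rank_of_correct(scores, correct_index)


# Hypothetical log-probabilities for a 5-option question (illustrative values).
example_scores = [-4.2, -1.3, -2.7, -0.9, -3.5]
example_rank = rank_of_correct(example_scores, correct_index=1)
```

In this example only one other option outscores the correct one, so the correct option is ranked second and contributes a reciprocal rank of 0.5.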
Key Findings
Xiezhi is very large and multi-disciplinary.
Moving from 4 to 50 options lowers the score achievable by random guessing and exposes model gaps.
Ranking options by generative probability gives a more reliable score for generative models.
LLMs outperform average human practitioners in some domains but underperform in others.
Auto-annotation reduces human effort but misses fine-grained labels.
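The effect of the option count on the random-guess baseline can be worked out directly. The sketch below assumes a guesser that places the correct option at every rank from 1 to n with equal probability, so the expected reciprocal rank is (1/n) × (1 + 1/2 + ... + 1/n). The function names are illustrative.

```python
from fractions import Fraction


def random_guess_mrr(n_options: int) -> float:
    """Expected reciprocal rank under a uniformly random ranking of n options."""
    harmonic = sum(Fraction(1, k) for k in range(1, n_options + 1))
    return float(harmonic / n_options)


def random_guess_accuracy(n_options: int) -> float:
    """Chance that a uniformly random pick is the correct option."""
    return 1.0 / n_options


for n in (4, 50):
    print(n, round(random_guess_mrr(n), 4), round(random_guess_accuracy(n), 4))
```

Under this assumption the expected MRR falls from about 0.52 with 4 options to about 0.09 with 50, and chance accuracy falls from 0.25 to 0.02, which leaves far more of the score range available for separating models.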
Results
Dataset size: 249,587 questions (Xiezhi-All)
Random-guess MRR (50 options)
Top model overall MRR: ≈0.43 (GPT-4)
Human benchmarks (example: Science)
Who Should Care
What To Try In 7 Days
Run Xiezhi-Specialty on your model to profile domain strengths and weaknesses.
Switch MCQ evaluation to probability-ranking (MRR) for generative models.
Add more distractors to key MCQ tests to reduce luck-driven gains (try 20–50 options).
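One way to try the distractor suggestion above is to pad each question with answers drawn from other questions. This is a minimal sketch under that assumption: `build_options`, the answer pool, and the sampling strategy are hypothetical and may differ from how Xiezhi builds its 50-option sets.

```python
import random
from typing import List, Sequence, Tuple


def build_options(
    correct: str,
    answer_pool: Sequence[str],
    n_options: int,
    rng: random.Random,
) -> Tuple[List[str], int]:
    """Return (options, index_of_correct) with exactly n_options entries.

    Distractors are sampled without replacement from answer_pool, skipping
    any entry identical to the correct answer.
    """
    # Deduplicate and sort so the result is reproducible for a given seed.
    candidates = sorted({a for a in answer_pool if a != correct})
    if len(candidates) < n_options - 1:
        raise ValueError("answer pool too small for the requested option count")
    options = rng.sample(candidates, n_options - 1) + [correct]
    rng.shuffle(options)  # avoid leaking the answer through its position
    return options, options.index(correct)


# Usage with a hypothetical pool of 100 answers and 20 options per question.
pool = [f"answer {i}" for i in range(100)]
options, correct_index = build_options("answer 7", pool, 20, random.Random(0))
```

Randomly sampled distractors are often easy to rule out, so distractors drawn from the same discipline are likely a harder and fairer test where such labels are available.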
Reproducibility
License
- CC BY-SA 4.0
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Strong Chinese-source bias: many questions originate from Chinese exams and surveys.
- English translation done by Google Translate plus manual edits; domain terms may be imperfect.
- Auto-annotated portions (Xiezhi-All) contain label noise and missing fine-grained labels.
- Human baselines are derived from public exam averages and are noisy proxies for real practitioner ability.
- Sensitive topics and deeply Chinese cultural items were removed, reducing some domain coverage.
When Not To Use
- Do not use Xiezhi as the sole metric for conversational or instruction-following ability.
- Avoid using Xiezhi-All (auto-annotated) as definitive ground truth without manual checks.
- Not ideal for culture-specific knowledge evaluation outside Chinese/English without further localization.
Failure Modes
- Auto-annotation misses secondary labels, producing incomplete discipline tags.
- Generative probability ranking can be expensive to compute at scale for large option sets.
- High-option MCQs may over-penalize models that are good at open-ended reasoning but poor at ranking many distractors.
- Translation artifacts can cause English-evaluated models to misinterpret questions.
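The cost concern in the failure modes above can be made concrete with a rough count of scoring passes. The sketch assumes one model scoring pass per (question, option) pair, with no batching or prompt-prefix caching; real cost depends on the serving setup. The subset sizes are the ones quoted in this summary.

```python
def scoring_passes(n_questions: int, n_options: int) -> int:
    """Number of (question, option) pairs to score under the naive assumption."""
    return n_questions * n_options


# Question counts as quoted in this summary.
subset_sizes = {
    "Xiezhi-Specialty": 14_041,
    "Xiezhi-Interdiscipline": 10_746,
    "Xiezhi-All": 249_587,
}

passes_at_50 = {name: scoring_passes(n, 50) for name, n in subset_sizes.items()}
passes_at_4 = {name: scoring_passes(n, 4) for name, n in subset_sizes.items()}
```

With 50 options, Xiezhi-Specialty alone needs 702,050 scoring passes under this assumption, 12.5 times the count for a 4-option format.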
Core Entities
Models
- GPT-4
- ChatGPT
- LLaMA
- BLOOM
- BLOOMZ
- GPT-NeoX
- Pythia
- Vicuna
- Baize
- BELLE
- DoctorGLM
Metrics
- MRR
- Hit@1
- Hit@4
- Accuracy
- Mean Rank
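The metrics listed above can all be derived from the per-question rank of the correct option. The sketch below assumes 1-based ranks and treats Accuracy as equivalent to Hit@1 in the ranking setting; that equivalence and the function name are assumptions of this sketch, not definitions taken from the paper.

```python
from typing import Dict, Sequence


def summarize_ranks(ranks: Sequence[int]) -> Dict[str, float]:
    """Aggregate 1-based ranks of the correct option into summary metrics."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    n = len(ranks)
    return {
        "mrr": sum(1.0 / r for r in ranks) / n,       # mean reciprocal rank
        "hit@1": sum(1 for r in ranks if r <= 1) / n,  # correct option ranked first
        "hit@4": sum(1 for r in ranks if r <= 4) / n,  # correct option in the top 4
        "mean_rank": sum(ranks) / n,                   # lower is better
    }


# Hypothetical ranks for four questions.
summary = summarize_ranks([1, 2, 5, 10])
```

For these four hypothetical ranks the summary works out to an MRR of 0.45, Hit@1 of 0.25, Hit@4 of 0.5, and a mean rank of 4.5.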
Datasets
- Xiezhi-All
- Xiezhi-Meta
- Xiezhi-Train
- Xiezhi-Specialty
- Xiezhi-Interdiscipline
- MMLU
- C-Eval
- M3KE
Benchmarks
- Xiezhi
- MMLU
- C-Eval
- M3KE
- BIG-bench
Context Entities
Models
- GPT-3.5
- Falcon
- StableLM
- MOSS
- H2O-GPT
Datasets
- Chinese Graduate Entrance Exams
- Open academic surveys/reviews (question generation source)

