Overview
Xiezhi is ready for benchmarking research and internal model comparisons; teams should validate curated subsets for critical applications because auto-annotated parts can contain label noise.
Citations: 9
Evidence Strength: 0.80
Confidence: 0.82
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/4
Reproducibility
Status: Code + data available
Open source: Yes
License: CC BY-SA 4.0
At A Glance
Cost impact: 30%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Xiezhi gives a broad, hard-to-game way to measure domain knowledge across many fields; it helps product and engineering teams spot domain blind spots and track small improvements in LLMs over time.
Who Should Care
Summary TLDR
Xiezhi is a large, auto-updating benchmark for domain knowledge in LLMs. It contains 249,587 multiple-choice questions across 516 disciplines (13 top-level categories). The authors provide two curated subsets (Xiezhi-Specialty: 14,041 Qs; Xiezhi-Interdiscipline: 10,746 Qs) and an annotated meta set (20,124 Qs). It introduces two evaluation rules: each question has 50 answer options, and options are ranked by generative probability and scored with Mean Reciprocal Rank (MRR). The repo and code are public. The paper evaluates 47 LLMs and finds that GPT-4 ranks first on Xiezhi (MRR≈0.43), while LLMs beat average human performance in some STEM and art domains but lag in economics, law, pedagogy, literature, history, and management. The dataset,
Problem Statement
Existing knowledge benchmarks are too small, too narrow, or quickly become part of LLM training data. They also use short multiple-choice formats (4 options) and answer-extraction prompts that favor models trained on MCQ-style data. We need a broader, fresher, and fairer way to measure domain knowledge and to reveal fine-grained differences among LLMs.
Main Contribution
Xiezhi-All: 249,587 multiple-choice questions spanning 516 disciplines (13 categories).
Auto-updating pipeline: human-verified Xiezhi-Meta (20,124 Qs) trains an annotator to label ~170k exam Qs and ~80k generated Qs.
Key Findings
Xiezhi is very large and multi-disciplinary.
Moving from 4 to 50 options lowers the random-guess baseline, so lucky guesses contribute less to the score and gaps between models become visible.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Dataset size | 249,587 questions; 516 disciplines | — | — | Xiezhi-All | Abstract; Dataset Construction | Abstract; Dataset Construction |
| Random-guess MRR (50 options) | 0.089 | 4-option random guessing (higher effective baseline) | — | Xiezhi evaluation setting | Table 1 (Random-Guess rows) | Table 1 |
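
The random-guess row can be sanity-checked with a short calculation. The sketch below assumes the correct option is equally likely to land at any rank from 1 to N, so the expected reciprocal rank is the N-th harmonic number divided by N. The function name `random_guess_mrr` is illustrative and does not come from the Xiezhi repo.

```python
from fractions import Fraction


def random_guess_mrr(num_options: int) -> float:
    """Expected reciprocal rank when the gold option's rank is uniform on 1..num_options."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    # Sum of 1/k for k = 1..N (the N-th harmonic number), kept exact with Fraction.
    harmonic = sum(Fraction(1, k) for k in range(1, num_options + 1))
    return float(harmonic / num_options)


if __name__ == "__main__":
    for n in (4, 50):
        print(f"{n} options: expected random-guess MRR = {random_guess_mrr(n):.4f}")
```

Under this assumption the value is roughly 0.52 for 4 options and roughly 0.090 for 50 options, which is close to the 0.089 listed in the table.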
What To Try In 7 Days
Run Xiezhi-Specialty on your model to profile domain strengths and weaknesses.
Switch MCQ evaluation to probability-ranking (MRR) for generative models.
Add more distractors to key MCQ tests to reduce luck-driven gains (try 20–50 options).
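As a starting point for the probability-ranking step, here is a minimal sketch of MRR computed from per-option scores. It assumes you already have one score per option (for example, the log-probability your model assigns to each option text). The function names and the tie-handling rule are illustrative assumptions, not the paper's implementation.

```python
from typing import Sequence


def reciprocal_rank(option_scores: Sequence[float], gold_index: int) -> float:
    """Return 1/rank of the gold option, where a higher score means a better rank."""
    gold_score = option_scores[gold_index]
    # Rank is 1 plus the number of options scored strictly higher than the gold one.
    # Assumption: ties are resolved in favor of the gold option (optimistic ranking).
    rank = 1 + sum(1 for score in option_scores if score > gold_score)
    return 1.0 / rank


def mean_reciprocal_rank(
    all_scores: Sequence[Sequence[float]], gold_indices: Sequence[int]
) -> float:
    """Average the reciprocal rank over a set of questions."""
    if not all_scores or len(all_scores) != len(gold_indices):
        raise ValueError("need one gold index per non-empty list of option scores")
    total = sum(reciprocal_rank(s, g) for s, g in zip(all_scores, gold_indices))
    return total / len(all_scores)


if __name__ == "__main__":
    # Two toy questions with three options each (hypothetical log-probabilities).
    scores = [[-2.1, -0.3, -1.5], [-0.2, -1.0, -0.5]]
    gold = [1, 2]
    print(mean_reciprocal_rank(scores, gold))
```

In the toy example the gold option ranks first in one question and second in the other. The same functions work unchanged with 20 to 50 options per question.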
Risks & Boundaries
Limitations
Strong Chinese-source bias: many questions originate from Chinese exams and surveys.
English translations were produced with Google Translate plus manual edits; translations of domain terms may be imperfect.
When Not To Use
Do not use Xiezhi as the sole metric for conversational or instruction-following ability.
Avoid using Xiezhi-All (auto-annotated) as definitive ground truth without manual checks.
Failure Modes
Auto-annotation misses secondary labels, producing incomplete discipline tags.
Generative probability ranking can be expensive to compute at scale for large option sets.
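To get a feel for the compute cost, the rough estimate below counts option-scoring passes. It assumes one scoring pass per option per question with no prefix caching or batching tricks, which may overstate the cost of an optimized setup. The split sizes are the ones reported above, and the helper name `scoring_passes` is hypothetical.

```python
def scoring_passes(num_questions: int, num_options: int = 50) -> int:
    """Estimate scoring passes, assuming one pass per (question, option) pair."""
    if num_questions < 0 or num_options < 1:
        raise ValueError("num_questions must be >= 0 and num_options >= 1")
    return num_questions * num_options


# Question counts as reported for each split.
SPLIT_SIZES = {
    "Xiezhi-Specialty": 14_041,
    "Xiezhi-Interdiscipline": 10_746,
    "Xiezhi-All": 249_587,
}

if __name__ == "__main__":
    for name, size in SPLIT_SIZES.items():
        print(f"{name}: {scoring_passes(size):,} passes at 50 options")
```

Under this assumption, the curated subsets need a few hundred thousand passes each, while Xiezhi-All needs over twelve million, which is one reason to start with a curated subset.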

