Head-to-Tail: an 18K-question benchmark showing LLMs are far from perfect on factual knowledge, especially long-tail facts.

August 20, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark is comprehensive and well-instrumented; evaluations across 16 public LLMs give consistent patterns, but results depend on prompt choices and training-data overlaps.

Citations: 12

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 60%

Authors

Kai Sun, Yifan Ethan Xu, Hanwen Zha, Yue Liu, Xin Luna Dong

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs do not reliably store factual knowledge: product features that assume accurate factual recall (search, knowledge APIs, assistants) should keep symbolic knowledge sources or retrieval layers for long-tail and critical facts.

Who Should Care

Summary TLDR

The authors release Head-to-Tail, an 18K QA benchmark built from DBpedia, IMDb, Goodreads, MAG and DBLP to measure how much factual knowledge LLMs confidently internalize. Evaluating 16 public LLMs, they find that even the best model (GPT-4) reaches only about 31% accuracy overall, and that accuracy falls steadily from head to torso to tail entities. Instruction tuning, model size, and basic prompting do not reliably fix missing or hallucinated facts. The paper argues for hybrid 'Dual Neural KGs' that keep symbolic triples for long-tail and recent facts and use LLMs for smooth conversation.

Problem Statement

Can we measure how much factual knowledge LLMs actually store inside their parameters, and do LLMs already replace symbolic knowledge graphs for popular and long-tail facts?

Main Contribution

Head-to-Tail: an 18,171-question benchmark stratified by entity popularity (head/torso/tail) across DBpedia, Movie, Book, and Academics domains.

A practical, automated evaluation pipeline using three metrics—accuracy (A), hallucination rate (H), and missing rate (M)—and both rule-based and LLM-based judges.
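The three metrics can be illustrated with a minimal sketch. It assumes each answer has already been judged into exactly one of three categories, so that A, H, and M sum to 1. The label strings and the `summarize` helper are hypothetical names for illustration, not the authors' code or judge output format.

```python
from collections import Counter

# Hypothetical judgment labels (an assumption, not the paper's format).
CORRECT, HALLUCINATED, MISSING = "correct", "hallucinated", "missing"


def summarize(judgments):
    """Return accuracy (A), hallucination rate (H), and missing rate (M) as fractions."""
    if not judgments:
        raise ValueError("need at least one judged answer")
    counts = Counter(judgments)
    total = len(judgments)
    return {
        "A": counts[CORRECT] / total,       # answered and judged correct
        "H": counts[HALLUCINATED] / total,  # answered but judged wrong
        "M": counts[MISSING] / total,       # abstained ('unsure') or empty
    }


example = [CORRECT, HALLUCINATED, MISSING, MISSING]
print(summarize(example))  # {'A': 0.25, 'H': 0.25, 'M': 0.5}
```

Reporting H and M separately matters because a model can lower its hallucination rate simply by abstaining more often, which accuracy alone would not reveal.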

Key Findings

Best overall QA accuracy on Head-to-Tail is low.

Numbers: GPT-4 ALM (LLM-judged accuracy) = 30.9% (Table 3)

Practical Use: Do not assume an LLM reliably knows factual answers; validate facts or add retrieval before using LLMs as authoritative sources.

Evidence Ref: Table 3, Section 3.2

Accuracy declines from head to torso to tail entities.

Numbers: GPT-4 ALM: head 40.3% → torso 33.4% → tail 19.0% (domain-aggregated)

Practical Use: Expect worse LLM performance on less-popular or long-tail entities; add external knowledge for those cases.

Evidence Ref: Section 3.3, GPT-4 breakdown (figure/table)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 30.9% | - | - | Head-to-Tail (all) | Table 3: GPT-4 ALM = 30.9% | Table 3
Accuracy | head 40.3% / torso 33.4% / tail 19.0% | - | head→tail −21.3pp | GPT-4, Head-to-Tail (domain-aggregated) | Section 3.3, GPT-4 per-bucket breakdown | Section 3.3
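The head-to-tail delta is simple arithmetic over the per-bucket numbers reported for GPT-4. The short sketch below reproduces it; the `gap_pp` helper is a hypothetical name, and the only inputs are the three percentages quoted in this summary.

```python
# Per-bucket GPT-4 accuracy (percent) as quoted in this summary.
gpt4_alm = {"head": 40.3, "torso": 33.4, "tail": 19.0}


def gap_pp(scores, start, end):
    """Change in accuracy from one bucket to another, in percentage points."""
    return round(scores[end] - scores[start], 1)


print(gap_pp(gpt4_alm, "head", "torso"))  # -6.9
print(gap_pp(gpt4_alm, "torso", "tail"))  # -14.4
print(gap_pp(gpt4_alm, "head", "tail"))   # -21.3
```

The torso-to-tail step accounts for roughly two thirds of the total drop, so the decline is not evenly spread across buckets.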

What To Try In 7 Days

Run Head-to-Tail on your LLM to quantify head/torso/tail gaps.

Add a retrieval step (documents or KG) for torso/tail queries and compare accuracy.

Prompt the model to return 'unsure' for low-confidence answers to reduce hallucination in UIs.
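The last two steps can be combined into one small routing sketch: ask the model with an abstention instruction, and fall back to retrieval when it answers 'unsure'. Everything here is an assumption for illustration. The prompt wording is hypothetical rather than the paper's exact prompt, and `ask_llm` and `retrieve` are placeholder callables you would replace with your own model client and retrieval layer.

```python
# Hypothetical prompt wording (an assumption, not the paper's exact prompt).
PROMPT_TEMPLATE = (
    "Answer the following question in as few words as possible. "
    "Say 'unsure' if you don't know.\n"
    "Question: {question}\nAnswer:"
)


def build_prompt(question):
    """Fill the question into the abstention-aware prompt."""
    return PROMPT_TEMPLATE.format(question=question)


def is_missing(answer):
    """Treat empty answers and a bare 'unsure' as abstentions."""
    normalized = answer.strip().strip(".,!?'\"").strip().lower()
    return normalized in ("", "unsure")


def route(question, ask_llm, retrieve):
    """Ask the model first; fall back to retrieval when it abstains."""
    answer = ask_llm(build_prompt(question))
    if is_missing(answer):
        return retrieve(question), "retrieval"
    return answer.strip(), "llm"


# Stub callables standing in for a real model and a real KG/document lookup.
print(route("Who directed X?", lambda p: " Unsure. ", lambda q: "answer from KG"))
# ('answer from KG', 'retrieval')
```

Logging the second element of the returned tuple per popularity bucket gives a direct view of how often torso and tail queries end up relying on retrieval.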

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Head-to-Tail (planned release) and sources: DBpedia, IMDb, Goodreads, MAG, DBLP (snapshots referenced in paper)

Risks & Boundaries

Limitations

Does not evaluate taxonomy or deep type hierarchies (left as future work).

Benchmark focuses on concise question forms; paraphrase robustness and other query styles are not exhaustively tested.

When Not To Use

Do not rely on raw LLM outputs as the sole source for factual answers on long-tail or critical data.

Avoid using these raw LLM recall numbers to prove general knowledge—results are specific to the Head-to-Tail templates and snapshots used.

Failure Modes

Hallucination: confidently wrong answers, especially in some base models.

Missing answers: conservative models return 'unsure' and omit useful facts.

Core Entities

Models

GPT-4; ChatGPT (gpt-3.5-turbo); LLaMA (7B/13B/33B/65B); Llama 2 (70B); Vicuna (7B/13B); Flan-T5 (3B/11B); RWKV (7B); Falcon (7B/40B); Falcon-Instruct (7B/40B)

Metrics

Accuracy; HLM (LLM-judged hallucination rate); M (missing rate); AEM (exact match); AF1 (token F1); ARL (ROUGE-L)
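For readers unfamiliar with the rule-based scores, the sketch below shows one common way to compute exact match, token F1, and ROUGE-L between a predicted and a gold answer. The normalization (lowercasing, stripping punctuation and articles) follows the widely used SQuAD convention and ROUGE-L is computed as an F1 over the longest common subsequence; both are assumptions, and the paper's exact normalization may differ.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, split into tokens (SQuAD-style; an assumption)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def f1_from(overlap, pred_len, gold_len):
    """Harmonic mean of precision and recall for a given overlap count."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / pred_len, overlap / gold_len
    return 2 * precision * recall / (precision + recall)


def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))


def token_f1(pred, gold):
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())  # bag-of-tokens overlap
    return f1_from(overlap, len(p), len(g))


def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(pred, gold):
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    return f1_from(lcs_len(p, g), len(p), len(g))


print(exact_match("The Godfather", "godfather"))  # 1.0
```

These string-overlap scores are cheap to compute but penalize correct answers that are phrased differently from the gold answer, which is one reason an LLM-based judge is reported alongside them.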

Datasets

Head-to-Tail (18,171 QA); DBpedia (Dec 1, 2022 snapshot); IMDb (May 21, 2023 snapshot); Goodreads (2017 scrape); MAG (Sep 13, 2021); DBLP (May 10, 2023)

Benchmarks

Head-to-Tail