Overview
Production Readiness
0.3
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
12
Why It Matters For Business
LLMs do not reliably store factual knowledge. Product features that assume accurate factual recall (search, knowledge APIs, assistants) should keep symbolic knowledge sources or retrieval layers for long-tail and critical facts.
Summary TLDR
The authors release Head-to-Tail, an 18K QA benchmark built from DBpedia, IMDb, Goodreads, MAG and DBLP to measure how much factual knowledge LLMs confidently internalize. Evaluating 16 public LLMs, they find even the best model (GPT-4) scores ~31% overall and accuracy falls from head -> torso -> tail entities. Instruction tuning, model size, and basic prompting do not reliably fix missing or hallucinated facts. The paper argues for hybrid 'Dual Neural KGs' that keep symbolic triples for long-tail/recent facts and LLMs for smooth conversation.
Problem Statement
Can we measure how much factual knowledge LLMs actually store inside their parameters, and do LLMs already replace symbolic knowledge graphs for popular and long-tail facts?
Main Contribution
Head-to-Tail: an 18,171-question benchmark stratified by entity popularity (head/torso/tail) across DBpedia, Movie, Book, and Academics domains.
A practical, automated evaluation pipeline using three metrics—accuracy (A), hallucination rate (H), and missing rate (M)—and both rule-based and LLM-based judges.
A comprehensive evaluation of 16 public LLMs showing clear gaps in factual recall, and a discussion proposing hybrid symbolic+neural knowledge systems.
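The three metrics above can be illustrated with a minimal sketch. It assumes each model answer has already been judged and given exactly one of three labels, so that A, H, and M sum to 1. The label strings and the `score` function are hypothetical names chosen for illustration; they are not taken from the paper's pipeline.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical label names: one label per judged answer.
LABELS = ("correct", "hallucinated", "missing")


def score(labels: Iterable[str]) -> Dict[str, float]:
    """Return accuracy (A), hallucination rate (H), and missing rate (M).

    Assumes every answer carries exactly one label, so A + H + M == 1.
    """
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels to score")
    return {
        "A": counts["correct"] / total,
        "H": counts["hallucinated"] / total,
        "M": counts["missing"] / total,
    }


if __name__ == "__main__":
    # Toy labels, not results from the paper.
    print(score(["correct", "hallucinated", "missing", "missing"]))
```

Running `score` separately on the head, torso, and tail subsets gives the per-bucket breakdown the benchmark is designed for.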
Key Findings
Best overall QA accuracy on Head-to-Tail is low: even the strongest model (GPT-4) reaches only about 31%.
Accuracy declines from head to torso to tail entities.
Different models show different failure modes: some tend to answer 'unsure', others tend to hallucinate.
Instruction tuning and larger parameter count do not guarantee higher factual recall.
Prompting for 'unsure' and concise answers reduces hallucination and stabilizes outputs.
LLM-based correctness checks align strongly with rule-based metrics.
Results
Accuracy
Accuracy
High hallucination in some base models (H_LM)
Effect of 'unsure' option on hallucination
Accuracy
Who Should Care
What To Try In 7 Days
Run Head-to-Tail on your LLM to quantify head/torso/tail gaps.
Add a retrieval step (documents or KG) for torso/tail queries and compare accuracy.
Prompt the model to return 'unsure' for low-confidence answers to reduce hallucination in UIs.
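The last two steps can be combined into one small routing pattern, sketched below under stated assumptions. The prompt wording is illustrative and is not claimed to be the paper's exact prompt. `llm` and `lookup` are hypothetical callables standing in for your model client and your retrieval or KG layer.

```python
from typing import Callable, Optional, Tuple


def build_prompt(question: str) -> str:
    """Ask for a concise answer and offer an explicit 'unsure' option."""
    return (
        "Answer the following question in as few words as possible. "
        "Say \"unsure\" if you don't know.\n"
        f"Question: {question}\nAnswer:"
    )


def is_unsure(answer: str) -> bool:
    """Treat an empty reply or a bare 'unsure' as a missing answer."""
    text = answer.strip().lower().rstrip(".")
    return text in ("", "unsure")


def answer_with_fallback(
    question: str,
    llm: Callable[[str], str],               # hypothetical: prompt -> reply
    lookup: Callable[[str], Optional[str]],  # hypothetical: question -> fact or None
) -> Tuple[str, str]:
    """Return (answer, source); use retrieval only when the model abstains."""
    reply = llm(build_prompt(question))
    if not is_unsure(reply):
        return reply.strip(), "llm"
    fact = lookup(question)
    if fact is not None:
        return fact, "retrieval"
    return "unsure", "none"
```

Returning the source alongside the answer lets a UI show where a fact came from, and makes it easy to compare accuracy with and without the retrieval step.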
Reproducibility
Data Urls
- Head-to-Tail (planned release) and sources: DBpedia, IMDb, Goodreads, MAG, DBLP (snapshots referenced in paper)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Does not evaluate taxonomy or deep type hierarchies (left as future work).
- Benchmark focuses on concise question forms; paraphrase robustness and other query styles are not exhaustively tested.
- Possible overlap between LLM training data and benchmark items (training-data leakage) not fully controlled.
When Not To Use
- Do not rely on raw LLM outputs as the sole source for factual answers on long-tail or critical data.
- Do not treat these raw recall numbers as proof of a model's general knowledge; results are specific to the Head-to-Tail templates and snapshots used.
Failure Modes
- Hallucination: confidently wrong answers, especially in some base models.
- Missing answers: conservative models return 'unsure' and omit useful facts.
- Domain gaps: poor accuracy for academic/long-tail domains.
Core Entities
Models
- GPT-4
- ChatGPT (gpt-3.5-turbo)
- LLaMA (7B/13B/33B/65B)
- Llama 2 (70B)
- Vicuna (7B/13B)
- Flan-T5 (3B/11B)
- RWKV (7B)
- Falcon (7B/40B)
- Falcon-Instruct (7B/40B)
Metrics
- Accuracy
- H_LM (LLM-judged hallucination rate)
- M (missing rate)
- A_EM (accuracy by exact match)
- A_F1 (accuracy by token F1)
- A_RL (accuracy by ROUGE-L)
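For readers unfamiliar with the rule-based scores, here is a minimal sketch of exact match, token F1, and ROUGE-L between a predicted and a gold answer. The normalization (lowercasing, stripping ASCII punctuation, whitespace tokenization) is an assumption made for illustration; this summary does not state the paper's exact normalization or how scores are thresholded into correct or incorrect.

```python
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop ASCII punctuation, split on whitespace (assumed scheme)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return text.split()


def _f_measure(overlap: int, pred_len: int, gold_len: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / pred_len, overlap / gold_len
    return 2 * precision * recall / (precision + recall)


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """F1 over the bag of tokens shared by prediction and gold answer."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    return _f_measure(overlap, len(p), len(g))


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(pred: str, gold: str) -> float:
    """ROUGE-L F-measure: like token F1, but the overlap must preserve order."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    return _f_measure(lcs_length(p, g), len(p), len(g))
```

Token F1 ignores word order while ROUGE-L does not, so a reordered answer such as "nolan christopher" scores 1.0 on token F1 but only 0.5 on ROUGE-L against "christopher nolan".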
Datasets
- Head-to-Tail (18,171 QA)
- DBpedia (Dec 1, 2022 snapshot)
- IMDb (May 21, 2023 snapshot)
- Goodreads (2017 scrape)
- MAG (Sep 13, 2021)
- DBLP (May 10, 2023)
Benchmarks
- Head-to-Tail

