Overview
The benchmark is comprehensive and well-instrumented. Evaluations across 16 public LLMs show consistent patterns, but results depend on prompt choices and on possible overlap between the benchmark and the models' training data.
Citations: 12
Evidence Strength: 0.90
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 2/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 30%
Novelty: 60%
Why It Matters For Business
LLMs do not reliably store factual knowledge: product features that assume accurate factual recall (search, knowledge APIs, assistants) should keep symbolic knowledge sources or retrieval layers for long-tail and critical facts.
Summary TLDR
The authors release Head-to-Tail, an 18K QA benchmark built from DBpedia, IMDb, Goodreads, MAG and DBLP to measure how much factual knowledge LLMs confidently internalize. Evaluating 16 public LLMs, they find even the best model (GPT-4) scores ~31% overall and accuracy falls from head -> torso -> tail entities. Instruction tuning, model size, and basic prompting do not reliably fix missing or hallucinated facts. The paper argues for hybrid 'Dual Neural KGs' that keep symbolic triples for long-tail/recent facts and LLMs for smooth conversation.
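The hybrid idea in the last sentence can be illustrated with a minimal sketch. This is not the paper's architecture: it only shows a symbolic-first lookup that falls back to an LLM. The `answer` function, the dictionary-based triple store, and the `llm` callable are all hypothetical names introduced here, and the entries are placeholders rather than real facts.

```python
from typing import Callable, Dict, Tuple

# Hypothetical triple store: (subject, predicate) -> object.
TripleStore = Dict[Tuple[str, str], str]


def answer(
    subject: str,
    predicate: str,
    kg: TripleStore,
    llm: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Return (answer, source), preferring the symbolic store over the LLM."""
    key = (subject, predicate)
    if key in kg:
        # The fact is stored explicitly, so no model recall is needed.
        return kg[key], "kg"
    # Otherwise fall back to whatever the model produces (it may be 'unsure').
    return llm(subject, predicate), "llm"


# Placeholder data and a stub model, for illustration only.
kg = {("entity_1", "predicate_1"): "value_1"}
stub_llm = lambda subject, predicate: "unsure"

print(answer("entity_1", "predicate_1", kg, stub_llm))
print(answer("entity_2", "predicate_1", kg, stub_llm))
```

Returning the source alongside the answer lets a product distinguish stored facts from model recall, which matters most for tail entities.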
Problem Statement
Can we measure how much factual knowledge LLMs actually store inside their parameters, and do LLMs already replace symbolic knowledge graphs for popular and long-tail facts?
Main Contribution
Head-to-Tail: an 18,171-question benchmark stratified by entity popularity (head/torso/tail) across DBpedia, Movie, Book, and Academics domains.
A practical, automated evaluation pipeline using three metrics: accuracy (A), hallucination rate (H), and missing rate (M), scored by both rule-based and LLM-based judges.
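As a rough sketch of how the three metrics relate, the snippet below computes A, H, and M from per-question judge labels. It assumes each question receives exactly one label, so the three rates sum to 100%. The label names and the `score` function are hypothetical, not taken from the benchmark's code.

```python
from collections import Counter

# Hypothetical label names; one label per judged question.
LABELS = ("correct", "hallucinated", "missing")


def score(labels):
    """Return (A, H, M) as percentages of all judged questions."""
    if not labels:
        raise ValueError("no labels to score")
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    n = len(labels)
    return tuple(100.0 * counts[name] / n for name in LABELS)


# Toy labels, for illustration only.
a, h, m = score(["correct", "missing", "hallucinated", "missing"])
print(f"A={a:.1f}% H={h:.1f}% M={m:.1f}%")  # A=25.0% H=25.0% M=50.0%
```

Under this one-label-per-question assumption, a model can lower H simply by answering 'unsure' more often, which raises M, so the metrics are best read together.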
Key Findings
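Best overall QA accuracy on Head-to-Tail is low: GPT-4, the strongest model, reaches 30.9% (Table 3).
Accuracy declines from head to torso to tail entities: GPT-4 scores 40.3%, 33.4%, and 19.0% respectively (Section 3.3).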
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 30.9% | — | — | Head-to-Tail (all) | Table 3: GPT-4 A_LM = 30.9% | Table 3 |
| Accuracy | head 40.3% / torso 33.4% / tail 19.0% | — | head→tail −21.3pp | GPT-4, Head-to-Tail (domain-aggregated) | Section 3.3, GPT-4 per-bucket breakdown | Section 3.3 |
What To Try In 7 Days
Run Head-to-Tail on your LLM to quantify head/torso/tail gaps.
Add a retrieval step (documents or KG) for torso/tail queries and compare accuracy.
Prompt the model to return 'unsure' for low-confidence answers to reduce hallucination in UIs.
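The steps above can be sketched as a small harness: a prompt that allows 'unsure', a simplified rule-based judge, and a per-bucket accuracy report. The prompt wording, the exact-match rule, and all function names are assumptions made for illustration. The benchmark's own prompts and judges may differ.

```python
import re

UNSURE = "unsure"


def build_prompt(question):
    # Hypothetical instruction wording, not the benchmark's actual prompt.
    return (
        "Answer the question in as few words as possible. "
        f"If you are not confident, reply '{UNSURE}'.\n"
        f"Question: {question}\nAnswer:"
    )


def normalize(text):
    # Lowercase, replace punctuation with spaces, and collapse whitespace.
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def judge(response, gold):
    """Simplified exact-match judge: 'missing', 'correct', or 'hallucinated'."""
    answer = normalize(response)
    if not answer or answer == UNSURE:
        return "missing"
    return "correct" if answer == normalize(gold) else "hallucinated"


def accuracy_by_bucket(rows):
    """rows: iterable of (bucket, response, gold). Returns {bucket: accuracy %}."""
    totals, hits = {}, {}
    for bucket, response, gold in rows:
        totals[bucket] = totals.get(bucket, 0) + 1
        hits[bucket] = hits.get(bucket, 0) + int(judge(response, gold) == "correct")
    return {bucket: 100.0 * hits[bucket] / totals[bucket] for bucket in totals}


# Placeholder responses and gold answers, for illustration only.
rows = [
    ("head", "Answer A", "answer a"),
    ("head", "Answer B", "answer a"),
    ("tail", "Unsure.", "answer c"),
]
print(accuracy_by_bucket(rows))
```

Running the same rows with and without a retrieval step gives a direct before-and-after comparison for torso and tail buckets. An exact-match judge undercounts paraphrased correct answers, so a stricter comparison would need a more tolerant matching rule or an LLM-based judge.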
Risks & Boundaries
Limitations
Does not evaluate taxonomy or deep type hierarchies (left as future work).
Benchmark focuses on concise question forms; paraphrase robustness and other query styles are not exhaustively tested.
When Not To Use
Do not rely on raw LLM outputs as the sole source for factual answers on long-tail or critical data.
Do not treat these raw LLM recall numbers as proof of general knowledge; the results are specific to the Head-to-Tail question templates and data snapshots used.
Failure Modes
Hallucination: confidently wrong answers, especially in some base models.
Missing answers: conservative models return 'unsure' and omit useful facts.

