Head-to-Tail: an 18K-question benchmark showing LLMs are far from perfect on factual knowledge, especially long-tail facts.

August 20, 2023 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

12

Authors

Kai Sun, Yifan Ethan Xu, Hanwen Zha, Yue Liu, Xin Luna Dong

Links

Abstract / PDF

Why It Matters For Business

LLMs do not reliably store factual knowledge: product features that assume accurate factual recall (search, knowledge APIs, assistants) should keep symbolic knowledge sources or retrieval layers for long-tail and critical facts.

Summary TLDR

The authors release Head-to-Tail, an 18K-question QA benchmark built from DBpedia, IMDb, Goodreads, MAG and DBLP to measure how much factual knowledge LLMs confidently internalize. Evaluating 16 public LLMs, they find that even the best model (GPT-4) reaches only ~31% accuracy overall, and that accuracy falls from head to torso to tail entities. Instruction tuning, model size, and basic prompting do not reliably fix missing or hallucinated facts. The paper argues for hybrid 'Dual Neural KGs' that keep symbolic triples for long-tail and recent facts and use LLMs for smooth conversation.

Problem Statement

Can we measure how much factual knowledge LLMs actually store inside their parameters, and do LLMs already replace symbolic knowledge graphs for popular and long-tail facts?

Main Contribution

Head-to-Tail: an 18,171-question benchmark stratified by entity popularity (head/torso/tail) across DBpedia, Movie, Book, and Academics domains.

A practical, automated evaluation pipeline with three metrics (accuracy A, hallucination rate H, and missing rate M), scored by both rule-based and LLM-based judges.

A comprehensive evaluation of 16 public LLMs showing clear gaps in factual recall, and a discussion proposing hybrid symbolic+neural knowledge systems.
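To make the three metrics concrete, here is a minimal sketch of how A, H, and M could be computed from judged answers. The label names, the `is_missing` rule (empty or 'unsure' replies count as missing), and the sample data are assumptions for illustration, not the paper's pipeline. Under this labeling every answer falls into exactly one class, so the three rates sum to 1.

```python
from collections import Counter

# Hypothetical label names; the source does not specify an encoding.
CORRECT, HALLUCINATED, MISSING = "correct", "hallucinated", "missing"


def is_missing(answer: str) -> bool:
    """Assumption: an empty reply or a bare 'unsure' counts as a missing answer."""
    normalized = answer.strip().lower().rstrip(".")
    return normalized in {"", "unsure"}


def label_answer(answer: str, judged_correct: bool) -> str:
    """Map one model answer plus a judge verdict to exactly one of three classes."""
    if is_missing(answer):
        return MISSING
    return CORRECT if judged_correct else HALLUCINATED


def rates(labels: list) -> dict:
    """Return accuracy (A), hallucination rate (H), and missing rate (M)."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = len(labels)
    return {
        "A": counts[CORRECT] / total,
        "H": counts[HALLUCINATED] / total,
        "M": counts[MISSING] / total,
    }


# Illustrative (answer, judge verdict) pairs, not benchmark data.
examples = [("Paris", True), ("unsure", False), ("Lyon", False), ("", False)]
labels = [label_answer(answer, ok) for answer, ok in examples]
scores = rates(labels)
```

The judge verdict can come from either a rule-based check or an LLM-based check; only the source of `judged_correct` changes.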

Key Findings

Best overall QA accuracy on Head-to-Tail is low.

Numbers: GPT-4 ALM = 30.9% (Table 3)

Accuracy declines from head to torso to tail entities.

Numbers: GPT-4 ALM: head 40.3% → torso 33.4% → tail 19.0% (domain-aggregated)

Different models show different failure modes (unsure vs hallucinate).

Numbers: GPT-4 HLM ~19.7% and substantial M; LLaMA-33B HLM ≈ 80% (Table 3)

Instruction tuning and larger parameter count do not guarantee higher factual recall.

Numbers: LLaMA-33B only slightly outperforms LLaMA-65B (∆ALM ≈ +0.4%) and instruction-tuned variants can raise missing rate

Prompting for 'unsure' and concise answers reduces hallucination and stabilizes outputs.

Numbers: Removing 'unsure' increases hallucination by ~13 percentage points; brief+unsure reduces regeneration variance to ~1%

LLM-based correctness checks align strongly with rule-based metrics.

Numbers: Spearman ρ (ALM vs AEM) min 0.721, mean 0.915; Pearson r mean ≈ 0.966 (Table 8)
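The agreement check between LLM-based and rule-based scores can be reproduced on your own runs with a few lines of code. The sketch below implements Pearson r and Spearman ρ in plain Python (Spearman as Pearson on average ranks); the per-model scores are made-up illustrative values, not numbers from the paper.

```python
import math


def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: list) -> list:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: list, y: list) -> float:
    """Spearman rank correlation, computed as Pearson on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative per-model scores (hypothetical, one entry per model).
alm_scores = [0.31, 0.20, 0.10, 0.05]  # LLM-judged accuracy
aem_scores = [0.25, 0.18, 0.02, 0.04]  # exact-match accuracy
rho = spearman(alm_scores, aem_scores)
r = pearson(alm_scores, aem_scores)
```

A high ρ means the two judges rank models in nearly the same order, even when the absolute scores differ.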

Results

Accuracy (GPT-4 ALM, overall)

Value: 30.9%

Accuracy (GPT-4 ALM, by popularity bucket)

Value: head 40.3% / torso 33.4% / tail 19.0%

High hallucination in some base models (HLM)

Value: ≈80%

Effect of 'unsure' option on hallucination

Value: ≈13 percentage points reduction

Baseline: prompt without 'unsure'

Agreement between LLM-based and rule-based accuracy

Value: Pearson r mean ≈ 0.966 (ALM vs AEM)

Who Should Care

What To Try In 7 Days

Run Head-to-Tail on your LLM to quantify head/torso/tail gaps.

Add a retrieval step (documents or KG) for torso/tail queries and compare accuracy.

Prompt the model to return 'unsure' for low-confidence answers to reduce hallucination in UIs.
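The last two steps can be combined: ask for a brief answer with an explicit 'unsure' option, and fall back to a retrieval layer only when the model declines. The following is a minimal sketch under stated assumptions. The prompt wording is illustrative rather than the paper's exact prompt, and `ask_llm` and `lookup` are hypothetical callables standing in for your model client and your document or KG lookup.

```python
from typing import Callable, Optional


def build_prompt(question: str) -> str:
    """Brief-answer prompt with an explicit 'unsure' option (illustrative wording)."""
    return (
        "Answer the following question in as few words as possible. "
        "If you are not sure of the answer, reply with exactly 'unsure'.\n"
        f"Question: {question}\nAnswer:"
    )


def is_unsure(reply: str) -> bool:
    """Assumption: an empty reply or a bare 'unsure' means the model declined."""
    return reply.strip().lower().rstrip(".") in {"", "unsure"}


def answer_with_fallback(
    question: str,
    ask_llm: Callable[[str], str],            # hypothetical model client
    lookup: Callable[[str], Optional[str]],   # hypothetical KG/document lookup
) -> dict:
    """Use the LLM answer when it commits; otherwise try the retrieval layer."""
    reply = ask_llm(build_prompt(question))
    if not is_unsure(reply):
        return {"answer": reply.strip(), "source": "llm"}
    fact = lookup(question)
    if fact is not None:
        return {"answer": fact, "source": "retrieval"}
    return {"answer": None, "source": "none"}


# Stubs standing in for a real model and knowledge source.
def stub_llm(prompt: str) -> str:
    return "Unsure." if "obscure" in prompt else "Paris"


stub_kb = {"Who directed the obscure film X?": "Jane Doe"}

head_result = answer_with_fallback("What is the capital of France?", stub_llm, stub_kb.get)
tail_result = answer_with_fallback("Who directed the obscure film X?", stub_llm, stub_kb.get)
```

Logging the `source` field per query makes it straightforward to compare accuracy with and without the retrieval step for each popularity bucket.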

Reproducibility

Data Urls

  • Head-to-Tail (planned release) and sources: DBpedia, IMDb, Goodreads, MAG, DBLP (snapshots referenced in paper)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Does not evaluate taxonomy or deep type hierarchies (left as future work).
  • Benchmark focuses on concise question forms; paraphrase robustness and other query styles are not exhaustively tested.
  • Possible overlap between LLM training data and benchmark items (training-data leakage) not fully controlled.

When Not To Use

  • Do not rely on raw LLM outputs as the sole source for factual answers on long-tail or critical data.
  • Avoid using these raw LLM recall numbers to prove general knowledge—results are specific to the Head-to-Tail templates and snapshots used.

Failure Modes

  • Hallucination: confidently wrong answers, especially in some base models.
  • Missing answers: conservative models return 'unsure' and omit useful facts.
  • Domain gaps: poor accuracy for academic/long-tail domains.

Core Entities

Models

  • GPT-4
  • ChatGPT (gpt-3.5-turbo)
  • LLaMA (7B/13B/33B/65B)
  • Llama 2 (70B)
  • Vicuna (7B/13B)
  • Flan-T5 (3B/11B)
  • RWKV (7B)
  • Falcon (7B/40B)
  • Falcon-Instruct (7B/40B)

Metrics

  • ALM (LLM-judged accuracy)
  • HLM (LLM-judged hallucination rate)
  • M (missing rate)
  • AEM (exact match)
  • AF1 (token F1)
  • ARL (ROUGE-L)
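For readers unfamiliar with the rule-based metrics, here is a minimal sketch of exact match, token F1, and ROUGE-L for a single prediction. The normalization (lowercasing and whitespace tokenization) and the use of the balanced F-measure for ROUGE-L are assumptions; the paper's exact normalization may differ.

```python
from collections import Counter


def tokens(text: str) -> list:
    """Assumed normalization: lowercase, then split on whitespace."""
    return text.lower().split()


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(tokens(pred) == tokens(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall (bag-of-words overlap)."""
    p, g = tokens(pred), tokens(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(pred: str, gold: str) -> float:
    """ROUGE-L as the balanced F-measure over the longest common subsequence."""
    p, g = tokens(pred), tokens(gold)
    lcs = lcs_length(p, g)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(g)
    return 2 * precision * recall / (precision + recall)


em = exact_match("Christopher Nolan", "christopher nolan")
f1 = token_f1("Nolan", "Christopher Nolan")
rl = rouge_l("the film by Nolan", "Christopher Nolan")
```

Averaging each function over all benchmark questions gives the corresponding dataset-level score.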

Datasets

  • Head-to-Tail (18,171 QA)
  • DBpedia (Dec 1, 2022 snapshot)
  • IMDb (May 21, 2023 snapshot)
  • Goodreads (2017 scrape)
  • MAG (Sep 13, 2021)
  • DBLP (May 10, 2023)

Benchmarks

  • Head-to-Tail
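The head/torso/tail stratification that defines the benchmark can be illustrated with a small bucketing sketch. The rule used here, cutting the popularity-sorted entity list where cumulative popularity crosses one third and two thirds of the total, is an assumption for illustration, as are the entity names and scores; this summary only states that questions are stratified by entity popularity.

```python
def bucket_by_popularity(popularity: dict) -> dict:
    """Assign each entity to 'head', 'torso', or 'tail'.

    Assumed rule: sort entities by descending popularity and bucket each one by
    the share of total popularity accumulated *before* it (thirds of the total).
    """
    total = sum(popularity.values())
    if total <= 0:
        raise ValueError("total popularity must be positive")
    buckets = {}
    cumulative = 0.0
    ranked = sorted(popularity.items(), key=lambda kv: kv[1], reverse=True)
    for entity, score in ranked:
        if cumulative < total / 3:
            buckets[entity] = "head"
        elif cumulative < 2 * total / 3:
            buckets[entity] = "torso"
        else:
            buckets[entity] = "tail"
        cumulative += score
    return buckets


# Hypothetical popularity scores (for example, page views).
scores = {"A": 50, "B": 20, "C": 15, "D": 10, "E": 5}
buckets = bucket_by_popularity(scores)
```

With a skewed distribution like this one, a single very popular entity fills the head bucket while most entities land in the tail, which is why tail accuracy matters for coverage.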