Head-to-Tail: an 18K-question benchmark showing LLMs are far from perfect on factual knowledge, especially long-tail facts.

Overview

Decision Snapshot: Needs Validation

The benchmark is comprehensive and well-instrumented; evaluations across 16 public LLMs give consistent patterns, but results depend on prompt choices and training-data overlaps.

Citations: 12

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 60%

Authors

Kai Sun, Yifan Ethan Xu, Hanwen Zha, Yue Liu, Xin Luna Dong

Why It Matters For Business

LLMs do not reliably store factual knowledge: product features that assume accurate factual recall (search, knowledge APIs, assistants) should keep symbolic knowledge sources or retrieval layers for long-tail and critical facts.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

The authors release Head-to-Tail, an 18K QA benchmark built from DBpedia, IMDb, Goodreads, MAG and DBLP to measure how much factual knowledge LLMs confidently internalize. Evaluating 16 public LLMs, they find even the best model (GPT-4) scores ~31% overall and accuracy falls from head -> torso -> tail entities. Instruction tuning, model size, and basic prompting do not reliably fix missing or hallucinated facts. The paper argues for hybrid 'Dual Neural KGs' that keep symbolic triples for long-tail/recent facts and LLMs for smooth conversation.

Problem Statement

Can we measure how much factual knowledge LLMs actually store inside their parameters, and do LLMs already replace symbolic knowledge graphs for popular and long-tail facts?

Main Contribution

Head-to-Tail: an 18,171-question benchmark stratified by entity popularity (head/torso/tail) across DBpedia, Movie, Book, and Academics domains.

A practical, automated evaluation pipeline that reports three metrics, accuracy (A), hallucination rate (H), and missing rate (M), and scores answers with both rule-based and LLM-based judges.
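
The three metrics can be read as simple proportions over judged answers. The snippet below is a minimal illustration, not the authors' pipeline: the label names (`correct`, `hallucinated`, `missing`) and the function `score_judgments` are hypothetical, and it assumes every answer receives exactly one label, so the three rates sum to 1.

```python
from collections import Counter

# Hypothetical label names; the judge is assumed to emit exactly one per answer.
LABELS = ("correct", "hallucinated", "missing")

def score_judgments(judgments):
    """Return accuracy (A), hallucination rate (H), and missing rate (M)."""
    if not judgments:
        raise ValueError("need at least one judged answer")
    counts = Counter(judgments)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(judgments)
    return {
        "A": counts["correct"] / total,
        "H": counts["hallucinated"] / total,
        "M": counts["missing"] / total,
    }

example = ["correct", "missing", "hallucinated", "missing"]
rates = score_judgments(example)
print(rates)
```

Reporting H and M separately matters because a model can raise accuracy-looking behavior by abstaining less, at the cost of more confidently wrong answers.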

Key Findings

Best overall QA accuracy on Head-to-Tail is low.

Numbers: GPT-4 A_LM = 30.9% (Table 3)

Practical Use: Do not assume an LLM reliably knows factual answers; validate facts or add retrieval before using LLMs as authoritative sources.

Evidence Ref: Table 3, Section 3.2

Accuracy declines from head to torso to tail entities.

Numbers: GPT-4 A_LM: head 40.3% → torso 33.4% → tail 19.0% (domain-aggregated)

Practical Use: Expect worse LLM performance on less-popular or long-tail entities; add external knowledge for those cases.

Evidence Ref: Section 3.3, GPT-4 breakdown (figure/table)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	30.9%	—	—	Head-to-Tail (all)	Table 3: GPT-4 A_LM = 30.9%	Table 3
Accuracy	head 40.3% / torso 33.4% / tail 19.0%	—	head→tail −21.3pp	GPT-4, Head-to-Tail (domain-aggregated)	Section 3.3, GPT-4 per-bucket breakdown	Section 3.3
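
The delta in the second row is plain arithmetic on the per-bucket accuracies quoted in the table. A small check, using hypothetical names (`bucket_accuracy`, `delta_pp`), also shows that most of the drop happens between torso and tail:

```python
# Per-bucket GPT-4 accuracies (percent) as quoted in the Results table.
bucket_accuracy = {"head": 40.3, "torso": 33.4, "tail": 19.0}

def delta_pp(scores, start, end):
    """Change in percentage points going from the `start` to the `end` bucket."""
    return round(scores[end] - scores[start], 1)

head_to_torso = delta_pp(bucket_accuracy, "head", "torso")
torso_to_tail = delta_pp(bucket_accuracy, "torso", "tail")
head_to_tail = delta_pp(bucket_accuracy, "head", "tail")
print(head_to_torso, torso_to_tail, head_to_tail)
```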

What To Try In 7 Days

Run Head-to-Tail on your LLM to quantify head/torso/tail gaps.

Add a retrieval step (documents or KG) for torso/tail queries and compare accuracy.

Prompt the model to return 'unsure' for low-confidence answers to reduce hallucination in UIs.
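
The second and third steps can be combined in a small routing sketch: ask for a concise answer or the literal word "unsure", and fall back to retrieval when the model abstains. Everything named here is hypothetical (the prompt wording, `build_prompt`, `is_unsure`, `answer_with_fallback`, and the `llm` and `retriever` callables); it is an illustration under those assumptions, not the prompt used in the paper.

```python
def build_prompt(question):
    """Prompt wording is an assumption, not the paper's exact prompt."""
    return (
        "Answer the question in as few words as possible. "
        "If you are not sure of the answer, reply with exactly: unsure\n"
        f"Question: {question}\nAnswer:"
    )

def is_unsure(response):
    """Treat empty output or the word 'unsure' as an abstention."""
    text = response.strip().lower().rstrip(".!")
    return text == "" or text == "unsure"

def answer_with_fallback(question, llm, retriever):
    """Use the LLM answer unless it abstains; then use the retrieval layer."""
    response = llm(build_prompt(question))
    if is_unsure(response):
        return retriever(question), "retrieval"
    return response.strip(), "llm"

# Stub callables stand in for a real model and a real KG or document lookup.
def fake_llm(prompt):
    return "Unsure." if "obscure" in prompt else "Paris"

def fake_retriever(question):
    return "answer from knowledge graph"

print(answer_with_fallback("What is the capital of France?", fake_llm, fake_retriever))
print(answer_with_fallback("Who directed this obscure film?", fake_llm, fake_retriever))
```

Returning the source tag alongside the answer makes it easy to compare accuracy with and without the retrieval path on head, torso, and tail questions.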

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/facebookresearch/head-to-tail

Data URLs

Head-to-Tail (planned release) and sources: DBpedia, IMDb, Goodreads, MAG, DBLP (snapshots referenced in paper)

Risks & Boundaries

Limitations

Does not evaluate taxonomy or deep type hierarchies (left as future work).

Benchmark focuses on concise question forms; paraphrase robustness and other query styles are not exhaustively tested.

When Not To Use

Do not rely on raw LLM outputs as the sole source for factual answers on long-tail or critical data.

Do not treat these raw recall numbers as a measure of an LLM's general knowledge; the results are specific to the Head-to-Tail templates and snapshots used.

Failure Modes

Hallucination: confidently wrong answers, especially in some base models.

Missing answers: conservative models return 'unsure' and omit useful facts.

Core Entities

Models

GPT-4, ChatGPT (gpt-3.5-turbo), LLaMA (7B/13B/33B/65B), Llama 2 (70B), Vicuna (7B/13B), Flan-T5 (3B/11B), RWKV (7B), Falcon (7B/40B), Falcon-Instruct (7B/40B)

Metrics

Accuracy, H_LM (LLM-judged hallucination rate), M (missing rate), A_EM (exact match), A_F1 (token F1), A_RL (ROUGE-L)
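
For the rule-based metrics, exact match and token F1 are commonly computed as in SQuAD-style QA evaluation. The sketch below follows that common convention; the normalization (lowercasing, stripping punctuation and English articles) is an assumption and may differ from the paper's implementation. ROUGE-L is omitted here.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))

def token_f1(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Godfather", "godfather"))
print(token_f1("Francis Coppola", "Francis Ford Coppola"))
```

Token F1 gives partial credit where exact match gives none, which is one reason a benchmark might report both alongside an LLM-based judge.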

Datasets

Head-to-Tail (18,171 QA); DBpedia (Dec 1, 2022 snapshot); IMDb (May 21, 2023 snapshot); Goodreads (2017 scrape); MAG (Sep 13, 2021); DBLP (May 10, 2023)

Benchmarks

Head-to-Tail
