Hallucinations in LLMs are diverse, theoretically inevitable, and must be managed with grounding and human oversight

August 3, 2025 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Manuel Cossio

Links

Abstract / PDF

Why It Matters For Business

Hallucinations can damage trust, cause legal/financial harm, and break workflows. Because some hallucination is inevitable, companies must design systems that detect, ground, and escalate risky outputs rather than assuming perfect model truthfulness.

Summary TLDR

This 55-page survey formalizes what 'hallucination' means for LLMs, argues (with computability proofs) that some hallucination is inevitable, maps many hallucination subtypes (factual, intrinsic/extrinsic, temporal, ethical, code, multimodal, etc.), surveys benchmarks and metrics, and reviews mitigation patterns (RAG, tool calls, fine-tuning, guardrails, human-in-the-loop). Practical message: you cannot fully eliminate hallucination, so focus on detection, grounding, hybrid safeguards, and human oversight.

Problem Statement

LLMs often generate plausible but incorrect or fabricated content. The field lacks a unified taxonomy and reliable, task-aware evaluation methods. The paper asks: what kinds of hallucination exist, why do they happen, how should we measure them, and how should practitioners mitigate their harms?

Main Contribution

A formal definition of hallucination and theoretical proofs that hallucination is unavoidable for computable LLMs.

A detailed, practical taxonomy that separates intrinsic vs extrinsic and factuality vs faithfulness, followed by many concrete subtypes (temporal, ethical, amalgamated, code, multimodal).

A survey of evaluation datasets and metrics (TruthfulQA, HalluLens, FActScore, Q2/QuestEval, MedHallu, CodeHaluEval, HALLUCINOGEN) and their blind spots.

A catalog of mitigation strategies and a practical recommendation: hybrid systems combining retrieval, tool use, fine-tuning, symbolic guardrails, and human-in-the-loop controls.

A curated list of web resources and leaderboards (Artificial Analysis, Vectara Hallucination Leaderboard, Epoch AI, LM Arena) for monitoring model behavior over time.

Key Findings

Hallucination is provably unavoidable for computable LLMs.

Logical inconsistencies form a non-trivial share of hallucinations (reported 19%).

Numbers: 19% of identified hallucination cases (Section 4.4)

Temporal errors account for a measurable share (reported 12%) and stem from outdated training data.

Numbers: 12% of identified hallucination cases (Section 4.5)

Ethical or harmful hallucinations are present and were measured at ~6% in some analyses.

Numbers: 6% of cases (Section 4.6)

Existing benchmarks and metrics are fragmented and miss subtle hallucinations; lack of standardization is a major limitation.

Combining approaches reduces risk: RAG, tool calls, fine-tuning, and symbolic guardrails work best together.

Model accuracy improves with compute and algorithmic progress; this correlates with lower hallucination tendency.

Numbers: ≈12 pp per 10x compute on GPQA Diamond; ≈17 pp per 10x on MATH L5 (Section 9.3.1)
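To make the "pp per 10x compute" figures concrete, here is a small arithmetic sketch. It assumes the trend is log-linear in compute (gain = slope × log10 of the compute ratio), which is a simplifying assumption for illustration and not a claim from the paper; the function name is hypothetical. Accuracy saturates at 100%, so the line cannot be extrapolated far.

```python
import math


def projected_gain_pp(slope_pp_per_10x: float, compute_ratio: float) -> float:
    """Accuracy gain in percentage points, ASSUMING a log-linear trend.

    slope_pp_per_10x: reported gain per 10x compute (e.g. 12 for GPQA Diamond).
    compute_ratio:    how many times more compute is used (must be > 0).
    """
    if compute_ratio <= 0:
        raise ValueError("compute_ratio must be positive")
    return slope_pp_per_10x * math.log10(compute_ratio)


# 100x more compute is two factors of 10, so roughly 2 * 12 = 24 pp on GPQA Diamond.
print(projected_gain_pp(12, 100))
# 10x more compute reproduces the reported MATH L5 slope of about 17 pp.
print(projected_gain_pp(17, 10))
```

Note that these are accuracy trends; the summary reports only a correlation with lower hallucination tendency, not a direct hallucination-rate formula.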

Results

Logical inconsistency share

Value: 19%

Temporal disorientation share

Value: 12%

Ethical/harmful hallucination share

Value: 6%

Accuracy (GPQA Diamond)

Value: +12 percentage points per 10x compute (approx.)

Accuracy (MATH L5)

Value: +17 percentage points per 10x compute (approx.)

Who Should Care

What To Try In 7 Days

Add a simple retrieval step (search Wikipedia or internal docs) before answering time-sensitive user queries.

Surface source links and timestamps on model claims so users can verify quickly.

Implement a rule-based fallback: when confidence is low, ask for clarification or route to human review instead of inventing answers.
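The three steps above can be sketched together in a few lines of Python. This is a minimal illustration, not the paper's method: the in-memory corpus, the example.com URLs, the word-overlap scoring, and the `min_overlap` threshold are all hypothetical stand-ins for a real search index and a real confidence signal.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    url: str
    updated: date
    text: str


# Hypothetical corpus standing in for Wikipedia or internal docs.
CORPUS = [
    Doc("https://example.com/pricing", date(2025, 7, 1),
        "The Pro plan costs 30 USD per month."),
    Doc("https://example.com/sla", date(2025, 5, 12),
        "Support replies within 24 hours on business days."),
]


def words(text: str) -> set:
    """Lowercased word set; a crude substitute for a real retriever's index."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, corpus: list, min_overlap: int = 2) -> list:
    """Return docs sharing at least `min_overlap` words with the query, best first."""
    q = words(query)
    scored = [(len(q & words(d.text)), d) for d in corpus]
    scored = [s for s in scored if s[0] >= min_overlap]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [d for _, d in scored]


def answer(query: str, corpus: list = CORPUS) -> dict:
    """Ground the reply in a retrieved doc, or escalate instead of guessing."""
    hits = retrieve(query, corpus)
    if not hits:
        # Rule-based fallback: weak evidence is treated as low confidence.
        return {"status": "escalate",
                "message": "No supporting source found; routing to human review."}
    top = hits[0]
    # Surface the source link and its timestamp alongside the evidence.
    return {"status": "grounded", "evidence": top.text,
            "source": top.url, "as_of": top.updated.isoformat()}


print(answer("How much does the Pro plan cost per month?"))
print(answer("Who won the 2030 world cup?"))
```

In a real deployment the retrieved evidence would be passed to the model as context, and the overlap threshold would be replaced by whatever confidence signal the system actually has (retriever score, model self-check, or a verifier).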

Agent Features

Tool Use

  • Toolformer-style API/tool calls
  • External calculator/code execution
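As a rough illustration of calculator delegation, the sketch below lets the model emit a marker and has the host program compute the number instead of trusting generated digits. The `CALC(...)` marker format and both function names are assumptions made for this example; they are not Toolformer's actual call syntax. The evaluator walks the parsed expression rather than calling `eval`, so only plain arithmetic is accepted.

```python
import ast
import operator
import re

# Only these arithmetic operators are allowed; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def calculate(expr: str):
    """Evaluate +, -, *, / on numeric literals without using eval()."""
    def ev(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")

    return ev(ast.parse(expr, mode="eval").body)


def resolve_tool_calls(model_text: str) -> str:
    """Replace each hypothetical CALC(...) marker with the computed value.

    The pattern does not support nested parentheses inside the marker.
    """
    return re.sub(r"CALC\(([^()]*)\)",
                  lambda m: str(calculate(m.group(1))),
                  model_text)


print(resolve_tool_calls("Total: CALC(17 * 23) units"))  # Total: 391 units
```

Rejecting everything outside a small operator whitelist keeps the tool from becoming a code-execution hole; a fuller system would also sandbox any general code-execution tool.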

Optimization Features

System Optimization

  • Hybrid mitigation combining RAG, tools, and guardrails

Training Optimization

  • Fine-tuning with synthetic/adversarially filtered data

Inference Optimization

  • Tool delegation to external services
  • Retrieval grounding to reduce parametric reliance

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Theoretical inevitability is proven in a formal computability setting; practical models and tasks may behave differently.
  • Benchmarks are fragmented and task-dependent; no single metric detects all hallucination types.
  • Empirical statistics cited are aggregated from multiple sources with varying annotation protocols.
  • No shared code/data accompany this taxonomy to reproduce cross-benchmark comparisons.

When Not To Use

  • Do not rely on vanilla LLM outputs alone for high-stakes medical, legal, or financial decisions.
  • Avoid autonomous deployment without retrieval grounding and human oversight for safety-critical tasks.

Failure Modes

  • Overconfidence: fluent but incorrect outputs that mislead users.
  • Adversarial or fabricated prompts that induce false elaboration.
  • Temporal drift: outdated training data leading to stale facts.
  • Knowledge overshadowing and amalgamation that mix unrelated facts.

Core Entities

Models

  • GPT-4/GPT-4o (referenced examples)
  • Google Bard
  • Claude (Anthropic)
  • LLaMA family
  • DeepSeek-R1
  • o3-mini / o4-mini (OpenAI examples)

Metrics

  • ROUGE
  • BLEU
  • BERTScore
  • FactCC
  • SummaC
  • KILT-style grounding checks
  • Retrieval-Augmented Evaluation (RAE)
  • Human correctness/faithfulness/coherence/harmfulness labels

Datasets

  • TruthfulQA
  • HalluLens
  • FActScore
  • Q2 (Quality Questioning)
  • QuestEval
  • MedHallu
  • MedHallBench
  • Med-HALT
  • CodeHaluEval
  • HALLUCINOGEN
  • KILT
  • PubMedQA (source for MedHallu)

Benchmarks

  • TruthfulQA
  • HalluLens
  • FActScore
  • Q2/QuestEval
  • MedHallu/MedHallBench/Med-HALT
  • CodeHaluEval
  • HALLUCINOGEN
  • LM Arena (open battles)
  • Epoch AI benchmarking dashboard
  • Vectara Hallucination Leaderboard
  • Artificial Analysis Intelligence Index