Hallucinations in LLMs are diverse, theoretically inevitable, and must be managed with grounding and human oversight

Overview

Decision Snapshot: Needs Validation

High conceptual clarity and comprehensive literature coverage. Theoretical inevitability claim is strong but depends on the computability formalism; empirical effect sizes vary by task and benchmark.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/7

Findings with evidence refs: 7/7

Results with explicit delta: 0/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 40%

Novelty: 60%

Authors

Manuel Cossio

Links

Abstract / PDF

Why It Matters For Business

Hallucinations can damage trust, cause legal/financial harm, and break workflows. Because some hallucination is inevitable, companies must design systems that detect, ground, and escalate risky outputs rather than assuming perfect model truthfulness.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

This 55-page survey formalizes what 'hallucination' means for LLMs, argues (with computability proofs) that some hallucination is inevitable, maps many hallucination subtypes (factual, intrinsic/extrinsic, temporal, ethical, code, multimodal, etc.), surveys benchmarks and metrics, and reviews mitigation patterns (RAG, tool calls, fine-tuning, guardrails, human-in-the-loop review). Practical message: you cannot fully eliminate hallucination, so focus on detection, grounding, hybrid safeguards, and human oversight.

Problem Statement

LLMs often generate plausible but incorrect or fabricated content. The field lacks a unified taxonomy and reliable, task-aware evaluation methods. The paper asks: what kinds of hallucination exist, why do they happen, how should we measure them, and how should practitioners mitigate their harms?

Main Contribution

A formal definition of hallucination and theoretical proofs that hallucination is unavoidable for computable LLMs.

A detailed, practical taxonomy that separates intrinsic vs extrinsic and factuality vs faithfulness, followed by many concrete subtypes (temporal, ethical, amalgamated, code, multimodal).

Key Findings

Hallucination is provably unavoidable for computable LLMs.

Practical Use: Do not plan for perfect elimination. Build systems that detect and contain hallucinations (retrieval, tool calls, guardrails, human review).

Evidence Ref: Section 2.2 (Theorems 1–3), referencing Xu et al. [100]

Logical inconsistencies form a non-trivial share of hallucinations (reported 19%).

Numbers: 19% of identified hallucination cases (Section 4.4)

Practical Use: Add internal consistency checks and logic validators (e.g., calculation/execution checks) for tasks that need reasoning.

Evidence Ref: Section 4.4 (cites [42;47;34;95])
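A calculation check of the kind mentioned above could be sketched as follows. This is a minimal illustration, not a method from the survey: the function name `find_arithmetic_errors`, the regular expression, and the tolerance are all assumptions, and it only covers simple "a op b = c" claims in generated text.

```python
import operator
import re

# Supported binary operators for the simple "a op b = c" pattern (assumed scope).
_OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

# Matches claims such as "12 * 7 = 84" or "3.5 + 1 = 4.5".
_NUM = r"-?\d+(?:\.\d+)?"
_CLAIM = re.compile(rf"({_NUM})\s*([+\-*/])\s*({_NUM})\s*=\s*({_NUM})")


def find_arithmetic_errors(text, tol=1e-9):
    """Return (claim, expected) pairs for every stated equation that does not hold.

    `expected` is None when the claim divides by zero.
    """
    errors = []
    for match in _CLAIM.finditer(text):
        left, op, right, stated = match.groups()
        if op == "/" and float(right) == 0:
            errors.append((match.group(0), None))
            continue
        expected = _OPS[op](float(left), float(right))
        if abs(expected - float(stated)) > tol:
            errors.append((match.group(0), expected))
    return errors
```

Running the check on a model answer before showing it to the user gives a cheap signal: an empty list means no simple arithmetic claim was contradicted, while any returned pair can trigger a retry or a human review.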

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Logical inconsistency share	19%	—	—	Aggregated hallucination analyses (cited sources)	Section 4.4 (19% of cases; cites [42;47;34;95])	[42;47;34;95]
Temporal disorientation share	12%	—	—	Aggregated hallucination analyses (cited sources)	Section 4.5 (12% of cases; cites [47;51])	[47;51]

What To Try In 7 Days

Add a simple retrieval step (search Wikipedia or internal docs) before answering time-sensitive user queries.

Surface source links and timestamps on model claims so users can verify quickly.

Implement a rule-based fallback: when confidence is low, ask for clarification or route to human review instead of inventing answers.
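The three steps above can be sketched together in a few lines. This is a toy illustration under stated assumptions: the `Doc` type, the word-overlap scorer, and the `min_score` threshold are hypothetical stand-ins (a real system would use a proper search index and a calibrated confidence signal), and the retrieval score is used here as a proxy for confidence.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    source: str    # link or identifier shown to the user
    updated: date  # timestamp surfaced next to the claim
    text: str


def _tokens(text):
    """Lowercased word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, docs, k=1):
    """Rank documents by the fraction of query words they contain (toy scorer)."""
    words = _tokens(query)
    scored = []
    for doc in docs:
        score = len(words & _tokens(doc.text)) / len(words) if words else 0.0
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def answer(query, docs, min_score=0.5):
    """Answer only when grounded evidence is found; otherwise escalate."""
    hits = retrieve(query, docs)
    if not hits or hits[0][0] < min_score:
        # Rule-based fallback: do not invent an answer.
        return {
            "action": "escalate",
            "message": "No reliable source found. Please clarify or wait for human review.",
        }
    score, doc = hits[0]
    return {
        "action": "answer",
        "evidence": doc.text,
        "source": doc.source,
        "updated": doc.updated.isoformat(),
        "score": score,
    }
```

For example, with a document whose text is "The refund window is 30 days from delivery", the query "What is the refund window?" returns an `answer` action carrying the source and its update date, while an unrelated query falls below the threshold and is escalated.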

Agent Features

Tool Use

Toolformer-style API/tool calls

External calculator/code execution
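As a rough sketch of what delegating arithmetic to an external calculator might look like, the snippet below evaluates an expression with Python's standard `ast` module instead of letting the model produce the number itself. The `calculate` and `dispatch` names and the `TOOLS` registry are hypothetical, and this is not Toolformer's actual interface; it only illustrates the delegation pattern.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


# Hypothetical tool registry: the model emits a tool name and an argument,
# and the host program runs the tool and feeds the result back.
TOOLS = {"calculator": calculate}


def dispatch(tool_name, argument):
    """Run a registered tool, raising KeyError for unknown tool names."""
    return TOOLS[tool_name](argument)
```

Walking the syntax tree with a whitelist means function calls, names, and attribute access are refused, so a malformed or adversarial tool argument raises an error rather than executing arbitrary code.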

Optimization Features

System Optimization

Hybrid mitigation combining RAG, tools, and guardrails

Training Optimization

Fine-tuning with synthetic/adversarially filtered data

Inference Optimization

Tool delegation to external services

Retrieval grounding to reduce parametric reliance

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Theoretical inevitability is proven in a formal computability setting; practical models and tasks may behave differently.

Benchmarks are fragmented and task-dependent; no single metric detects all hallucination types.

When Not To Use

Do not rely on vanilla LLM outputs alone for high-stakes medical, legal, or financial decisions.

Avoid autonomous deployment without retrieval grounding and human oversight for safety-critical tasks.

Failure Modes

Overconfidence: fluent but incorrect outputs that mislead users.

Adversarial or fabricated prompts that induce false elaboration.

Core Entities

Models

GPT-4/GPT-4o (referenced examples), Google Bard, Claude (Anthropic), LLaMA family, DeepSeek-R1, o3-mini / o4-mini (OpenAI examples)

Metrics

ROUGE, BLEU, BERTScore, FactCC, SummaC, KILT-style grounding checks, Retrieval-Augmented Evaluation (RAE), Human correctness/faithfulness/coherence/harmfulness labels

Datasets

TruthfulQA, HalluLens, FActScore, Q2 (Quality Questioning), QuestEval, MedHallu, MedHallBench, Med-HALT, CodeHaluEval, HALLUCINOGEN, KILT, PubMedQA (source for MedHallu)

Benchmarks

TruthfulQA, HalluLens, FActScore, Q2/QuestEval, MedHallu/MedHallBench/Med-HALT, CodeHaluEval, HALLUCINOGEN, LM Arena (open battles), Epoch AI benchmarking dashboard, Vectara Hallucination Leaderboard, Artificial Analysis Intelligence Index

You May Also Want to Read

PanCanBench: 282 real patient questions + 3,130 expert rubrics to test LLM clinical completeness and factuality

Bi'an: a bilingual RAG hallucination benchmark plus small fine-tuned judge models

LLMs misjudge mixed-context hallucinations: external retrieval helps but factual cases remain hard

MultiHal: a multilingual, Wikidata-grounded benchmark that uses KG paths to evaluate and reduce LLM hallucinations

DiaHalu: 1,103 multi-turn dialogues to test hallucination in chat-style LLMs