Hallucinations in LLMs are diverse, theoretically inevitable, and must be managed with grounding and human oversight

August 3, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

High conceptual clarity and comprehensive literature coverage. Theoretical inevitability claim is strong but depends on the computability formalism; empirical effect sizes vary by task and benchmark.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/7

Findings with evidence refs: 7/7

Results with explicit delta: 0/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 40%

Novelty: 60%

Authors

Manuel Cossio

Why It Matters For Business

Hallucinations can damage trust, cause legal/financial harm, and break workflows. Because some hallucination is inevitable, companies must design systems that detect, ground, and escalate risky outputs rather than assuming perfect model truthfulness.

Summary TLDR

This 55-page survey formalizes what 'hallucination' means for LLMs and argues, with computability proofs, that some hallucination is inevitable. It maps hallucination subtypes (factual, intrinsic/extrinsic, temporal, ethical, code, multimodal, and others), surveys benchmarks and metrics, and reviews mitigation patterns (RAG, tool calls, fine-tuning, guardrails, human-in-the-loop review). The practical message: you cannot fully eliminate hallucination, so focus on detection, grounding, hybrid safeguards, and human oversight.

Problem Statement

LLMs often generate plausible but incorrect or fabricated content. The field lacks a unified taxonomy and reliable, task-aware evaluation methods. The paper asks: what kinds of hallucination exist, why do they happen, how should we measure them, and how should practitioners mitigate their harms?

Main Contribution

A formal definition of hallucination and theoretical proofs that hallucination is unavoidable for computable LLMs.

A detailed, practical taxonomy that separates intrinsic vs extrinsic and factuality vs faithfulness, followed by many concrete subtypes (temporal, ethical, amalgamated, code, multimodal).

Key Findings

Hallucination is provably unavoidable for computable LLMs.

Practical Use: Do not plan for perfect elimination. Build systems that detect and contain hallucinations (retrieval, tool calls, guardrails, human review).

Evidence Ref: Section 2.2 (Theorems 1–3) referencing Xu et al. [100]

Logical inconsistencies form a non-trivial share of hallucinations (reported 19%).

Numbers: 19% of identified hallucination cases (Section 4.4)

Practical Use: Add internal consistency checks and logic validators (e.g., calculation/execution checks) for tasks that need reasoning.

Evidence Ref: Section 4.4 (cites [42;47;34;95])
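
The calculation check suggested in this finding can be sketched in a few lines. This is a minimal illustration, not a method from the survey: the function name `find_inconsistent_claims`, the regular expression, and the tolerance are all assumptions, and it only covers simple two-operand claims of the form "a op b = c" found in model output.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: matches simple claims such as "12 + 7 = 19" or "3.5 * 2 = 7".
_CLAIM = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)"
)

_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def find_inconsistent_claims(text: str, tol: float = 1e-9) -> List[Tuple[str, float]]:
    """Return (claim_text, recomputed_value) for each arithmetic claim that does not hold."""
    failures = []
    for match in _CLAIM.finditer(text):
        left, op, right, stated = match.groups()
        if op == "/" and float(right) == 0.0:
            # Division by zero can never equal the stated value; flag it.
            failures.append((match.group(0), float("nan")))
            continue
        actual = _OPS[op](float(left), float(right))
        if abs(actual - float(stated)) > tol:
            failures.append((match.group(0), actual))
    return failures


if __name__ == "__main__":
    draft = "Revenue grew: 120 + 30 = 150, and 7 * 8 = 54."
    for claim, actual in find_inconsistent_claims(draft):
        print(f"Inconsistent claim {claim!r}; recomputed value is {actual}")
```

A check like this catches only explicit arithmetic slips. Multi-step reasoning errors would need heavier validators such as code execution or a second-pass verifier.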

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Logical inconsistency share | 19% | - | - | Aggregated hallucination analyses (cited sources) | Section 4.4 (19% of cases; cites [42;47;34;95]) | [42;47;34;95]
Temporal disorientation share | 12% | - | - | Aggregated hallucination analyses (cited sources) | Section 4.5 (12% of cases; cites [47;51]) | [47;51]

What To Try In 7 Days

Add a simple retrieval step (search Wikipedia or internal docs) before answering time-sensitive user queries.

Surface source links and timestamps on model claims so users can verify quickly.

Implement a rule-based fallback: when confidence is low, ask for clarification or route to human review instead of inventing answers.
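
The three steps above (retrieve first, surface the source and timestamp, fall back when confidence is low) can be sketched together. Everything named here is hypothetical: the `Doc` record, the toy keyword retriever, the `min_score` threshold, and the sample document are stand-ins for illustration. The "confidence" is only a word-overlap score from retrieval, not a calibrated model confidence.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Doc:
    text: str
    source: str    # where the claim came from, shown to the user
    updated: date  # timestamp shown next to the claim


def _tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, corpus: List[Doc]) -> Optional[Tuple[Doc, float]]:
    """Toy retriever: fraction of query words that also appear in the document."""
    q_words = _tokens(query)
    best, best_score = None, 0.0
    for doc in corpus:
        score = len(q_words & _tokens(doc.text)) / max(len(q_words), 1)
        if score > best_score:
            best, best_score = doc, score
    return (best, best_score) if best is not None else None


def answer(query: str, corpus: List[Doc], min_score: float = 0.5) -> dict:
    """Answer from retrieved text, or escalate instead of inventing an answer."""
    hit = retrieve(query, corpus)
    if hit is None or hit[1] < min_score:
        # Rule-based fallback: ask for clarification or route to human review.
        return {"action": "escalate", "message": "Not enough grounded evidence."}
    doc, score = hit
    return {
        "action": "answer",
        "text": doc.text,
        "source": doc.source,
        "updated": doc.updated.isoformat(),
        "score": round(score, 2),
    }


if __name__ == "__main__":
    corpus = [Doc("The current CEO of Acme is Jane Doe",
                  "internal-docs/acme-leadership.md", date(2025, 7, 1))]
    print(answer("Who is the current CEO of Acme?", corpus))
    print(answer("What is the weather on Mars tomorrow?", corpus))
```

In a real system the retriever would be a search index or vector store and the answer text would come from a model conditioned on the retrieved passage. The escalate-versus-answer decision can stay this simple.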

Agent Features

Tool Use
Toolformer-style API/tool calls
External calculator/code execution
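
As a rough illustration of delegating arithmetic to an external calculator, the sketch below replaces inline call markers in a model draft with computed results. The `[Calculator(...)]` marker is a simplified, hypothetical format loosely inspired by Toolformer-style calls, not the exact syntax of any system. `calculate` and `resolve_tool_calls` are names made up for this example.

```python
import ast
import operator
import re

# Only plain arithmetic is allowed; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression: str) -> float:
    """Evaluate a basic arithmetic expression without calling eval()."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    return _eval(ast.parse(expression.strip(), mode="eval"))


# Hypothetical inline marker emitted by the model: [Calculator(<expression>)]
_CALL = re.compile(r"\[Calculator\(([^\]]*)\)\]")


def resolve_tool_calls(draft: str) -> str:
    """Replace each calculator marker in the draft with the computed value."""
    return _CALL.sub(lambda m: format(calculate(m.group(1)), "g"), draft)


if __name__ == "__main__":
    print(resolve_tool_calls("The total is [Calculator(17 * 23)] units."))
```

Walking the parsed syntax tree instead of calling `eval` keeps the tool limited to arithmetic, so a malformed or adversarial marker raises an error rather than executing arbitrary code.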

Optimization Features

System Optimization
Hybrid mitigation combining RAG, tools, and guardrails
Training Optimization
Fine-tuning with synthetic/adversarially filtered data
Inference Optimization
Tool delegation to external services
Retrieval grounding to reduce parametric reliance

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Theoretical inevitability is proven in a formal computability setting; practical models and tasks may behave differently.

Benchmarks are fragmented and task-dependent; no single metric detects all hallucination types.

When Not To Use

Do not rely on vanilla LLM outputs alone for high-stakes medical, legal, or financial decisions.

Avoid autonomous deployment without retrieval grounding and human oversight for safety-critical tasks.

Failure Modes

Overconfidence: fluent but incorrect outputs that mislead users.

Adversarial or fabricated prompts that induce false elaboration.

Core Entities

Models

GPT-4/GPT-4o (referenced examples)
Google Bard
Claude (Anthropic)
LLaMA family
DeepSeek-R1
o3-mini / o4-mini (OpenAI examples)

Metrics

ROUGE
BLEU
BERTScore
FactCC
SummaC
KILT-style grounding checks
Retrieval-Augmented Evaluation (RAE)
Human correctness/faithfulness/coherence/harmfulness labels

Datasets

TruthfulQA
HalluLens
FActScore
Q2 (Quality Questioning)
QuestEval
MedHallu
MedHallBench
Med-HALT
CodeHaluEval
HALLUCINOGEN
KILT
PubMedQA (source for MedHallu)

Benchmarks

TruthfulQA
HalluLens
FActScore
Q2/QuestEval
MedHallu/MedHallBench/Med-HALT
CodeHaluEval
HALLUCINOGEN
LM Arena (open battles)
Epoch AI benchmarking dashboard
Vectara Hallucination Leaderboard
Artificial Analysis Intelligence Index