Hallucination Detection Papers — Parsed & Scored for Practitioners

A broad third-party benchmark shows ChatGPT is a strong zero-shot performer but an unreliable reasoner and prone to hallucination

0.70

0.25

0.75

352

ChatGPT is a practical zero-shot workhorse: it saves time on many tasks and can replace some fine-tuned models for quick proofs of concept, but its factual and reasoning errors mean you must validate outputs before customer-facing or safety-critical use.

Key finding

ChatGPT often outperforms prior zero-shot LLMs.

Numbers: 9/13 evaluated datasets (zero-shot comparisons)

Practical survey of why LLMs hallucinate, how we measure it, and what fixes work today

0.70

0.50

0.60

233

Hallucinations create real risks (misinformation, legal/medical errors, loss of trust). Businesses should treat factuality as a first-class metric when deploying LLMs in production.

Key finding

Hallucination is multi-dimensional: input-, context-, and fact-conflicting types require different tests and fixes.

Practical survey: taxonomy, causes, detection, benchmarks, and fixes for hallucination in LLMs

0.70

0.40

0.60

207

Hallucinations make LLM outputs untrustworthy for decisions or customer-facing answers; mapping causes to fixes helps reduce risk in search, chatbots, and recommendations.

Key finding

The paper redefines hallucination for LLMs into two main types: factuality (real-world fact mismatch) and faithfulness (deviation from instructions or context).

Augment ChatGPT with retrieved evidence and automated feedback to cut hallucinations

0.60

0.55

0.45

144

You can keep using a black-box LLM while reducing harmful hallucinations by adding retrieval, evidence consolidation, and automated feedback—improving factuality with modest engineering instead of costly fine-tuning.

Key finding

Retrieving consolidated evidence raises knowledge grounding (KF1) by about +10 points on news dialog.

Numbers: KF1: 26.71 -> 36.41 (ChatGPT -> LLM-AUGMENTER, News Chat, Table 1)

A systematic benchmark showing where GPT-style LLMs help — and where they fail — on practical chemistry tasks

0.40

0.35

0.50

91

LLMs can speed up human-in-the-loop chemistry tasks (text descriptions, candidate generation, reagent ranking) with few-shot prompts, but they are not yet reliable drop-in replacements for specialized models or automation pipelines where exact SMILES or reaction outcomes are needed.

Key finding

GPT-4 ranks best across the eight chemistry tasks.

Numbers: Average rank: GPT-4 = 1.25 (Table 2).

Practical survey of what makes LLMs factual, how we test it, and how to fix it

0.60

0.40

0.60

52

LLMs are useful but make verifiable mistakes; businesses must add retrieval, verification, or domain tuning before using LLM outputs in advice, legal, medical, or financial workflows.

Key finding

Off-the-shelf LLMs often have low factual precision on long-form biographical text.

Numbers: FActScore range 42%–71% for commercial LLMs on biographies

Black-box prompts plus sampling help, but LLMs stay overconfident and struggle to predict failures

0.40

0.45

0.55

49

When deploying LLMs, naive verbalized confidence is unsafe: models often claim 80–100% confidence even when wrong, so sample several answers, aggregate them, and validate calibration before trusting outputs.

Key finding

LLMs' verbalized confidences are heavily skewed toward high values (80–100%), a sign of overconfidence.

Numbers: confidence values mostly in 80–100% range; many expressed in multiples of 5
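A minimal sketch of the sampling-plus-aggregation idea: ask the model the same question several times and use the agreement rate among the sampled answers as the confidence signal, instead of the number the model states. The function name `consistency_confidence` and the lowercase/strip normalization are illustrative assumptions, not the paper's exact aggregation method.

```python
from collections import Counter


def consistency_confidence(samples):
    """Return (majority_answer, agreement_fraction) for a list of sampled answers.

    Assumption: answers are short strings that can be compared after
    lowercasing and stripping whitespace. Real answers usually need a
    task-specific normalizer or a semantic-equivalence check.
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    normalized = [s.strip().lower() for s in samples]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


# Hypothetical answers from five sampled generations of the same prompt.
samples = ["Paris", "paris", "Lyon", "Paris ", "Paris"]
answer, confidence = consistency_confidence(samples)
print(answer, confidence)  # paris 0.8
```

An agreement fraction is only a proxy for correctness, so it still needs checking against labeled data before it gates any customer-facing output.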

Find and fix contradictions in an LLM's own text without web lookups

0.70

0.60

46

Automate contradiction checks to catch internal hallucinations that retrieval misses, improving trust in long-form outputs and answers with modest extra cost.

Key finding

Self-contradictions are common in open-domain generations.

Numbers: 17.7% of sentences for ChatGPT (MainTestSet)

Systematic benchmark: GPT-series and LLaMA variants vs. fine-tuned BioNLP models across 12 biomedical tasks

0.60

0.40

0.70

41

If you need high-accuracy extraction or classification in biomedical text, fine-tuned domain models remain the practical choice; use GPT-4 for reasoning or prototyping high-level QA but budget for much higher inference costs and add output validation.

Key finding

Fine-tuned, domain-specific models still outperform zero- and few-shot LLMs on most BioNLP tasks.

Numbers: Macro-average: SOTA fine-tuned 0.6536 vs. best LLM zero/few-shot ~0.51

Automated audit finds many medical LLM answers lack supporting sources

0.60

0.55

0.65

39

LLMs used in healthcare often cite sources that do not back their claims. That creates legal, safety, and trust risks for any product that displays model-cited medical advice.

Key finding

GPT-4 as a verifier closely matches doctors when checking if a source supports a statement.

Numbers: 88.0% agreement (N=284) vs doctor consensus

ToolQA — a benchmark that forces LLMs to use external tools, not memorized facts

0.30

0.40

0.20

39

If your product must use live or private data, you need tested tool integration and source selection; relying on a base LLM risks wrong or outdated answers.

Key finding

Standard LLMs that do not use external tools fail on ToolQA.

Numbers: ChatGPT avg success: 5.6% (easy), ~2% (hard)

Survey of how LLMs produce and spread factual errors—and what to do about it

0.40

0.35

0.55

33

LLMs can produce plausible-sounding falsehoods and leak sensitive inputs; unchecked use creates legal, reputational, and operational risk for any organization that relies on automated text.

Key finding

During COVID-era chatbot use, health topics were very common: 30% of 6,594 user-chatbot interactions used the keyword 'COVID-19'.

Numbers: 30% of 6,594 interactions

Detect hallucinated facts from any black‑box LLM by sampling its own alternative outputs

0.60

0.70

0.45

33

You can flag likely false claims from closed-source LLMs without buying or building knowledge bases; this reduces misinformation risk in customer-facing text generation.

Key finding

Prompt-based SelfCheckGPT achieved the strongest results at both sentence and passage levels.

Numbers: Sentence AUC-PR (NonFact)=93.42; Passage Pearson=78.32 (Table 2)
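The sampling idea can be sketched as follows: draw several extra generations for the same prompt, then score each sentence of the main answer by the fraction of samples that fail to support it. The names `selfcheck_scores` and `all_words_present` are hypothetical. The word-matching judge is a toy stand-in so the example runs offline; the prompt-based variant reported above would use an LLM call as the judge.

```python
def selfcheck_scores(sentences, samples, is_supported):
    """Score each sentence by the fraction of samples that do NOT support it.

    0.0 means every sample agrees with the sentence; 1.0 means none does.
    `is_supported(sentence, sample) -> bool` is a pluggable judge.
    """
    if not samples:
        raise ValueError("need at least one sampled passage")
    scores = []
    for sentence in sentences:
        unsupported = sum(1 for s in samples if not is_supported(sentence, s))
        scores.append(unsupported / len(samples))
    return scores


def all_words_present(sentence, sample):
    """Toy judge (assumption): supported only if every word of the sentence
    also appears in the sample. Far cruder than an LLM or NLI judge."""
    def words(text):
        return {w.strip(".,").lower() for w in text.split()}
    return words(sentence) <= words(sample)


# Hypothetical main answer (two sentences) and three sampled passages.
sentences = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie was born in Berlin.",
]
samples = [
    "Marie Curie was born in Warsaw in 1867.",
    "Curie, born in Warsaw, won two Nobel Prizes.",
    "Marie Curie was born in Warsaw, Poland.",
]
scores = selfcheck_scores(sentences, samples, all_words_present)
print(scores)  # the second sentence gets the higher (more suspicious) score
```

Keeping the judge pluggable means the toy check can be replaced by a stronger one without touching the scoring loop.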

Top legal AI tools still hallucinate: 17–33% of answers are false or misleading

0.45

0.25

0.30

32

Major legal AI products still produce false or misleading legal claims often enough that lawyers must verify outputs, which affects liability, trust, and the realized efficiency gains.

Key finding

Lexis+ AI provided accurate (correct + grounded) answers for 65% of queries.

Numbers: 65% accuracy (Figure 4; Section 6.1)

Taxonomy and lightweight mitigation of hallucinations in LLM-generated code

0.50

0.60

30

Hallucinations in LLM-generated code often break functionality and raise debugging, maintenance, and security costs; detecting and reducing them yields outsized gains in correctness without retraining models.

Key finding

Code hallucinations are common: 1,134 of 3,120 samples (about 36%) contained hallucinations.

Numbers: 1,134/3,120 samples; 1,212 hallucinatory snippets

HaluEval: 35k test cases (human + synthetic) to measure whether LLMs spot made-up facts

0.70

0.60

0.50

29

Models can produce believable but false facts. That creates risk for customer-facing apps, search, and decision tools. HaluEval lets you measure how often your model fabricates facts and whether it can flag them.

Key finding

ChatGPT produces unverifiable or conflicting statements in a sizable fraction of real user responses.

Numbers: 977 of 5,000 annotated responses (19.5%)

HaELM: an LLM-based, low-cost evaluator to detect and analyze hallucinations in vision-language models

0.50

0.60

0.70

26

Hallucinations make multimodal systems unreliable and risky. HaELM offers a cheaper, local way to measure hallucination and run repeated checks without sending data to external APIs.

Key finding

Object-query tests trigger affirmation bias: models answer "yes" more than 80% of the time for absent objects, yet hallucination in actual captions is below 10%.

Numbers: AY >80%; CH <10% (Figure 2, Appendix Tables 9-11)

KoLA: a focused, evolving benchmark that measures LLM world knowledge and flags hallucinated creations

0.70

0.50

0.60

24

KoLA gives a practical, evolving way to compare models on factual recall, understanding, reasoning, and creation while flagging hallucinated facts automatically—helpful when choosing models for QA, knowledge work, or content generation.

Key finding

Model size strongly predicts memorization for non-aligned models.

Numbers: Spearman ρ = 0.79 between KM rank and model size (non-aligned models)

Woodpecker: a training-free post-hoc pipeline that finds and fixes image hallucinations with vision experts

0.60

0.70

22

You can reduce image-based hallucinations and raise trust without retraining models by adding a post-hoc verifier that extracts claims, checks them with detectors/VQA, and rewrites outputs with bounding-box evidence.

Key finding

Applying Woodpecker to MiniGPT-4 increased POPE object-existence accuracy from 54.67% to 85.33%.

Numbers: 54.67% → 85.33% (Δ +30.66)

KaLMA + BioKaLMA: benchmark and metrics to attribute LLM outputs to knowledge graphs

0.45

0.60

0.40

22

Attributing LLM outputs to structured KGs and marking missing facts ([NA]) makes generated content more verifiable and helps reduce risk in finance, law, and healthcare where factual traceability matters.

Key finding

The benchmark contains 1,085 entries, with an average of 6.8 knowledge-graph facts per question.

Numbers: 1,085 entries; avg 6.8 KG facts per question

Pretraining memory and corpus-frequency biases drive much of LLM hallucination on inference tasks

0.30

0.50

0.20

18

LLMs can assert conclusions drawn from their training data or corpus statistics rather than the given context. That puts QA, summarization, and policy extraction at risk of silent misinformation; apply attestation checks and bias-controlled tests before deployment.

Key finding

Attestation (memorized sentence) strongly raises false positive entailments.

Numbers: a false Entail prediction is 1.9x (LLaMA), 2.2x (GPT-3.5), 2.0x (PaLM) as likely when the hypothesis is attested

At decode time, subtract earlier-layer logits from later-layer logits to reduce hallucinations

0.70

0.55

0.15

17

DoLa boosts factual output from large pretrained LMs without retraining or external retrieval, giving immediate, low-cost improvements for truth-sensitive products like QA assistants and chatbots.

Key finding

DoLa raises combined truthfulness×informativeness on open-ended TruthfulQA by about 12–17 absolute percentage points for LLaMA models.

Numbers: 12–17 pp improvement on %Truth*Info across LLaMA sizes (Table 1)
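A simplified sketch of the layer-contrast step for one decoding position. It assumes you already have next-token logits from a late layer and from one fixed earlier layer; it contrasts log-probabilities rather than raw logits and only keeps tokens the late layer finds plausible. The function names, the `alpha` cutoff value, and the fixed choice of early layer are assumptions for illustration; the full method also chooses the early layer dynamically, which is omitted here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrast_layers(late_logits, early_logits, alpha=0.1):
    """Return per-token scores: late log-prob minus early log-prob.

    Tokens whose late-layer probability is below alpha times the top
    late-layer probability are masked to -inf, so the contrast cannot
    promote tokens the final layer itself considers implausible.
    """
    late = log_softmax(late_logits)
    early = log_softmax(early_logits)
    cutoff = max(late) + math.log(alpha)
    return [
        l - e if l >= cutoff else float("-inf")
        for l, e in zip(late, early)
    ]


# Toy three-token vocabulary (made-up numbers): the early layer already
# strongly prefers token 0, while token 1 gains most of its probability
# in later layers.
late_logits = [2.0, 1.9, -3.0]
early_logits = [2.0, 0.0, -3.0]
scores = contrast_layers(late_logits, early_logits)
print(scores.index(max(scores)))  # 1
```

In the toy case, plain greedy decoding on the late layer would pick token 0, while the contrast picks token 1, the token whose probability grew most between the two layers.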

Med-HALT: a public benchmark that tests LLM hallucinations on medical multiple-choice and PubMed retrieval tasks

0.20

0.55

0.30

17

If you plan to use LLMs for medical content or literature retrieval, expect frequent confident errors unless you add external retrieval, verification, or human oversight; Med‑HALT lets you measure that risk quantitatively.

Key finding

No model achieved clinical-grade accuracy on reasoning hallucination tests.

Numbers: Llama‑2 70B Reasoning FCT accuracy 42.21% (Table 2)

Use an iterative generate-score-refine loop to cut hallucinated answers from medical LLMs

0.30

0.55

0.25

17

Adding an iterative generate-score-refine step reduces irrelevant and factually inconsistent medical answers, lowering risk and improving trust for AI assistants used in healthcare workflows.

Key finding

Iterative self-reflection raises Med-NLI sample entailment scores across models on PubMedQA.

Numbers: Vicuna: 0.4684 -> 0.6380 (+0.1696); ChatGPT: 0.5850 -> 0.6824 (+0.0974)
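The generate-score-refine loop can be sketched as a small control loop with three pluggable steps. The callables `generate`, `score`, and `refine`, along with the `threshold` and `max_rounds` values, are hypothetical placeholders; in practice they would wrap model calls and a scorer such as an entailment or factuality check, and the paper's exact stopping rule may differ.

```python
def generate_score_refine(question, generate, score, refine,
                          threshold=0.8, max_rounds=3):
    """Iteratively refine an answer until its score reaches the threshold.

    Keeps the best-scoring answer seen so far, so a bad refinement can
    never make the returned answer worse than an earlier one.
    """
    best_answer = generate(question)
    best_score = score(question, best_answer)
    for _ in range(max_rounds):
        if best_score >= threshold:
            break
        candidate = refine(question, best_answer, best_score)
        candidate_score = score(question, candidate)
        if candidate_score > best_score:
            best_answer, best_score = candidate, candidate_score
    return best_answer, best_score


# Toy stand-ins (assumptions) so the loop runs without any model.
def toy_generate(question):
    return "draft"


def toy_refine(question, answer, current_score):
    return answer + "+"


def toy_score(question, answer):
    # Each "+" marks one refinement; score rises by 0.2 per refinement.
    return min(1.0, (5 + 2 * answer.count("+")) / 10)


answer, final_score = generate_score_refine(
    "example question", toy_generate, toy_score, toy_refine
)
print(answer, final_score)  # draft++ 0.9
```

Capping the number of rounds bounds latency and cost, since each round adds at least one generation call and one scoring call.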