Faithfulness Evaluation Papers — Parsed & Scored for Practitioners

Practical survey: taxonomy, causes, detection, benchmarks, and fixes for hallucination in LLMs

0.70

0.40

0.60

207

Hallucinations make LLM outputs untrustworthy for decisions or customer-facing answers; mapping causes to fixes helps reduce risk in search, chatbots, and recommendations.

Key finding

The paper redefines hallucination for LLMs as two main types: factuality (mismatch with real-world facts) and faithfulness (deviation from instructions or context).

Ragas: reference-free checks for RAG faithfulness, relevance, and context focus

0.60

0.50

65

Provides fast, automated checks to catch ungrounded answers and noisy retrieval, reducing time spent on manual labeling and lowering hallucination risk in RAG deployments.

Key finding

Ragas matches human judgements on faithfulness with very high accuracy.

Numbers: Faithfulness accuracy 0.95 on WikiEval (Table 1)
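A reference-free faithfulness check can be sketched as the fraction of answer claims that the retrieved context supports. The sketch below is illustrative only: this listing does not spell out how Ragas computes its score, and the names `faithfulness_score` and `naive_is_supported` are hypothetical. The substring matcher is a stand-in for the LLM or NLI judge a real system would use.

```python
from typing import Callable, List


def naive_is_supported(claim: str, context: str) -> bool:
    """Toy support check (assumption): case-insensitive substring match."""
    return claim.lower() in context.lower()


def faithfulness_score(
    claims: List[str],
    context: str,
    is_supported: Callable[[str, str], bool] = naive_is_supported,
) -> float:
    """Return the share of claims judged as supported by the context."""
    if not claims:
        raise ValueError("at least one claim is required")
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


context = "The Eiffel Tower is in Paris. It opened in 1889."
claims = ["the eiffel tower is in paris", "it opened in 1900"]
score = faithfulness_score(claims, context)
print(score)
```

Passing the judge as a callable keeps the scoring logic independent of whichever model decides support.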

ChatGPT is weak at standard supervised IE but surprisingly strong at open extraction, explains itself well, yet is overconfident

0.50

0.40

59

ChatGPT can propose high-quality extraction candidates and readable explanations without labels, but it is not a drop-in replacement for supervised IE when you need precise, well-calibrated automated extraction.

Key finding

ChatGPT underperforms supervised baselines on Standard-IE tasks.

Numbers: Standard-IE full-test Micro-F1: ChatGPT is often far below SOTA (e.g., event extraction trigger/argument 16.6/7.8 vs SOTA ~72/56).

Black-box prompts plus sampling help, but LLMs stay overconfident and struggle to predict failures

0.40

0.45

0.55

49

When deploying LLMs, naive verbalized confidence is unsafe: models often claim 80–100% confidence even when wrong, so use sampling + aggregation and validate calibration before trusting outputs.

Key finding

LLMs output verbalized confidences heavily skewed to high values (80–100%), causing overconfidence.

Numbers: confidence values mostly in 80–100% range; many expressed in multiples of 5
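The "sampling + aggregation" advice can be illustrated with a simple majority vote: sample the model several times and use the agreement rate as the confidence instead of the number the model states. This is a minimal sketch under that assumption, not the paper's exact aggregation scheme; `consistency_confidence` is a hypothetical name and the normalization (strip and lowercase) is deliberately crude.

```python
from collections import Counter
from typing import List, Tuple


def consistency_confidence(samples: List[str]) -> Tuple[str, float]:
    """Return the most frequent normalized answer and its share of the samples."""
    if not samples:
        raise ValueError("at least one sampled answer is required")
    counts = Counter(answer.strip().lower() for answer in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


answer, confidence = consistency_confidence(["Paris", "paris ", "Lyon", "Paris"])
print(answer, confidence)
```

The resulting score still needs to be checked against held-out labeled data before it is trusted, as the summary above advises.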

Find and fix contradictions in an LLM's own text without web lookups

0.70

0.60

46

Automate contradiction checks to catch internal hallucinations that retrieval misses, improving trust in long-form outputs and answers with modest extra cost.

Key finding

Self-contradictions are common in open-domain generations.

Numbers: 17.7% of sentences for ChatGPT (MainTestSet)

Automated audit finds many medical LLM answers lack supporting sources

0.60

0.55

0.65

39

LLMs used in healthcare often cite sources that do not back their claims. That creates legal, safety, and trust risks for any product that displays model-cited medical advice.

Key finding

GPT-4 as a verifier closely matches doctors when checking if a source supports a statement.

Numbers: 88.0% agreement (N=284) vs doctor consensus
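An agreement figure like the one above is typically a plain percent-agreement between the verifier's labels and the reference labels. The sketch below shows that calculation on made-up labels; the function name and the example data are hypothetical and are not taken from the paper.

```python
from typing import List


def percent_agreement(verifier_labels: List[bool], reference_labels: List[bool]) -> float:
    """Percentage of items where the verifier matches the reference label."""
    if len(verifier_labels) != len(reference_labels):
        raise ValueError("label lists must have the same length")
    if not verifier_labels:
        raise ValueError("at least one labeled item is required")
    matches = sum(v == r for v, r in zip(verifier_labels, reference_labels))
    return 100.0 * matches / len(verifier_labels)


# Illustrative labels only: True means "the source supports the statement".
verifier = [True, True, False, True]
doctors = [True, False, False, True]
print(percent_agreement(verifier, doctors))
```

Percent agreement does not correct for chance agreement, so it is worth reporting alongside the label distribution.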

RoG: Ground LLM plans on knowledge‑graph relation paths for faithful, interpretable KGQA

0.60

0.50

38

RoG reduces hallucinations by grounding LLM reasoning in KG facts and provides traceable, human-readable paths—this improves accuracy and trust on KG-backed QA without retraining every LLM.

Key finding

RoG sets new best scores on standard KGQA benchmarks.

Numbers: WebQSP Hits@1 85.7; F1 70.8. CWQ Hits@1 62.6; F1 56.2.

Detect hallucinated facts from any black‑box LLM by sampling its own alternative outputs

0.60

0.70

0.45

33

You can flag likely false claims from closed-source LLMs without buying or building knowledge bases; this reduces misinformation risk in customer-facing text generation.

Key finding

Prompt-based SelfCheckGPT achieved the strongest results at both sentence and passage levels.

Numbers: Sentence AUC-PR (NonFact)=93.42; Passage Pearson=78.32 (Table 2)
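The sampling idea can be sketched as follows: draw several alternative outputs for the same prompt, then flag a sentence when few of those samples support it. The code below is an assumption-laden stand-in, not the prompt-based variant reported above: it replaces the LLM judgment with token overlap, and `inconsistency_score` and `min_overlap` are hypothetical names.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def inconsistency_score(sentence: str, samples: List[str], min_overlap: float = 0.6) -> float:
    """Share of sampled passages that do NOT support the sentence.

    A sample counts as supporting when it contains at least `min_overlap`
    of the sentence's tokens (a crude proxy for a real consistency judge).
    """
    sentence_tokens = _tokens(sentence)
    if not sentence_tokens or not samples:
        raise ValueError("need a non-empty sentence and at least one sample")
    unsupported = 0
    for sample in samples:
        overlap = len(sentence_tokens & _tokens(sample)) / len(sentence_tokens)
        if overlap < min_overlap:
            unsupported += 1
    return unsupported / len(samples)


samples = [
    "Marie Curie won two Nobel Prizes in physics and chemistry.",
    "Curie won two Nobel Prizes.",
    "She was born in Warsaw.",
]
print(inconsistency_score("Marie Curie won two Nobel Prizes", samples))
```

Higher scores mean the sentence is less consistent with the model's own alternative outputs and is a stronger candidate for review.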

Top legal AI tools still hallucinate: 17–33% of answers are false or misleading

0.45

0.25

0.30

32

Major legal AI products still produce false or misleading legal claims often enough that lawyers must verify outputs, which affects liability, trust, and the realized efficiency gains.

Key finding

Lexis+ AI provided accurate (correct + grounded) answers for 65% of queries.

Numbers: 65% accuracy (Figure 4; Section 6.1)

Taxonomy and lightweight mitigation of hallucinations in LLM-generated code

0.50

0.60

30

Hallucinations in LLM-generated code often break functionality and raise debugging, maintenance, and security costs; detecting and reducing them yields outsized gains in correctness without retraining models.

Key finding

Code hallucinations occur often: 1,134 of 3,120 samples contained hallucinations.

Numbers: 1,134/3,120 samples; 1,212 hallucinatory snippets

Woodpecker: a training-free post-hoc pipeline that finds and fixes image hallucinations with vision experts

0.60

0.70

22

You can reduce image-based hallucinations and raise trust without retraining models by adding a post-hoc verifier that extracts claims, checks them with detectors/VQA, and rewrites outputs with bounding-box evidence.

Key finding

Applying Woodpecker to MiniGPT-4 increased POPE object-existence accuracy from 54.67% to 85.33%.

Numbers: 54.67% → 85.33% (Δ +30.66)

KaLMA + BioKaLMA: benchmark and metrics to attribute LLM outputs to knowledge graphs

0.45

0.60

0.40

22

Attributing LLM outputs to structured KGs and marking missing facts ([NA]) makes generated content more verifiable and helps reduce risk in finance, law, and healthcare where factual traceability matters.

Key finding

The BioKaLMA benchmark contains 1,085 entries, with an average of 6.8 KG facts per question.

Numbers: 1,085 entries; avg 6.8 KG facts per question

Hierarchical ReAct agents ground LLMs to Materials Project data and run language-driven simulations with near-zero hallucination

0.70

0.55

0.60

21

Grounding LLMs to authoritative databases and tools reduces dangerous hallucinations and lets teams automate reproducible workflows (data fetch → simulation → analysis) without model fine-tuning, cutting verification time and accelerating materials R&D.

Key finding

LLaMP reduces bulk-modulus prediction error compared to web-augmented GPT-4 and other baselines.

Numbers: Bulk modulus MAE = 14.57 GPa (LLaMP) vs ~41 GPa (GPT-4/GPT-4+Serp) on evaluated set

ChatGPT can produce word-level explanations that match classic methods on faithfulness but differ sharply in form and reliability

0.60

0.40

0.70

19

LLM self-explanations can replace expensive explainer runs for quick audits and UX features, but their coarseness and prompt sensitivity mean you must validate high-stakes uses with extra tests or humans.

Key finding

Self-explanations score similarly to LIME and occlusion on faithfulness metrics.

Numbers: E-P comprehensiveness: SELFEXP 0.19 vs LIME 0.17 (Table VII)

Practical checklist to measure, detect, and reduce LLM hallucinations in healthcare

0.40

0.20

0.30

14

In healthcare, LLM mistakes can harm patients and create liability. Measuring and mitigating hallucinations is necessary before deploying models in clinical workflows.

Key finding

A study found ~25% of generated summaries contained hallucinated content.

Numbers: 25% hallucinated summaries

MAIRA-2: a multimodal chest X‑ray model that generates grounded findings and RadFact, an LLM-based sentence-level evaluator

0.40

0.60

0.35

13

MAIRA-2 produces editable, locally grounded draft radiology findings and is accompanied by an LLM-based evaluator (RadFact); together they reduce reviewer effort and support rapid prototyping of clinical draft-assist tools, but human oversight remains mandatory.

Key finding

MAIRA-2 achieves strong lexical and clinical gains on MIMIC-CXR compared to earlier systems.

Numbers: ROUGE-L 38.4; BLEU-4 23.1; RadGraph-F1 34.6 (Table D.1)

LLM 'hallucinations' are narrative-rich confabulations that can improve coherence and may be useful

0.30

0.60

0.40

11

Hallucinations often produce more coherent, story-like text; that trait can be useful for product flows that prioritize readability, persuasion, or ideation, but it creates risk in truth-sensitive domains and needs human validation.

Key finding

Hallucinated dialog responses have higher mean narrativity than truthful responses on the evaluated benchmarks.

Numbers: FaithDial mean 0.620 vs truth 0.518 (Δ≈0.102); HaluEval 0.655 vs 0.638 (Δ≈0.017); BEGIN 0.658 vs 0.561 (Δ≈0.097)

Survey: When LLM hallucinations become a source of creativity

0.45

0.55

0.20

10

Hallucinations can be both a liability and a creative asset; companies should guard critical outputs while experimenting with hallucination-driven ideation in low-risk workflows.

Key finding

Hallucinations are usually split into factuality (wrong facts) and faithfulness (mismatch with instructions or context).

Prevent hallucinations by checking whether the model 'knows' concepts before answering

0.60

0.70

0.40

10

SELF-FAMILIARITY can reduce incorrect or fabricated outputs by blocking low-familiarity prompts before generation, improving customer trust and reducing downstream fact-checking costs.

Key finding

SELF-FAMILIARITY outperforms baselines on hallucinatory-instruction classification.

Numbers: Vicuna AUC=0.927 vs best baseline 0.872 (Table 2)
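The pre-generation gate described above can be sketched as a threshold on per-concept familiarity scores. Everything in this sketch is assumed for illustration: how the scores are produced, the use of the minimum as the aggregate, the 0.5 threshold, and the name `should_answer` are not taken from the paper.

```python
from typing import Dict


def should_answer(concept_scores: Dict[str, float], threshold: float = 0.5) -> bool:
    """Allow generation only if every extracted concept is familiar enough.

    `concept_scores` maps each concept found in the prompt to a familiarity
    score in [0, 1]; how those scores are computed is out of scope here.
    """
    if not concept_scores:
        # No concepts were extracted, so there is nothing to block on.
        return True
    return min(concept_scores.values()) >= threshold


# Hypothetical scores: one well-known concept and one invented one.
print(should_answer({"aspirin": 0.9, "zorblax syndrome": 0.1}))
print(should_answer({"aspirin": 0.9}))
```

Using the minimum means a single unfamiliar concept is enough to block the prompt, which favors caution over coverage.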

AttnLRP: a faithful, efficient LRP variant that attributes attention and latent neurons in transformers

0.70

0.60

0.70

9

AttnLRP gives faster, more faithful explanations for transformer decisions, lowering debugging cost and energy compared to perturbation; it also exposes neurons you can target to reduce hallucinations or bias.

Key finding

AttnLRP yields higher faithfulness than prior LRP variants on next‑token/classification perturbation tests.

Numbers: Wikipedia perturbation area: AttnLRP 10.93 vs CP‑LRP 7.85 (∆=+3.08)

FEWL: score and reduce LLM hallucination using other LLMs instead of human gold labels

0.60

0.55

0.80

8

FEWL lets teams detect and reduce hallucination cheaply when human gold labels are unavailable, cutting annotation cost and speeding up iteration on safety and quality.

Key finding

FEWL gives more accurate hallucination scores than simple baselines on CHALE.

Numbers: FEWL: 70.36% vs best baseline ~68.95% (Non-hallu vs Hallu on CHALE)

New, harder medical QA datasets (JAMA Clinical Challenge, Medbullets) expose limits of LLMs for clinical reasoning and explanations

0.30

0.40

0.25

8

If you plan to deploy LLMs in clinical workflows, expect lower accuracy and shaky explanations on realistic, complex cases; include clinician review and dataset-specific testing before adoption.

Key finding

The authors release two expert-explained datasets: JAMA Clinical Challenge (1,524 cases) and Medbullets (308 cases).

Numbers: JAMA=1,524; Medbullets=308; Table 1

RAGBench: 100k explainable RAG examples plus TRACe — practical metrics to audit retriever+generator systems

0.70

0.55

0.60

8

RAGBench + TRACe gives a unified, explainable way to audit retriever and generator components, reducing costly trial-and-error and surfacing whether errors come from the retriever, the generator, or both.

Key finding

RAGBench totals approximately 100k labeled RAG examples.

Numbers: 100k total; Train 78k / Val 12k / Test 11k

Large-scale tests show where hallucinations come from, when common fixes help, and when they backfire

0.60

0.50

0.60

7

Hallucinations cause real-world harm (wrong facts, bad decisions). The paper gives practical, tested levers—retrieve relevant docs, apply RLHF, tune instruction mix, and be careful with quantization and aggressive sampling—so teams can reduce factual errors quickly.

Key finding

The GPT-4 based two-step detector (fact extraction + fact judgement) matches human labels at high rates.

Numbers: Agreement 91.5%–94.7% across five domains