LLMs misjudge mixed-context hallucinations: external retrieval helps but factual cases remain hard

Overview

Decision Snapshot: Needs Validation

The paper gives solid empirical signals and a reusable dataset, but results depend heavily on retrieval coverage and KB quality, so pilot before production.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 45%

Authors

Siya Qi, Rui Cao, Yulan He, Zheng Yuan


Why It Matters For Business

Automated LLM judges misclassify mixed-source hallucinations; adding targeted retrieval and reflection reduces errors and makes automated quality checks more reliable.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

The authors build FHSumBench (1,336 balanced summarization examples with injected entity facts) to test whether LLMs can distinguish factual hallucinations (true facts not supported by the document) from non-factual hallucinations (false facts). They compare direct LLM judges, prompting techniques (in-context learning, ICL, and chain-of-thought, CoT), and three retrieval strategies (knowledge, concurrent, reflection). Main findings: retrieval-based methods, reflective retrieval in particular, reduce errors caused by the model's internal knowledge, but detecting factual hallucinations remains the main bottleneck. Scaling up the model sometimes helps but does not yield consistent gains.

Problem Statement

Real-world hallucinations mix two contexts: the source document (faithfulness) and world knowledge (factuality). Existing evaluation work treats these separately. We need a balanced, scalable benchmark and an analysis of whether LLMs can reliably judge mixed-context hallucinations in summarization.
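The two contexts above reduce to two yes/no questions per injected fact: is it supported by the source document, and is it true in the world? The following is a minimal sketch of that taxonomy, assuming both answers come from some upstream checker. The names `Label` and `classify_fact` are illustrative and do not come from the paper or its code.

```python
from enum import Enum


class Label(Enum):
    NO_HALLUCINATION = "no-hallucination"
    FACTUAL = "factual hallucination"
    NON_FACTUAL = "non-factual hallucination"


def classify_fact(supported_by_document: bool, true_in_world: bool) -> Label:
    """Map the two context checks onto the three labels (illustrative only)."""
    # Faithfulness first: anything the document supports is not a hallucination.
    if supported_by_document:
        return Label.NO_HALLUCINATION
    # Unsupported by the document: world knowledge decides the hallucination type.
    return Label.FACTUAL if true_in_world else Label.NON_FACTUAL
```

A judge that relies only on its internal knowledge tends to answer the second question and skip the first, which is why true-but-unsupported facts are the hard case.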

Main Contribution

FHSumBench: an automated, balanced benchmark of 1,336 summary examples with injected factual and non-factual entity facts.

Systematic comparison of direct LLM judging, prompting (ICL, CoT), and three retrieval strategies (knowledge, concurrent, reflection) for mixed-context hallucination detection.

Key Findings

FHSumBench size and balance

Numbers: 1,336 samples; roughly equal across factual, non-factual, no-hallucination

Practical Use: Use this balanced set for stress-testing LLM judges; it highlights weaknesses obscured by skewed datasets.

Evidence Ref: §3, Table 4

Retrieval coverage limits factual checks

Numbers: FactScore found evidence for 55.2% (FHSumBench) and 45% (M-XSum) of entities

Practical Use: If your retriever misses ~45–55% of entities, factuality judgments will be unreliable; measure coverage before trusting judgments.

Evidence Ref: §5.3, C.2
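Entity retrieval coverage can be measured as the fraction of entities for which the retriever returns at least one piece of evidence. The sketch below assumes a generic `retrieve` callable and a toy in-memory knowledge base; both are hypothetical stand-ins for your own retriever, not the FactScore pipeline used in the paper.

```python
from typing import Callable, Iterable, List


def retrieval_coverage(
    entities: Iterable[str],
    retrieve: Callable[[str], List[str]],
) -> float:
    """Fraction of entities with at least one retrieved evidence passage."""
    entity_list = list(entities)
    if not entity_list:
        return 0.0
    hits = sum(1 for entity in entity_list if retrieve(entity))
    return hits / len(entity_list)


# Toy knowledge base (hypothetical data, for illustration only).
toy_kb = {
    "Marie Curie": ["Physicist and chemist who studied radioactivity."],
    "Paris": ["Capital of France."],
}
sample_entities = ["Marie Curie", "Paris", "Acme Widgets Ltd", "John Q. Example"]

coverage = retrieval_coverage(sample_entities, lambda e: toy_kb.get(e, []))
print(f"coverage = {coverage:.1%}")  # prints "coverage = 50.0%"
```

Running this over the entities extracted from your own summaries gives a number directly comparable to the 55.2% and 45% figures above.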

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
FHSumBench size	1,336 samples (balanced across 3 labels)	—	—	FHSumBench	§3, Table 4	§3
Direct judge best F1	0.4733	vanilla judge	≈+0.15 vs random baseline F1≈0.3185	FHSumBench	Table 1 (Qwen2.5-14B +CoT)	Table 1

What To Try In 7 Days

Run your summarization outputs through a retrieval-based judge and compare outcomes to your current automatic checks.

Measure entity retrieval coverage on your domain; if <70%, add specialized KBs or generated descriptions before judging.

Build a small balanced validation set (factual, non-factual, no-hallucination) and test prompting (ICL/CoT) vs reflection retrieval.
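Scoring a judge on such a balanced validation set can be sketched as follows. The code assumes that the per-class accuracies (FH-Acc, NFH-Acc, NoH-Acc) are the share of gold examples of each class that the judge labels correctly; that reading, the label strings, and the function names are assumptions for illustration, not the paper's evaluation script.

```python
from collections import Counter
from typing import Dict, Sequence

LABELS = ("factual", "non-factual", "none")  # illustrative label strings


def per_class_accuracy(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy restricted to the gold examples of each class."""
    totals = Counter(gold)
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    return {lab: correct[lab] / totals[lab] for lab in LABELS if totals[lab]}


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for lab in LABELS:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: the judge misses one factual hallucination and calls it "none".
gold = ["factual", "factual", "non-factual", "non-factual", "none", "none"]
pred = ["none", "factual", "non-factual", "non-factual", "none", "none"]

print(per_class_accuracy(gold, pred))
print(round(macro_f1(gold, pred), 4))
```

Reporting the three accuracies separately matters because a single aggregate score can hide a judge that does well on non-factual cases and poorly on factual ones.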

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/cece00/FHSumBench

Data URLs

https://github.com/cece00/FHSumBench

Risks & Boundaries

Limitations

Constructed hallucinations focus on entity appositive injections; discourse- and event-level hallucinations are not covered.

Evaluation and dataset are English-only (news) and may not generalize cross-lingually or to other domains.

When Not To Use

When you need discourse-level hallucination checks rather than entity-level ones.

When your domain lacks good entity coverage in KBs or generated descriptions.

Failure Modes

Intrinsic knowledge bias: judge ignores external evidence and trusts internal model facts.

Retrieval noise / poor coverage: missing or noisy evidence leads to wrong factuality labels.

Core Entities

Models

Llama3-8B, Qwen2.5-14B, Qwen2.5-32B, Qwen2.5-72B, GPT-4o, Qwen family, Llama family

Metrics

precision, recall, F1, FH-Acc, NFH-Acc, NoH-Acc

Datasets

FHSumBench, M-XSum, XEnt, FactCollect, XSum, CNN/DM

Benchmarks

FHSumBench
