LLMs misjudge mixed-context hallucinations: external retrieval helps but factual cases remain hard

March 3, 2025 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.45

Cost Impact Score

0.5

Citation Count

0

Authors

Siya Qi, Rui Cao, Yulan He, Zheng Yuan

Why It Matters For Business

Automated LLM judges misclassify mixed-source hallucinations; adding targeted retrieval and reflection reduces errors and makes automated quality checks more reliable.

Summary TLDR

The authors build FHSumBench (1,336 balanced summarization examples with injected entity facts) to test whether LLMs can tell apart factual hallucinations (true facts not supported by the document) and non-factual hallucinations (false facts). They compare direct LLM judges, prompting tricks (ICL, CoT), and three retrieval strategies. Main findings: retrieval-based methods and reflective retrieval reduce errors from the model's internal knowledge, but detecting factual hallucinations is still the main bottleneck. Model scaling helps sometimes but does not guarantee steady gains.

Problem Statement

Real-world hallucinations mix two contexts: the source document (faithfulness) and world knowledge (factuality). Existing evaluation work treats these separately. We need a balanced, scalable benchmark and an analysis of whether LLMs can reliably judge mixed-context hallucinations in summarization.
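The two contexts combine into the three labels used throughout this summary. A minimal sketch of that mapping follows; the `Label` enum and `classify_claim` function are illustrative names, not the authors' code, and treating a source-supported claim as "no hallucination" regardless of its real-world truth is an assumption of this sketch.

```python
from enum import Enum


class Label(Enum):
    NO_HALLUCINATION = "no-hallucination"
    FACTUAL_HALLUCINATION = "factual hallucination"
    NON_FACTUAL_HALLUCINATION = "non-factual hallucination"


def classify_claim(supported_by_source: bool, true_in_world: bool) -> Label:
    """Map the two context checks onto a mixed-context label.

    Assumption: a claim supported by the source document counts as
    faithful, so it is labeled no-hallucination without a world check.
    """
    if supported_by_source:
        return Label.NO_HALLUCINATION
    if true_in_world:
        # True fact, but the document never states it.
        return Label.FACTUAL_HALLUCINATION
    # Neither supported by the document nor true.
    return Label.NON_FACTUAL_HALLUCINATION
```

The factual case is the one the paper reports as hardest: the judge's own knowledge says the claim is true, so it tends to miss that the document does not support it.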

Main Contribution

FHSumBench: an automated, balanced benchmark of 1,336 summary examples with injected factual and non-factual entity facts.

Systematic comparison of direct LLM judging, prompting (ICL, CoT), and three retrieval strategies (knowledge, concurrent, reflection) for mixed-context hallucination detection.

Empirical diagnosis: intrinsic LLM knowledge biases judges and factual hallucinations are the hardest category; retrieval and reflection help but require good retrieval coverage.
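To make the idea of an injected entity fact concrete, here is a rough sketch of appositive injection. It is not the authors' construction pipeline: `inject_appositive` is a hypothetical helper, the example sentence is invented for illustration, and the string handling is deliberately naive (first mention only, no handling of punctuation right after the entity).

```python
def inject_appositive(summary: str, entity: str, description: str) -> str:
    """Insert ', <description>,' after the first mention of `entity`.

    Naive illustration only: real summaries need tokenization and
    punctuation handling that this sketch does not attempt.
    """
    start = summary.find(entity)
    if start == -1:
        raise ValueError(f"entity {entity!r} not found in summary")
    end = start + len(entity)
    return f"{summary[:end]}, {description},{summary[end:]}"


# Illustrative summary (not taken from FHSumBench).
summary = "Marie Curie won the Nobel Prize in 1903."

# A true fact that the source document does not mention: factual hallucination.
factual = inject_appositive(summary, "Marie Curie", "a physicist born in Warsaw")

# A false fact: non-factual hallucination.
non_factual = inject_appositive(summary, "Marie Curie", "a physicist born in Paris")
```

Both outputs are equally unfaithful to the source; only a check against world knowledge separates them.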

Key Findings

FHSumBench size and balance

Numbers: 1,336 samples; roughly equal across factual, non-factual, no-hallucination

Retrieval coverage limits factual checks

Numbers: FactScore found evidence for 55.2% (FHSumBench) and 45% (M-XSum) of entities

Intrinsic knowledge causes misclassification

Numbers: 63.3% of GPT-4o CoT error cases relied on intrinsic knowledge

Retrieval corrects many judge errors

Numbers: Concurrent retrieval corrected 47.3% and reflection retrieval corrected 57.9% of 'no-hallucination' errors in the case study

Best direct/retrieval F1 scores (representative)

Numbers: Best direct judge F1 ≈ 0.4733 (Qwen2.5-14B + CoT); retrieval F1 ≈ 0.4715 (GPT-4o, Table 2)

Model scaling is not strictly monotonic

Numbers: Qwen2.5-32B often improves, but Qwen2.5-72B and GPT-4o do not always outperform smaller sizes

Results

FHSumBench size

Value: 1,336 samples (balanced across 3 labels)

Direct judge best F1

Value: 0.4733

Baseline: vanilla judge

Reflection retrieval F1 (representative)

Value: 0.4640

Baseline: knowledge retrieval

Entity retrieval coverage (FactScore)

Value: 55.2% (FHSumBench); 45% (M-XSum)

GPT-4o CoT error cause rate

Value: 63.3% of errors due to intrinsic knowledge reliance

Retrieval correction rates (case study)

Value: concurrent retrieval (CR) corrected 47.3%; reflection retrieval (RR) corrected 57.9%

Baseline: direct judge errors
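A correction rate of this kind can be read as the share of baseline-judge errors that a second judge gets right. The sketch below shows that arithmetic on toy labels; `correction_rate` is a hypothetical helper, and the exact definition used in the paper's case study is an assumption here.

```python
from typing import Sequence


def correction_rate(
    gold: Sequence[str],
    baseline_pred: Sequence[str],
    new_pred: Sequence[str],
) -> float:
    """Fraction of baseline errors that the new judge labels correctly."""
    error_idx = [i for i, (g, b) in enumerate(zip(gold, baseline_pred)) if g != b]
    if not error_idx:
        return 0.0  # nothing to correct
    fixed = sum(1 for i in error_idx if new_pred[i] == gold[i])
    return fixed / len(error_idx)


# Toy labels, not data from the paper.
gold = ["FH", "NFH", "NoH", "FH"]
direct = ["NoH", "NoH", "NoH", "NoH"]   # direct judge: 3 errors
retrieval = ["FH", "NoH", "NoH", "NFH"]  # retrieval judge fixes 1 of them
rate = correction_rate(gold, direct, retrieval)
```

Note that this measure ignores cases the baseline got right and the new judge breaks, so it should be read alongside overall F1.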

Who Should Care

What To Try In 7 Days

Run your summarization outputs through a retrieval-based judge and compare outcomes to your current automatic checks.

Measure entity retrieval coverage on your domain; if it is below 70%, add specialized KBs or generated descriptions before judging.

Build a small balanced validation set (factual, non-factual, no-hallucination) and test prompting (ICL/CoT) vs reflection retrieval.
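The coverage check in the second step can be sketched as follows. The `lookup` callable stands in for whatever retriever or knowledge base you use, the toy dictionary is invented for illustration, and `retrieval_coverage` is a hypothetical helper rather than part of FactScore or any other library.

```python
from typing import Callable, Iterable, Optional


def retrieval_coverage(
    entities: Iterable[str],
    lookup: Callable[[str], Optional[str]],
) -> float:
    """Share of entities for which `lookup` returns non-empty evidence."""
    entity_list = list(entities)
    if not entity_list:
        return 0.0
    hits = sum(1 for entity in entity_list if lookup(entity))
    return hits / len(entity_list)


# Toy knowledge base (assumption: your real lookup queries a KB or search index).
toy_kb = {
    "Marie Curie": "Physicist and chemist.",
    "Warsaw": "Capital of Poland.",
}
entities = ["Marie Curie", "Warsaw", "Acme Widgets Ltd", "J. Doe"]

coverage = retrieval_coverage(entities, toy_kb.get)
COVERAGE_THRESHOLD = 0.70  # the 70% rule of thumb from the step above
needs_more_sources = coverage < COVERAGE_THRESHOLD
```

Running this over the entities extracted from a sample of your own summaries gives a quick signal of whether a retrieval-based judge has enough evidence to work with.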

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Constructed hallucinations focus on entity appositive injections; discourse- and event-level hallucinations are not covered.
  • Evaluation and dataset are English-only (news) and may not generalize cross-lingually or to other domains.
  • Benchmark structure could be gamed by targeted training on appositive patterns (benchmark hacking risk).
  • Retrieval coverage is a practical bottleneck: public KBs miss many entities.

When Not To Use

  • When you need discourse-level hallucination checks rather than entity-level ones.
  • When your domain lacks good entity coverage in KBs or generated descriptions.
  • When you need language or cultural coverage beyond English news.

Failure Modes

  • Intrinsic knowledge bias: judge ignores external evidence and trusts internal model facts.
  • Retrieval noise / poor coverage: missing or noisy evidence leads to wrong factuality labels.
  • Overcorrection during iterative reflection: model flips correct judgements after extra queries.
  • Benchmark overfitting: training to detect appositive injections rather than true mixed-context errors.

Core Entities

Models

  • Llama3-8B
  • Qwen2.5-14B
  • Qwen2.5-32B
  • Qwen2.5-72B
  • GPT-4o
  • Qwen family
  • Llama family

Metrics

  • precision
  • recall
  • F1
  • FH-Acc
  • NFH-Acc
  • NoH-Acc
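As a rough illustration of how these metrics relate, the sketch below computes per-class accuracy and a macro-averaged F1 over the three labels. Two assumptions are made here and are not confirmed by this summary: that FH-Acc, NFH-Acc, and NoH-Acc denote accuracy restricted to samples whose gold label is that class, and that the reported F1 is macro-averaged.

```python
from typing import Dict, Sequence

LABELS = ("FH", "NFH", "NoH")  # factual, non-factual, no hallucination


def per_class_accuracy(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy on the subset of samples whose gold label is each class."""
    result = {}
    for label in LABELS:
        idx = [i for i, g in enumerate(gold) if g == label]
        correct = sum(1 for i in idx if pred[i] == label)
        result[label] = correct / len(idx) if idx else 0.0
    return result


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in LABELS:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, not data from the paper.
gold = ["FH", "FH", "NFH", "NoH"]
pred = ["FH", "NoH", "NFH", "NoH"]
acc = per_class_accuracy(gold, pred)
score = macro_f1(gold, pred)
```

Reporting the per-class numbers alongside F1 is what exposes the pattern described in this summary: a judge can look reasonable overall while scoring poorly on the factual-hallucination class.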

Datasets

  • FHSumBench
  • M-XSum
  • XEnt
  • FactCollect
  • XSum
  • CNN/DM

Benchmarks

  • FHSumBench