Overview
Production Readiness
0.5
Novelty Score
0.45
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Automated LLM judges misclassify mixed-context hallucinations; adding targeted retrieval and reflection reduces these errors and makes automated quality checks more reliable.
Summary TLDR
The authors build FHSumBench (1,336 balanced summarization examples with injected entity facts) to test whether LLMs can distinguish factual hallucinations (true facts not supported by the document) from non-factual hallucinations (false facts). They compare direct LLM judges, prompting techniques (ICL, CoT), and three retrieval strategies (knowledge, concurrent, reflection). Main findings: retrieval-based methods, reflective retrieval in particular, reduce errors caused by the judge's reliance on its internal knowledge, but detecting factual hallucinations remains the main bottleneck. Scaling up the model sometimes helps, but the gains are not monotonic.
Problem Statement
Real-world hallucinations mix two contexts: the source document (faithfulness) and world knowledge (factuality). Existing evaluation work treats the two separately. What is missing is a balanced, scalable benchmark and an analysis of whether LLMs can reliably judge mixed-context hallucinations in summarization.
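The two contexts above imply a three-way label for each injected fact. The following is a minimal sketch of that decision rule, not the authors' implementation. The names `Label` and `classify_claim` are hypothetical, and the sketch assumes that any claim supported by the source document counts as "no hallucination" regardless of its world-knowledge status.

```python
from enum import Enum


class Label(Enum):
    NO_HALLUCINATION = "no_hallucination"
    FACTUAL_HALLUCINATION = "factual_hallucination"
    NON_FACTUAL_HALLUCINATION = "non_factual_hallucination"


def classify_claim(supported_by_document: bool, true_in_world: bool) -> Label:
    """Map the two context checks onto one mixed-context label.

    Assumption (not stated in the source): faithfulness is checked first,
    so a document-supported claim is never labelled a hallucination.
    """
    if supported_by_document:
        return Label.NO_HALLUCINATION
    if true_in_world:
        # True fact, but the document does not support it.
        return Label.FACTUAL_HALLUCINATION
    # Unsupported by the document and false in the world.
    return Label.NON_FACTUAL_HALLUCINATION
```

In practice the two boolean inputs are the hard part: the first needs a faithfulness check against the document, the second needs external evidence, which is where retrieval comes in.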
Main Contribution
FHSumBench: an automated, balanced benchmark of 1,336 summary examples with injected factual and non-factual entity facts.
Systematic comparison of direct LLM judging, prompting (ICL, CoT), and three retrieval strategies (knowledge, concurrent, reflection) for mixed-context hallucination detection.
Empirical diagnosis: intrinsic LLM knowledge biases judges, and factual hallucinations are the hardest category; retrieval and reflection help but require good retrieval coverage.
Key Findings
FHSumBench size and balance
Retrieval coverage limits factual checks
Intrinsic knowledge causes misclassification
Retrieval corrects many judge errors
Best direct/retrieval F1 scores (representative)
Model scaling is not strictly monotonic
Results
FHSumBench size
Direct judge best F1
Reflection retrieval F1 (representative)
Entity retrieval coverage (FactScore)
GPT-4o CoT error cause rate
Retrieval correction rates (case study)
Who Should Care
What To Try In 7 Days
Run your summarization outputs through a retrieval-based judge and compare outcomes to your current automatic checks.
Measure entity retrieval coverage on your domain; if coverage is below 70%, add specialized KBs or generated descriptions before judging.
Build a small balanced validation set (factual, non-factual, no-hallucination) and test prompting (ICL/CoT) vs reflection retrieval.
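The coverage check in the second step can be sketched as follows. This is an illustrative example only: `retrieval_coverage`, the `lookup` callable, and the toy knowledge base are hypothetical stand-ins for whatever entity extractor and KB client you use. The 70% threshold comes from the recommendation above.

```python
from typing import Callable, Iterable, Optional

COVERAGE_THRESHOLD = 0.70  # threshold taken from the recommendation above


def retrieval_coverage(
    entities: Iterable[str],
    lookup: Callable[[str], Optional[str]],
) -> float:
    """Return the fraction of entities for which lookup yields evidence.

    `lookup` is a hypothetical retrieval function: it returns a non-empty
    description when the entity is found and None (or "") otherwise.
    """
    entity_list = list(entities)
    if not entity_list:
        return 0.0
    hits = sum(1 for entity in entity_list if lookup(entity))
    return hits / len(entity_list)


# Toy knowledge base standing in for a real KB or search backend.
toy_kb = {
    "Ada Lovelace": "English mathematician",
    "Alan Turing": "English computer scientist",
    "Grace Hopper": "American computer scientist",
}
summary_entities = ["Ada Lovelace", "Alan Turing", "Grace Hopper", "Unknown Person"]

coverage = retrieval_coverage(summary_entities, toy_kb.get)
needs_more_sources = coverage < COVERAGE_THRESHOLD
```

If `needs_more_sources` is true on a representative sample of your summaries, factuality labels from a retrieval-based judge are likely to be limited by missing evidence rather than by the judge itself.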
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Constructed hallucinations focus on entity appositive injections; discourse- and event-level hallucinations are not covered.
- Evaluation and dataset are English-only (news) and may not generalize cross-lingually or to other domains.
- Benchmark structure could be gamed by targeted training on appositive patterns (benchmark hacking risk).
- Retrieval coverage is a practical bottleneck: public KBs miss many entities.
When Not To Use
- When you need discourse-level hallucination checks rather than entity-level ones.
- When your domain lacks good entity coverage in KBs or generated descriptions.
- When you need language or cultural coverage beyond English news.
Failure Modes
- Intrinsic knowledge bias: judge ignores external evidence and trusts internal model facts.
- Retrieval noise / poor coverage: missing or noisy evidence leads to wrong factuality labels.
- Overcorrection during iterative reflection: model flips correct judgements after extra queries.
- Benchmark overfitting: training to detect appositive injections rather than true mixed-context errors.
Core Entities
Models
- Llama3-8B
- Qwen2.5-14B
- Qwen2.5-32B
- Qwen2.5-72B
- GPT-4o
- Qwen family
- Llama family
Metrics
- precision
- recall
- F1
- FH-Acc
- NFH-Acc
- NoH-Acc
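The metrics listed above can be computed from gold and predicted labels as in the sketch below. It assumes, without confirmation from the source, that FH-Acc, NFH-Acc, and NoH-Acc denote accuracy restricted to examples whose gold label is factual hallucination, non-factual hallucination, and no hallucination respectively (which equals per-class recall). The function name and the short label strings are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence

LABELS = ("FH", "NFH", "NoH")  # hypothetical short names for the three classes


def per_class_scores(
    gold: Sequence[str], pred: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, and F1 for each of the three labels.

    Under the stated assumption, the per-class "Acc" values correspond to
    the `recall` entry for each label.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    correct = Counter(g for g, p in zip(gold, pred) if g == p)

    scores = {}
    for label in LABELS:
        precision = correct[label] / pred_counts[label] if pred_counts[label] else 0.0
        recall = correct[label] / gold_counts[label] if gold_counts[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


example = per_class_scores(
    gold=["FH", "FH", "NFH", "NoH"],
    pred=["FH", "NFH", "NFH", "NoH"],
)
```

Reporting the three classes separately matters here because an aggregate score can hide weak performance on factual hallucinations, the category the summary identifies as hardest.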
Datasets
- FHSumBench
- M-XSum
- XEnt
- FactCollect
- XSum
- CNN/DM
Benchmarks
- FHSumBench

