Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
If you deploy chatbots, multi-turn sessions are a real source of hallucination risk: 43.16% of DiaHalu's dialogues contain at least one hallucination. DiaHalu helps you quantify and reproduce those failures so you can prioritize fixes for knowledge and reasoning flows.
Summary TLDR
DiaHalu is a new dialogue-level benchmark for hallucination in large language models (LLMs). It contains 1,103 multi-turn dialogues (avg 6.91 rounds) across four domains (knowledge-grounded, task-oriented, chit-chat, reasoning) and five hallucination subtypes (Non-factual, Incoherence, Irrelevance, Overreliance, Reasoning Error). Samples were generated by ChatGPT3.5 and GPT4, manually cleaned, and labeled by trained annotators (Fleiss' Kappa 0.8842). Overall 43.16% of dialogues contain at least one hallucination. Existing detectors and many LLMs struggle: GPT-4 reaches ~50.1% F1 on binary detection while most models are far lower, showing this dataset is challenging for dialogue-level faith/
Problem Statement
Current hallucination benchmarks are often sentence- or passage-level, hand-triggered, and focus on factual errors. Real chat systems produce multi-turn, context-dependent errors (including faithfulness problems like incoherence, irrelevance, overreliance) that existing datasets and detectors miss. DiaHalu fills this gap by providing naturally generated, annotated multi-turn dialogues covering these problems.
Main Contribution
A dialogue-level hallucination benchmark (DiaHalu) with 1,103 multi-turn dialogues for LLM evaluation.
Coverage of four dialogue domains and five hallucination subtypes, including faithfulness types rarely covered before.
Human-annotated labels with explanations and high inter-annotator agreement (Fleiss' Kappa = 0.8842); baseline detection results show the benchmark is challenging.
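The agreement figure above is a Fleiss' Kappa, which compares observed agreement among annotators with the agreement expected by chance. The sketch below implements the standard textbook formula. The toy ratings and the function name `fleiss_kappa` are illustrative assumptions, not DiaHalu's annotation data or tooling.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per annotator).

    Every item must be rated by the same number of annotators (at least 2).
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: mean over items of the per-item agreement P_i.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_chance = sum(p ** 2 for p in proportions)

    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy example (made-up labels): three annotators, two dialogues.
toy = [["hallucinated"] * 3, ["clean"] * 3]
kappa = fleiss_kappa(toy)  # perfect agreement
```

Values close to 1 indicate near-perfect agreement, so 0.8842 is usually read as very strong consistency.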
Key Findings
Dataset size and structure: 1,103 multi-turn dialogue samples, averaging 6.912 rounds each.
High incidence of hallucination: 43.16% of dialogues contain at least one hallucination.
Domain differences: reasoning and knowledge dialogues show the highest hallucination rates.
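Annotator quality: labels are highly consistent across trained annotators (Fleiss' Kappa = 0.8842).
Detection is hard: even GPT-4 reaches only about 50.1% F1 on binary detection, and most other models and detectors score far lower.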
Hallucination snowballs across rounds: previous hallucinations often reappear and grow.
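The incidence figures above are dialogue-level rates: a dialogue counts once if any round is hallucinated, and 43.16% of 1,103 corresponds to roughly 476 dialogues. A minimal sketch of computing overall and per-domain rates follows. It assumes a hypothetical record layout with `domain` and `hallucinated` fields, which is not necessarily how the released DiaHalu files are structured.

```python
from collections import defaultdict


def hallucination_rates(dialogues):
    """Return (overall_rate, {domain: rate}) for dialogue-level labels.

    Each dialogue is a dict with hypothetical keys:
      "domain"       -> str, e.g. "reasoning"
      "hallucinated" -> bool, True if any round contains a hallucination
    """
    if not dialogues:
        raise ValueError("dialogues must not be empty")
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for dialogue in dialogues:
        totals[dialogue["domain"]] += 1
        flagged[dialogue["domain"]] += int(dialogue["hallucinated"])

    per_domain = {domain: flagged[domain] / totals[domain] for domain in totals}
    overall = sum(flagged.values()) / sum(totals.values())
    return overall, per_domain


# Toy data (made up for illustration only).
toy_dialogues = [
    {"domain": "reasoning", "hallucinated": True},
    {"domain": "reasoning", "hallucinated": False},
    {"domain": "chit-chat", "hallucinated": False},
    {"domain": "knowledge", "hallucinated": True},
]
overall_rate, domain_rates = hallucination_rates(toy_dialogues)
```

Reporting the per-domain rates next to the overall rate keeps a high-error domain from being hidden by a low-error one.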
Results
Dataset size: 1,103 multi-turn dialogues
Average dialogue rounds: 6.912
Overall hallucination rate: 43.16%
Domain hallucination rates
Annotation agreement: Fleiss' Kappa = 0.8842
Best binary detection (overall F1)
Representative baseline (Gemini1.5 PRO overall F1)
Detector behavior: ChatGPT3.5
Who Should Care
What To Try In 7 Days
Run DiaHalu against your chatbot to get a dialogue-level error profile.
Test retrieval augmentation (search/RAG) on knowledge and reasoning dialogues.
Add few-shot or chain-of-thought prompts for your detector and compare F1 gains using DiaHalu.
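To compare detector variants as suggested above, precision, recall, and F1 on the binary "hallucinated" label are enough for a first pass. The sketch below is a plain-Python version of those standard metrics. The label encoding (1 = hallucinated) and the two prediction lists are assumptions made up for illustration.

```python
def binary_prf(gold, pred, positive=1):
    """Precision, recall, and F1 for one positive class.

    gold, pred: equal-length sequences of labels (assumed 1 = hallucinated).
    Returns 0.0 for any metric whose denominator is zero.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy comparison (made-up predictions): baseline prompt vs. few-shot prompt.
gold_labels = [1, 1, 0, 0, 1]
baseline_pred = [1, 0, 0, 0, 0]
few_shot_pred = [1, 0, 0, 1, 1]

baseline_scores = binary_prf(gold_labels, baseline_pred)
few_shot_scores = binary_prf(gold_labels, few_shot_pred)
f1_gain = few_shot_scores[2] - baseline_scores[2]
```

Running both prompt variants over the same dialogues and comparing `f1_gain` gives a like-for-like measure of whether the prompting change helps.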
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Manual alignment of one speaker's turns is time-consuming and costly; generating human-like dialogue required repeated LLM calls.
- Dataset is not split into train/validation/test; authors aimed for an evaluation benchmark, not a supervised training split.
When Not To Use
- If you need a train/validation/test split for supervised learning.
- If you only evaluate single-sentence factuality (sentence- or passage-level tasks).
- For domains outside the four covered types without additional domain-specific data.
Failure Modes
- Hallucination snowballing: early errors propagate and amplify across rounds.
- Judge bias: closed-source LLMs (e.g., ChatGPT3.5) may be overconfident and label many samples 'non-hallucinated'.
- Domain skew: reasoning and knowledge dialogues show the highest hallucination rates, so aggregate results may under-represent chit-chat behavior.
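One way to check for the judge bias described above is to compare how often an LLM judge flags hallucinations with how often the human labels do, and to count how many labeled hallucinations it misses. The sketch below assumes binary labels with 1 = hallucinated. The function name and toy data are hypothetical and do not come from the DiaHalu paper.

```python
def judge_bias_report(gold, pred):
    """Summarize how far a judge's flag rate drifts from the human labels.

    gold, pred: equal-length, non-empty lists of 0/1 labels (1 = hallucinated).
    """
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and equal in length")
    n = len(gold)
    positives = sum(gold)
    missed = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    return {
        "gold_rate": positives / n,          # share labeled hallucinated by humans
        "predicted_rate": sum(pred) / n,     # share flagged by the judge
        "miss_rate": missed / positives if positives else 0.0,
    }


# Toy data (made up): a judge that rarely flags anything.
human_labels = [1, 1, 1, 0, 0, 0, 1, 0]
judge_labels = [1, 0, 0, 0, 0, 0, 0, 0]
report = judge_bias_report(human_labels, judge_labels)
```

A `predicted_rate` far below `gold_rate` together with a high `miss_rate` is the pattern to expect from a judge that defaults to 'non-hallucinated'.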
Core Entities
Models
- ChatGPT3.5
- GPT4
- Gemini1.5 PRO
- LLaMa-30B
- Vicuna-33B
Metrics
- Precision
- Recall
- F1
- micro-F1
- Fleiss' Kappa
Datasets
- TruthfulQA
- CommonsenseQA
- CWQ
- MultiWOZ (2.1)
- DSTC
- GSM8K
- MathQA
Benchmarks
- FactCollect
- BEGIN
- HADES
- FactCHD
- HaluEval
- WikiBio+
- PHD

