Overview
DiaHalu is an evaluation benchmark, not a training set: it ships without train/validation/test splits. Use it to audit chat systems and benchmark hallucination detectors rather than to train production models.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/8
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 50%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
If you deploy chatbots, a large share of multi-turn sessions may contain hallucinations: 43.16% of the dialogues in DiaHalu contain at least one. DiaHalu helps you quantify and reproduce those failures so you can prioritize fixes for knowledge and reasoning flows.
Summary TLDR
DiaHalu is a new dialogue-level benchmark for hallucination in large language models (LLMs). It contains 1,103 multi-turn dialogues (avg 6.91 rounds) across four domains (knowledge-grounded, task-oriented, chit-chat, reasoning) and five hallucination subtypes (Non-factual, Incoherence, Irrelevance, Overreliance, Reasoning Error). Samples were generated by ChatGPT3.5 and GPT4, manually cleaned, and labeled by trained annotators (Fleiss' Kappa 0.8842). Overall 43.16% of dialogues contain at least one hallucination. Existing detectors and many LLMs struggle: GPT-4 reaches ~50.1% F1 on binary detection while most models are far lower, showing this dataset is challenging for dialogue-level faith/
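The Fleiss' Kappa quoted above measures how far annotator agreement exceeds what chance alone would produce. A minimal sketch of that computation follows; the function name `fleiss_kappa`, the label strings, and the three-annotator toy data are illustrative assumptions and are not taken from the DiaHalu release.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Compute Fleiss' kappa.

    ratings: list of per-item label lists; every item must have the
             same number of annotator labels (at least two).
    categories: list of all possible labels.
    """
    if not ratings:
        raise ValueError("need at least one rated item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(r) for r in ratings]
    n_items = len(ratings)

    # Observed agreement: mean over items of the per-item agreement P_i.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    p_category = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in p_category)

    if p_expected == 1.0:
        # Every rating fell into one category; agreement is trivially perfect.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy labels: three annotators, two dialogues each.
LABELS = ["hallucinated", "non-hallucinated"]
unanimous = [["hallucinated"] * 3, ["non-hallucinated"] * 3]
split = [
    ["hallucinated", "hallucinated", "non-hallucinated"],
    ["non-hallucinated", "non-hallucinated", "hallucinated"],
]
print(fleiss_kappa(unanimous, LABELS), fleiss_kappa(split, LABELS))
```

Values close to 1 indicate near-unanimous labeling, while values at or below 0 indicate agreement no better than chance.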
Problem Statement
Current hallucination benchmarks are often sentence- or passage-level, hand-triggered, and focus on factual errors. Real chat systems produce multi-turn, context-dependent errors (including faithfulness problems like incoherence, irrelevance, overreliance) that existing datasets and detectors miss. DiaHalu fills this gap by providing naturally generated, annotated multi-turn dialogues covering these problems.
Main Contribution
A dialogue-level hallucination benchmark (DiaHalu) with 1,103 multi-turn dialogues for LLM evaluation.
Coverage of four dialogue domains and five hallucination subtypes, including faithfulness types rarely covered before.
Key Findings
Dataset size and structure: 1,103 multi-turn dialogue samples, averaging 6.912 rounds per dialogue.
High incidence of hallucination: 43.16% of dialogues contain at least one hallucination.
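The incidence figure is a dialogue-level rate: a dialogue counts once if any of its rounds is hallucinated. A rough sketch of that calculation, broken down by domain, is given here. The record layout (`domain` and `hallucinated` keys) and the four toy records are assumptions for illustration only; the actual DiaHalu file format may differ.

```python
from collections import defaultdict


def hallucination_rates(dialogues):
    """Return (overall_rate, per_domain_rates) for dialogue-level labels."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for dialogue in dialogues:
        totals[dialogue["domain"]] += 1
        if dialogue["hallucinated"]:
            flagged[dialogue["domain"]] += 1

    per_domain = {dom: flagged[dom] / totals[dom] for dom in totals}
    n_total = sum(totals.values())
    overall = sum(flagged.values()) / n_total if n_total else 0.0
    return overall, per_domain


# Hypothetical records; real data would be loaded from the released files.
toy_dialogues = [
    {"domain": "knowledge-grounded", "hallucinated": True},
    {"domain": "knowledge-grounded", "hallucinated": False},
    {"domain": "chit-chat", "hallucinated": False},
    {"domain": "reasoning", "hallucinated": True},
]
overall_rate, domain_rates = hallucination_rates(toy_dialogues)
print(f"overall: {overall_rate:.2%}")
for domain, rate in sorted(domain_rates.items()):
    print(f"{domain}: {rate:.2%}")
```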
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Dataset size | 1,103 dialogues | — | — | — | Total samples produced by ChatGPT3.5 and GPT4 and post-processed (Sec. 4.2, A.5) | Table 2 |
| Average dialogue rounds | 6.9120 rounds | — | — | — | Mean rounds per dialogue (Table 2) | Table 2 |
What To Try In 7 Days
Run DiaHalu against your chatbot to get a dialogue-level error profile.
Test retrieval augmentation (search/RAG) on knowledge and reasoning dialogues.
Add few-shot or chain-of-thought prompts for your detector and compare F1 gains using DiaHalu.
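To compare detector variants as suggested above, score each one against the dialogue-level labels with precision, recall, and F1 on the "hallucinated" class. The following is a minimal sketch; the function name `binary_prf` and the toy gold/prediction lists are hypothetical, and in practice the gold labels would come from DiaHalu and the predictions from your own detector.

```python
def binary_prf(gold, pred):
    """Precision, recall, and F1 for the positive (hallucinated) class.

    gold, pred: equal-length sequences of 0/1 (or False/True) labels.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = dialogue contains a hallucination.
gold_labels = [1, 1, 0, 0]
zero_shot_pred = [1, 0, 1, 0]   # assumed output of a baseline prompt
few_shot_pred = [1, 1, 1, 0]    # assumed output after adding few-shot examples

_, _, f1_zero = binary_prf(gold_labels, zero_shot_pred)
_, _, f1_few = binary_prf(gold_labels, few_shot_pred)
print(f"zero-shot F1: {f1_zero:.3f}, few-shot F1: {f1_few:.3f}")
print(f"F1 gain: {f1_few - f1_zero:+.3f}")
```

Reporting precision and recall alongside F1 helps show whether a prompt change reduces missed hallucinations or merely flags more dialogues overall.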
Risks & Boundaries
Limitations
Manual alignment of one speaker's turns is time-consuming and costly; generating human-like dialogue required repeated LLM calls.
The dataset is not split into train/validation/test sets; the authors intended it as an evaluation benchmark, not as supervised training data.
When Not To Use
If you need a train/validation/test split for supervised learning.
If you only evaluate single-sentence factuality (sentence- or passage-level tasks).
Failure Modes
Hallucination snowballing: early errors propagate and amplify across rounds.
Judge bias: closed-source LLMs (e.g., ChatGPT3.5) may be overconfident and label many samples 'non-hallucinated'.

