DiaHalu: 1,103 multi-turn dialogues to test hallucination in chat-style LLMs

March 1, 2024 · 7 min

Overview

Production Readiness: 0.4

Novelty Score: 0.6

Cost Impact Score: 0.5

Citation Count: 0

Authors

Kedi Chen, Qin Chen, Jie Zhou, Yishen He, Liang He

Why It Matters For Business

If you deploy chatbots, expect hallucinations to be common in multi-turn sessions: 43.16% of DiaHalu's dialogues contain at least one. DiaHalu helps you quantify and reproduce those failures so you can prioritize fixes for knowledge and reasoning flows.

Summary TLDR

DiaHalu is a new dialogue-level benchmark for hallucination in large language models (LLMs). It contains 1,103 multi-turn dialogues (avg 6.91 rounds) across four domains (knowledge-grounded, task-oriented, chit-chat, reasoning) and five hallucination subtypes (Non-factual, Incoherence, Irrelevance, Overreliance, Reasoning Error). Samples were generated by ChatGPT3.5 and GPT4, manually cleaned, and labeled by trained annotators (Fleiss' Kappa 0.8842). Overall 43.16% of dialogues contain at least one hallucination. Existing detectors and many LLMs struggle: GPT-4 reaches ~50.1% F1 on binary detection while most models are far lower, showing this dataset is challenging for dialogue-level faith/

Problem Statement

Current hallucination benchmarks are often sentence- or passage-level, hand-triggered, and focus on factual errors. Real chat systems produce multi-turn, context-dependent errors (including faithfulness problems like incoherence, irrelevance, overreliance) that existing datasets and detectors miss. DiaHalu fills this gap by providing naturally generated, annotated multi-turn dialogues covering these problems.

Main Contribution

A dialogue-level hallucination benchmark (DiaHalu) with 1,103 multi-turn dialogues for LLM evaluation.

Coverage of four dialogue domains and five hallucination subtypes, including faithfulness types rarely covered before.

Human-annotated labels with explanations and high inter-annotator agreement (Fleiss' Kappa = 0.8842); baseline detection results show the benchmark is challenging.
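For readers unfamiliar with the agreement statistic, the following is a minimal sketch of how Fleiss' Kappa is computed from a table of rating counts. The function name `fleiss_kappa` and the toy ratings are illustrative assumptions, not the authors' annotation data or code.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """counts[i][j] = number of annotators who put item i in category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_cats = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        return 1.0  # all ratings in one category: treat as full agreement
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example (hypothetical): 5 dialogues, 3 annotators,
# columns are [hallucinated, not hallucinated].
toy_counts = [[3, 0], [0, 3], [3, 0], [2, 1], [0, 3]]
kappa = fleiss_kappa(toy_counts)
print(round(kappa, 4))
```

Values near 1 indicate that annotators agree far more often than chance would predict, which is how the reported 0.8842 should be read.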

Key Findings

Dataset size and structure: 1,103 multi-turn dialogue samples, averaging 6.912 rounds each.

Numbers: 1,103 samples; avg rounds 6.9120 (Table 2)

High incidence of hallucination: 43.16% of dialogues contain at least one hallucination.

Numbers: 476/1,103 = 43.16% (Table 3)

Domain differences: reasoning and knowledge dialogues show the highest hallucination rates.

Numbers: Reasoning 50.19%; Knowledge 46.36% (Table 3)

Annotator quality: labels are consistent across experts.

Numbers: Fleiss' Kappa = 0.8842 (Appendix A.6)

Detection is hard: even strong LLMs and detectors have low F1 scores.

Numbers: GPT-4 overall F1 = 50.14; Gemini1.5 PRO F1 = 47.26; many models < 20 (Table 4)

Hallucination snowballs across rounds: previous hallucinations often reappear and grow.

Numbers: Category I (Halu&Halu) > others and increases with rounds (Figure 5)
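One way to probe the snowballing effect on your own transcripts is to count, for each round, how often a hallucinated round follows earlier hallucinated rounds. The sketch below assumes hypothetical per-round boolean labels (the summary does not describe the benchmark's label layout) and uses its own pair categories; only the "Halu&Halu" name comes from the source.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[bool, bool]  # (earlier round hallucinated, current round hallucinated)


def transition_counts(dialogues: List[List[bool]]) -> Dict[int, Counter]:
    """For every round index r >= 1, count (earlier, current) label pairs."""
    by_round: Dict[int, Counter] = {}
    for labels in dialogues:
        seen_halu = False
        for r, is_halu in enumerate(labels):
            if r > 0:
                by_round.setdefault(r, Counter())[(seen_halu, is_halu)] += 1
            # Remember whether any round so far was hallucinated.
            seen_halu = seen_halu or is_halu
    return by_round


# Hypothetical per-round labels for three short dialogues.
toy_dialogues = [
    [False, True, True],
    [False, False, False],
    [True, True, True],
]
counts = transition_counts(toy_dialogues)
for r in sorted(counts):
    # (True, True) corresponds to the "Halu&Halu" pattern.
    print(r, counts[r][(True, True)])
```

If the (True, True) count grows with the round index faster than the other pairs, that is consistent with errors carrying forward rather than appearing independently.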

Results

Dataset size: 1,103 dialogues

Average dialogue rounds: 6.9120 rounds

Overall hallucination rate: 43.16%

Domain hallucination rates: Knowledge 46.36%; Task 35.71%; Chit 37.64%; Reasoning 50.19%

Annotation agreement: Fleiss' Kappa = 0.8842

Best binary detection (overall F1): GPT-4 F1 = 50.14%

Representative baseline (Gemini1.5 PRO overall F1): 47.26%

Detector behavior (ChatGPT3.5): overconfident; F1 = 6.27%
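To see why an overconfident judge ends up with a low F1, it helps to work through the metric on a small case. This is a minimal sketch with made-up labels, not the benchmark's data; `precision_recall_f1` is a hypothetical helper name.

```python
from typing import List, Tuple


def precision_recall_f1(gold: List[bool], pred: List[bool]) -> Tuple[float, float, float]:
    """Binary metrics where True means 'dialogue contains a hallucination'."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Hypothetical labels: 4 of 10 dialogues are hallucinated.
gold = [True, True, True, True, False, False, False, False, False, False]
# An overconfident judge flags only one of them and nothing else.
pred = [True, False, False, False, False, False, False, False, False, False]

precision, recall, f1 = precision_recall_f1(gold, pred)
print(precision, recall, round(f1, 2))
```

In this toy case precision is perfect, but recall is 0.25, which pulls F1 down to 0.4. A judge that labels most samples as non-hallucinated fails in the same way.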

Who Should Care

What To Try In 7 Days

Run DiaHalu against your chatbot to get a dialogue-level error profile.

Test retrieval augmentation (search/RAG) on knowledge and reasoning dialogues.

Add few-shot or chain-of-thought prompts for your detector and compare F1 gains using DiaHalu.
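A rough harness for these experiments might look like the sketch below. Everything in it is an assumption made for illustration: the `(speaker, utterance)` turn layout, the instruction wording, and the stub detector that stands in for a real model call. None of it is the prompt or file format used by the authors.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance); this layout is an assumption

# Hypothetical instruction text, not the prompt used in the paper.
INSTRUCTION = (
    "Read the dialogue below and decide whether any turn contains a "
    "hallucination. Reply with 'yes' or 'no'.\n\n"
)


def build_prompt(turns: List[Turn]) -> str:
    """Flatten a multi-turn dialogue into a single detection prompt."""
    body = "\n".join(f"{speaker}: {text}" for speaker, text in turns)
    return f"{INSTRUCTION}{body}\n\nAnswer:"


def parse_verdict(reply: str) -> bool:
    """Treat any reply starting with 'yes' as 'hallucination detected'."""
    return reply.strip().lower().startswith("yes")


def run_detector(
    dialogues: List[List[Turn]], detector: Callable[[str], str]
) -> List[bool]:
    """Apply a text-in, text-out detector to every dialogue."""
    return [parse_verdict(detector(build_prompt(turns))) for turns in dialogues]


def stub_detector(prompt: str) -> str:
    """Stand-in for a real model call; replace with your own client."""
    return "Yes" if "capital of Australia is Sydney" in prompt else "No"


dialogues = [
    [("A", "What is the capital of Australia?"),
     ("B", "The capital of Australia is Sydney.")],
    [("A", "Can you book a table for two?"),
     ("B", "Sure, what time would you like?")],
]
predictions = run_detector(dialogues, stub_detector)
print(predictions)  # [True, False]
```

Swapping `INSTRUCTION` for a few-shot or chain-of-thought variant and keeping the rest fixed gives a like-for-like comparison between prompt styles.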

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Manual alignment of one speaker's turns is time-consuming and costly; generating human-like dialogue required repeated LLM calls.
  • Dataset is not split into train/validation/test; authors aimed for an evaluation benchmark, not a supervised training split.

When Not To Use

  • If you need a train/validation/test split for supervised learning.
  • If you only evaluate single-sentence factuality (sentence- or passage-level tasks).
  • For domains outside the four covered types without additional domain-specific data.

Failure Modes

  • Hallucination snowballing: early errors propagate and amplify across rounds.
  • Judge bias: closed-source LLMs (e.g., ChatGPT3.5) may be overconfident and label many samples 'non-hallucinated'.
  • Domain skew: reasoning and knowledge domains dominate hallucination rates, so overall results may under-represent chit-chat styles.

Core Entities

Models

  • ChatGPT3.5
  • GPT4
  • Gemini1.5 PRO
  • LLaMa-30B
  • Vicuna-33B

Metrics

  • Precision
  • Recall
  • F1
  • micro-F1
  • Fleiss' Kappa

Datasets

  • TruthfulQA
  • CommonsenseQA
  • CWQ
  • MultiWOZ (2.1)
  • DSTC
  • GSM8K
  • MathQA

Benchmarks

  • FactCollect
  • BEGIN
  • HADES
  • FactCHD
  • HaluEval
  • WikiBio+
  • PHD