DiaHalu: 1,103 multi-turn dialogues to test hallucination in chat-style LLMs

March 1, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

DiaHalu is ready for evaluation and analysis but not designed as a training split; use it to audit chat systems and benchmark detectors rather than to train production models.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/8

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Kedi Chen, Qin Chen, Jie Zhou, Yishen He, Liang He

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you deploy chatbots, a large share of multi-turn sessions may contain hallucinations (43.16% of DiaHalu's dialogues do); DiaHalu helps you quantify and reproduce those failures so you can prioritize fixes for knowledge and reasoning flows.

Summary TLDR

DiaHalu is a new dialogue-level benchmark for hallucination in large language models (LLMs). It contains 1,103 multi-turn dialogues (avg 6.91 rounds) across four domains (knowledge-grounded, task-oriented, chit-chat, reasoning) and five hallucination subtypes (Non-factual, Incoherence, Irrelevance, Overreliance, Reasoning Error). Samples were generated by ChatGPT3.5 and GPT4, manually cleaned, and labeled by trained annotators (Fleiss' Kappa 0.8842). Overall, 43.16% of dialogues contain at least one hallucination. Existing detectors and many LLMs struggle: GPT4 reaches ~50.1% F1 on binary detection while most models score far lower, showing this dataset is challenging for dialogue-level faith/

Problem Statement

Current hallucination benchmarks are often sentence- or passage-level, hand-triggered, and focus on factual errors. Real chat systems produce multi-turn, context-dependent errors (including faithfulness problems like incoherence, irrelevance, overreliance) that existing datasets and detectors miss. DiaHalu fills this gap by providing naturally generated, annotated multi-turn dialogues covering these problems.

Main Contribution

A dialogue-level hallucination benchmark (DiaHalu) with 1,103 multi-turn dialogues for LLM evaluation.

Coverage of four dialogue domains and five hallucination subtypes, including faithfulness types rarely covered before.

Key Findings

Dataset size and structure: 1,103 multi-turn dialogue samples with an average of 6.912 rounds each.

Numbers: 1,103 samples; avg rounds 6.9120 (Table 2)

Practical Use: Use DiaHalu to stress-test multi-turn behavior, not single-turn factuality.

Evidence Ref: Table 2

High incidence of hallucination: 43.16% of dialogues contain at least one hallucination.

Numbers: 476/1103 = 43.16% (Table 3)

Practical Use: Expect frequent hallucination in multi-turn chat; plan detection or mitigation for a large share of sessions (about 43% in this benchmark).

Evidence Ref: Table 3
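
The headline rate follows directly from the Table 3 counts quoted above. A minimal Python sketch of that arithmetic, with illustrative variable names that do not come from the paper's code:

```python
# Counts quoted in this summary (Table 3); variable names are illustrative.
hallucinated = 476
total = 1103

# Share of dialogues that contain at least one hallucination.
rate = hallucinated / total

# Remaining dialogues were labeled as free of hallucination.
clean = total - hallucinated

print(f"hallucinated: {hallucinated} dialogues ({rate:.2%})")
print(f"non-hallucinated: {clean} dialogues ({clean / total:.2%})")
```

The complement (627 dialogues) is worth keeping in mind when reading detector scores, because the two classes are not balanced.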

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Dataset size | 1,103 dialogues | - | - | - | Total samples produced by ChatGPT3.5 and GPT4 and post-processed (Sec. 4.2, A.5) | Table 2
Average dialogue rounds | 6.9120 rounds | - | - | - | Mean rounds per dialogue (Table 2) | Table 2

What To Try In 7 Days

Run DiaHalu against your chatbot to get a dialogue-level error profile.

Test retrieval augmentation (search/RAG) on knowledge and reasoning dialogues.

Add few-shot or chain-of-thought prompts for your detector and compare F1 gains using DiaHalu.
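
One way to compare detectors on dialogue-level labels is sketched below. It assumes each dialogue is a list of turn strings with a binary gold label (1 = hallucinated); that representation, the toy dialogues, and `keyword_detector` are all hypothetical stand-ins, since the benchmark's actual file format is not described here. Swap in your own loader and detector.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: one string per turn.
Dialogue = List[str]


def binary_prf(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive class (1 = hallucinated)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def evaluate(detector: Callable[[Dialogue], int],
             dialogues: List[Dialogue],
             gold: List[int]) -> Tuple[float, float, float]:
    """Run a detector over whole dialogues and score its binary predictions."""
    preds = [detector(d) for d in dialogues]
    return binary_prf(gold, preds)


# Toy data for illustration only; not taken from DiaHalu.
toy_dialogues = [
    ["A: Who wrote Hamlet?", "B: Charles Dickens wrote Hamlet."],
    ["A: Hi!", "B: Hello, how can I help?"],
    ["A: What is 2 + 2?", "B: 2 + 2 = 5."],
    ["A: Book a table for two.", "B: Done, a table for two is booked."],
]
toy_gold = [1, 0, 1, 0]


def keyword_detector(dialogue: Dialogue) -> int:
    """Deliberately weak placeholder detector: flags one known wrong answer."""
    return int(any("Dickens" in turn for turn in dialogue))


p, r, f1 = evaluate(keyword_detector, toy_dialogues, toy_gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Running the same `evaluate` call with a baseline prompt and then with a few-shot or chain-of-thought prompt gives directly comparable F1 numbers.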

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Manual alignment of one speaker's turns is time-consuming and costly; generating human-like dialogue required repeated LLM calls.

The dataset is not split into train/validation/test sets; the authors intended it as an evaluation benchmark, not a supervised training set.

When Not To Use

If you need a train/validation/test split for supervised learning.

If you only evaluate single-sentence factuality (sentence- or passage-level tasks).

Failure Modes

Hallucination snowballing: early errors propagate and amplify across rounds.

Judge bias: closed-source LLMs (e.g., ChatGPT3.5) may be overconfident and label many samples 'non-hallucinated'.
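
To see why an overconfident judge is a problem, consider the limiting case: a hypothetical judge that labels every dialogue 'non-hallucinated'. The sketch below uses only the Table 3 counts quoted earlier; the degenerate judge is an illustrative assumption, not a model evaluated in the paper.

```python
# Counts quoted in this summary (Table 3).
positives = 476            # dialogues with at least one hallucination
total_dialogues = 1103
negatives = total_dialogues - positives

# Hypothetical degenerate judge: predicts "non-hallucinated" for everything.
tp, fp = 0, 0
fn, tn = positives, negatives

accuracy = (tp + tn) / total_dialogues
recall = tp / (tp + fn)
# Precision is undefined with no positive predictions; F1 is reported as 0 here.
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

print(f"accuracy={accuracy:.2%} recall={recall:.2%} f1={f1:.2%}")
```

Accuracy alone would make such a judge look better than chance, which is why F1 on the hallucinated class is the more informative number for this benchmark.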

Core Entities

Models

ChatGPT3.5, GPT4, Gemini1.5 PRO, LLaMa-30B, Vicuna-33B

Metrics

Precision, Recall, F1, micro-F1, Fleiss' Kappa
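
Fleiss' Kappa measures agreement among a fixed number of annotators beyond what chance would produce. The sketch below is a plain-Python version of the standard formula; the toy rating table and the choice of three annotators are assumptions for illustration, since the annotation setup is not detailed in this summary.

```python
from typing import List


def fleiss_kappa(ratings: List[List[int]]) -> float:
    """Fleiss' kappa.

    ratings[i][j] is the number of annotators who assigned item i to
    category j. Every item must be rated by the same number of annotators.
    """
    n_items = len(ratings)
    n_cats = len(ratings[0])
    n_raters = sum(ratings[0])
    if n_raters < 2:
        raise ValueError("need at least two annotators per item")
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number of annotators")

    # Observed agreement: mean over items of the pairwise agreement rate.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_obs = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_exp = sum(p * p for p in p_cats)

    if p_exp == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_obs - p_exp) / (1 - p_exp)


# Toy table: 4 dialogues, 3 hypothetical annotators,
# columns = [non-hallucinated, hallucinated].
toy_ratings = [
    [3, 0],
    [0, 3],
    [3, 0],
    [2, 1],
]
kappa = fleiss_kappa(toy_ratings)
print(f"kappa = {kappa:.3f}")
```

Values close to 1 indicate near-complete agreement; the 0.8842 reported for DiaHalu sits in the range usually read as strong agreement.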

Datasets

TruthfulQA, CommonsenseQA, CWQ, MultiWOZ (2.1), DSTC, GSM8K, MathQA

Benchmarks

FactCollect, BEGIN, HADES, FactCHD, HaluEval, WikiBio+, PHD