DiaHalu: 1,103 multi-turn dialogues to test hallucination in chat-style LLMs

Overview

Decision Snapshot: Needs Validation

DiaHalu is ready for evaluation and analysis but not designed as a training split; use it to audit chat systems and benchmark detectors rather than to train production models.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/8

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Kedi Chen, Qin Chen, Jie Zhou, Yishen He, Liang He

Why It Matters For Business

If you deploy chatbots, a large share of multi-turn sessions may contain hallucinations: 43.16% of the dialogues in DiaHalu contain at least one. DiaHalu helps quantify and reproduce those failures so you can prioritize fixes for knowledge and reasoning flows.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

DiaHalu is a new dialogue-level benchmark for hallucination in large language models (LLMs). It contains 1,103 multi-turn dialogues (avg 6.91 rounds) across four domains (knowledge-grounded, task-oriented, chit-chat, reasoning) and five hallucination subtypes (Non-factual, Incoherence, Irrelevance, Overreliance, Reasoning Error). Samples were generated by ChatGPT3.5 and GPT4, manually cleaned, and labeled by trained annotators (Fleiss' Kappa 0.8842). Overall 43.16% of dialogues contain at least one hallucination. Existing detectors and many LLMs struggle: GPT-4 reaches ~50.1% F1 on binary detection while most models are far lower, showing this dataset is challenging for dialogue-level faith/

Problem Statement

Current hallucination benchmarks are often sentence- or passage-level, hand-triggered, and focus on factual errors. Real chat systems produce multi-turn, context-dependent errors (including faithfulness problems like incoherence, irrelevance, overreliance) that existing datasets and detectors miss. DiaHalu fills this gap by providing naturally generated, annotated multi-turn dialogues covering these problems.

Main Contribution

A dialogue-level hallucination benchmark (DiaHalu) with 1,103 multi-turn dialogues for LLM evaluation.

Coverage of four dialogue domains and five hallucination subtypes, including faithfulness types rarely covered before.

Key Findings

Dataset size and structure: 1,103 multi-turn dialogue samples with average 6.912 rounds.

Numbers: 1,103 samples; avg rounds 6.9120 (Table 2)

Practical Use: Use DiaHalu to stress-test multi-turn behavior, not single-turn factuality.

Evidence Ref: Table 2

High incidence of hallucination: 43.16% of dialogues contain at least one hallucination.

Numbers: 476/1103 = 43.16% (Table 3)

Practical Use: Expect frequent hallucination in multi-turn chat; plan detection or mitigation on the assumption that roughly four in ten sessions are affected.

Evidence Ref: Table 3
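The incidence figure above is a simple ratio of hallucinated dialogues to all dialogues. A minimal sketch of that arithmetic follows; the function name `hallucination_rate` is hypothetical and used only for illustration, while the counts 476 and 1,103 come from the finding itself.

```python
def hallucination_rate(num_hallucinated: int, num_total: int) -> float:
    """Return the share of dialogues with at least one hallucination, in percent."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return 100.0 * num_hallucinated / num_total


# Counts reported for DiaHalu: 476 hallucinated dialogues out of 1,103.
rate = hallucination_rate(476, 1103)
print(f"{rate:.2f}%")  # prints 43.16%
```

The same function can be applied per domain or per subtype once the corresponding counts are available.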

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size	1,103 dialogues	—	—	—	Total samples produced by ChatGPT3.5 and GPT4 and post-processed (Sec. 4.2, A.5)	Table 2
Average dialogue rounds	6.9120 rounds	—	—	—	Mean rounds per dialogue (Table 2)	Table 2

What To Try In 7 Days

Run DiaHalu against your chatbot to get a dialogue-level error profile.

Test retrieval augmentation (search/RAG) on knowledge and reasoning dialogues.

Add few-shot or chain-of-thought prompts for your detector and compare F1 gains using DiaHalu.
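Comparing detector variants, as suggested above, comes down to scoring each variant's dialogue-level predictions against the gold labels. The sketch below computes precision, recall, and F1 for the positive class. It assumes labels are encoded as 1 for "hallucinated" and 0 for "non-hallucinated"; that encoding, the function name `binary_prf`, and the toy label lists are illustrative assumptions, not the DiaHalu file format or real results.

```python
from typing import Sequence, Tuple


def binary_prf(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive class (1 = hallucinated)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data (hypothetical): one label per dialogue.
gold = [1, 0, 1, 1, 0, 0]
baseline_pred = [1, 0, 0, 0, 0, 1]   # e.g. a zero-shot detector
prompted_pred = [1, 0, 1, 0, 0, 1]   # e.g. the same detector with few-shot prompts

_, _, f1_base = binary_prf(gold, baseline_pred)
_, _, f1_prompted = binary_prf(gold, prompted_pred)
print(f"baseline F1={f1_base:.3f}, prompted F1={f1_prompted:.3f}, "
      f"gain={f1_prompted - f1_base:+.3f}")
```

Reporting the gain alongside both absolute F1 values keeps the comparison interpretable when the baseline is weak.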

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/ECNU-ICALK/DiaHalu

Data URLs

https://github.com/ECNU-ICALK/DiaHalu

Risks & Boundaries

Limitations

Manual alignment of one speaker's turns is time-consuming and costly; generating human-like dialogue required repeated LLM calls.

Dataset is not split into train/validation/test; authors aimed for an evaluation benchmark, not a supervised training split.

When Not To Use

If you need a train/validation/test split for supervised learning.

If you only evaluate single-sentence factuality (sentence- or passage-level tasks).

Failure Modes

Hallucination snowballing: early errors propagate and amplify across rounds.

Judge bias: closed-source LLMs (e.g., ChatGPT3.5) may be overconfident and label many samples 'non-hallucinated'.

Core Entities

Models

ChatGPT3.5, GPT4, Gemini1.5 PRO, LLaMa-30B, Vicuna-33B

Metrics

Precision, Recall, F1, micro-F1, Fleiss' Kappa
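Fleiss' Kappa, the agreement metric listed above, measures how far annotator agreement exceeds what chance would produce. The sketch below implements the standard formula from a table of per-item category counts. The function name `fleiss_kappa`, the toy table, and the choice of three raters and two categories are assumptions for illustration; the source does not state the annotation setup in this form.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa from an items x categories table of rater counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    num_items = len(counts)
    if num_items == 0:
        raise ValueError("counts must contain at least one item")
    raters = sum(counts[0])
    if raters < 2 or any(sum(row) != raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: mean of per-item pairwise agreement.
    per_item = [
        (sum(c * c for c in row) - raters) / (raters * (raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / num_items

    # Chance agreement: sum of squared overall category proportions.
    num_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (num_items * raters)
        for j in range(num_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy table (hypothetical): 4 dialogues, 3 raters,
# columns = [hallucinated, non-hallucinated].
toy = [[3, 0], [0, 3], [3, 0], [2, 1]]
print(fleiss_kappa(toy))
```

Values near 1 indicate near-perfect agreement, which is the range the reported annotator score of 0.8842 falls in.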

Datasets

TruthfulQA, CommonsenseQA, CWQ, MultiWOZ (2.1), DSTC, GSM8K, MathQA

Benchmarks

FactCollect, BEGIN, HADES, FactCHD, HaluEval, WikiBio+, PHD
