AuthenHallu — a hallucination benchmark built from real human-LLM chats

Overview

Decision Snapshot: Needs Validation

AuthenHallu fills a clear gap by building its benchmark from real human-LLM chats, which is its main novelty. The dataset is small (800 pairs) and English-only, which limits generalization. The experimental results reported in the tables support the main claims.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 30%

Production readiness: 20%

Novelty: 70%

Authors

Yujie Ren, Niklas Gruhlke, Anne Lauscher

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Real user conversations show hallucinations are frequent and concentrated in specific topics (numbers, dates). Off-the-shelf LLMs miss many errors. Businesses should not assume benchmark performance carries over to live usage.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

AuthenHallu is a new benchmark of 400 real human-LLM dialogues (800 query-response pairs), hand-labeled for hallucinations. In the dataset, 31.4% of pairs contain hallucinations, and fact-conflicting hallucinations are the most common type. Off-the-shelf LLMs used zero-shot as detectors reach F1 scores of roughly 50–64% and fail on many faithfulness cases. The dataset and code are public.
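
The F1 scores quoted above combine precision and recall on the binary hallucinated / not-hallucinated decision. The following is a minimal sketch of that computation, assuming labels are encoded as 1 (hallucinated) and 0 (not hallucinated). The function name and the toy labels are illustrative and are not taken from the paper or its code.

```python
def precision_recall_f1(gold, predicted):
    """Binary precision, recall, and F1 with 1 (hallucinated) as the positive class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # caught hallucinations
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false alarms
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # missed hallucinations
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy labels for illustration only (not AuthenHallu data).
gold = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

In a review pipeline, recall is the number to watch, since it measures how many real hallucinations the detector misses.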

Problem Statement

Existing hallucination benchmarks rely on induced or simulated examples that do not match how users actually interact with LLMs. That gap risks overestimating detector performance in real-world use.

Main Contribution

AuthenHallu: a hallucination detection benchmark built entirely from authentic LLM-human dialogues (400 dialogues, 800 pairs).

Statistical analysis of hallucination types and topic-specific rates in authentic interactions.

Key Findings

Hallucinations are common in real interactions.

Numbers: 251 / 800 query-response pairs = 31.4%

Practical Use: Expect ~1 in 3 LLM replies to be problematic on real user data; include detection or human review in production.

Evidence Ref: Section 3.2; Table 2

Some topics are far worse than average.

Numbers: Math & Number Problems: 60% hallucination rate

Practical Use: Add stricter checks or tool-based computation for numeric and temporal queries.

Evidence Ref: Figure 2; Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Hallucination prevalence (pairs)	31.4%	—	—	AuthenHallu (800 pairs)	251 hallucinated pairs out of 800	Table 2; Section 3.2
Hallucination prevalence (dialogues)	40.8%	—	—	AuthenHallu (400 dialogues)	163 hallucinated dialogues out of 400	Table 2; Section 3.2
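
Because the dataset is small, it can help to attach an uncertainty range to the prevalence figures in the table above. The sketch below recomputes both rates from the reported counts and adds a Wilson score interval. The interval is an illustration added here, not a figure from the paper, and the function name is hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Counts as reported in the results table.
pair_rate = 251 / 800
dialogue_rate = 163 / 400
pair_low, pair_high = wilson_interval(251, 800)
dialogue_low, dialogue_high = wilson_interval(163, 400)

print(f"pairs: interval {pair_low:.3f} to {pair_high:.3f}")
print(f"dialogues: interval {dialogue_low:.3f} to {dialogue_high:.3f}")
```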

What To Try In 7 Days

Compare a small sample of your own user queries against AuthenHallu's error patterns (for example, by hallucination type and topic).

Add strict checks for numeric and time-related replies or route them to tools (calculators, calendars).

Combine automated detectors with human review in high-risk paths, focusing on recall.
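
The routing idea in the second item can be prototyped with simple pattern matching before investing in a classifier. This is a rough sketch under stated assumptions: the keyword lists, the route names ("calendar", "calculator", "llm"), and the function name are all hypothetical, and a production system would need a far more robust detector.

```python
import re

# Hypothetical keyword lists; tune them to your own traffic.
TEMPORAL = re.compile(
    r"\b(today|tomorrow|yesterday|date|day|week|month|year|deadline)\b",
    re.IGNORECASE,
)
NUMERIC = re.compile(
    r"\d|\b(calculate|sum|average|percent|how many|how much)\b",
    re.IGNORECASE,
)


def route_query(query):
    """Pick a handler name for a user query; temporal cues take priority."""
    if TEMPORAL.search(query):
        return "calendar"
    if NUMERIC.search(query):
        return "calculator"
    return "llm"


routes = [
    route_query("What is 15% of 240?"),
    route_query("What day of the week is the deadline?"),
    route_query("Write a short poem about autumn"),
]
```

Checking temporal cues first is a design choice for this sketch: queries such as date arithmetic contain digits too, and sending them to a calendar tool is assumed to be the safer default.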

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/TAI-HAMBURG/AuthenHallu

Data URLs

LMSYS-Chat-1M (source dataset referenced in paper)

AuthenHallu (released via project GitHub)

Risks & Boundaries

Limitations

Manual labeling is difficult, and the labels show only moderate inter-annotator agreement (Kappa = 0.591).

Dataset is English-only and small (800 pairs), limiting coverage and statistical power.
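
To make the agreement figure concrete, here is a minimal sketch of Fleiss' Kappa, assuming that is the statistic behind the reported value (this page lists Fleiss' Kappa among its metrics). The input layout and the toy counts are illustrative, not AuthenHallu annotations.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa for counts[i][j] = annotators who put item i in category j.

    Every item must be rated by the same number of annotators (at least 2).
    """
    if not counts:
        raise ValueError("counts must contain at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_items = len(counts)
    n_categories = len(counts[0])
    # Share of agreeing annotator pairs for each item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items
    # Agreement expected by chance from the overall category shares.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(s * s for s in shares)
    if expected == 1.0:
        return float("nan")  # only one category was ever used
    return (observed - expected) / (1 - expected)


# Toy example: 4 items, 3 annotators, categories [hallucinated, not hallucinated].
toy_counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(toy_counts)
```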

When Not To Use

Do not use AuthenHallu as the sole evidence for production readiness in high-stakes domains.

Do not assume results generalize to non-English usage.

Failure Modes

Annotation noise can lead to false negatives/positives in detector evaluation.

Topic imbalance may bias perceived detector strengths (overfitting to frequent topics).
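
One way to guard against the topic-imbalance failure mode is to report detector recall per topic and as a macro average alongside the pooled number. The sketch below assumes each record is a (topic, gold, predicted) triple with 1 meaning hallucinated. The topic names and records are made up for illustration.

```python
from collections import defaultdict


def recall_by_topic(records):
    """Recall per topic, for topics with at least one gold hallucination."""
    hits = defaultdict(int)
    positives = defaultdict(int)
    for topic, gold, predicted in records:
        if gold == 1:
            positives[topic] += 1
            if predicted == 1:
                hits[topic] += 1
    return {topic: hits[topic] / positives[topic] for topic in positives}


# Hypothetical records: a frequent topic the detector handles well,
# and a rare topic where it misses half of the hallucinations.
records = [
    ("general", 1, 1), ("general", 1, 1), ("general", 1, 1), ("general", 0, 0),
    ("math", 1, 0), ("math", 1, 1),
]
per_topic = recall_by_topic(records)

gold_positive = [r for r in records if r[1] == 1]
micro_recall = sum(1 for r in gold_positive if r[2] == 1) / len(gold_positive)
macro_recall = sum(per_topic.values()) / len(per_topic)
```

Here the pooled (micro) recall looks better than the macro average because the frequent topic dominates the count, which is the bias the failure mode describes.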

Core Entities

Models

Mistral-7B-Instruct-v0.3, Gemma-3-27B-IT, Qwen-2.5-7B-Instruct, Qwen-3-32B, Llama-3.1-8B-Instruct, Llama-3.3-70B-Instruct

Metrics

precision, recall, F1-score

Datasets

AuthenHallu, LMSYS-Chat-1M

Benchmarks

HaluEval, DiaHalu, PHD, FELM, WildBench, WildHallucinations

Context Entities

Models

vicuna-13b, koala-13b, alpaca-13b, gpt-4, gpt-3.5-turbo, claude-1, claude2

Metrics

Fleiss' Kappa, pairwise F1

Datasets

LMSYS-Chat-1M

Benchmarks

HaluEval, WildHallucinations
