AuthenHallu — a hallucination benchmark built from real human-LLM chats

October 12, 2025 (7 min read)

Overview

Decision Snapshot: Needs Validation

AuthenHallu fills a clear gap: it is built from real human-LLM chats rather than induced or simulated examples. The dataset is small (800 pairs) and English-only, which limits generalization. The paper's tables provide experimental evidence for the main claims.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 30%

Production readiness: 20%

Novelty: 70%

Authors

Yujie Ren, Niklas Gruhlke, Anne Lauscher

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Real user conversations show hallucinations are frequent and concentrated in specific topics (numbers, dates). Off-the-shelf LLMs miss many errors. Businesses should not assume benchmark performance carries over to live usage.

Summary TLDR

AuthenHallu is a new benchmark of 400 real LLM-human dialogues (800 query-response pairs) hand-labeled for hallucinations. In the dataset, 31.4% of pairs contain hallucinations, with fact-conflicting hallucinations the most common type. Off-the-shelf LLMs used zero-shot as detectors reach F1 scores of roughly 50–64% and fail on many faithfulness cases. The dataset and code are public.
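To make the zero-shot detector setup concrete, here is a minimal sketch of how a query-response pair could be wrapped in a prompt and the model's answer parsed into a binary label. The prompt wording, the function names, and the `generate` callable are all assumptions for illustration; the paper's actual prompt and parsing rules may differ.

```python
from typing import Callable

# Hypothetical prompt wording; not taken from the paper.
PROMPT_TEMPLATE = (
    "You are checking an assistant reply for hallucinations.\n"
    "User query:\n{query}\n\n"
    "Assistant response:\n{response}\n\n"
    "Does the response contain a hallucination? Answer 'yes' or 'no'."
)


def build_prompt(query: str, response: str) -> str:
    """Fill the (assumed) zero-shot template with one query-response pair."""
    return PROMPT_TEMPLATE.format(query=query, response=response)


def parse_verdict(raw_output: str) -> bool:
    """Return True if the first word of the model output is 'yes'."""
    words = raw_output.strip().lower().split()
    first = words[0].strip(".,:;!'\"") if words else ""
    return first == "yes"


def detect(query: str, response: str, generate: Callable[[str], str]) -> bool:
    """Run one pair through any text-generation callable and parse the label."""
    return parse_verdict(generate(build_prompt(query, response)))
```

Here `generate` stands in for whatever model client is in use. Treating anything other than a leading "yes" as "no hallucination" is a design choice that favors precision over recall and may need revisiting.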

Problem Statement

Existing hallucination benchmarks rely on induced or simulated examples that do not match how users actually interact with LLMs. That gap risks overestimating detector performance in real-world use.

Main Contribution

AuthenHallu: a hallucination detection benchmark built entirely from authentic LLM-human dialogues (400 dialogues, 800 pairs).

Statistical analysis of hallucination types and topic-specific rates in authentic interactions.

Key Findings

Hallucinations are common in real interactions.

Numbers: 251 / 800 query-response pairs = 31.4%

Practical Use: Expect roughly 1 in 3 LLM replies on real user data to contain a hallucination; include detection or human review in production.

Evidence Ref: Section 3.2; Table 2

Some topics are far worse than average.

Numbers: Math & Number Problems: 60% hallucination rate

Practical Use: Add stricter checks or tool-based computation for numeric and temporal queries.

Evidence Ref: Figure 2; Table 5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Hallucination prevalence (pairs) | 31.4% | - | - | AuthenHallu (800 pairs) | 251 hallucinated pairs out of 800 | Table 2; Section 3.2
Hallucination prevalence (dialogues) | 40.8% | - | - | AuthenHallu (400 dialogues) | 163 hallucinated dialogues out of 400 | Table 2; Section 3.2
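The two prevalence figures follow directly from the counts reported above. The sketch below reproduces that arithmetic and adds a Wilson score interval as one way to gauge how much uncertainty a sample of this size carries. The interval is an illustration added here, not something the paper is known to report.

```python
import math


def prevalence(hits: int, total: int) -> float:
    """Fraction of items labeled as hallucinated."""
    return hits / total


def wilson_interval(hits: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = hits / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


pair_rate = prevalence(251, 800)      # 0.31375, reported as 31.4%
dialogue_rate = prevalence(163, 400)  # 0.4075, reported as 40.8%
pair_low, pair_high = wilson_interval(251, 800)
```

The dialogue-level rate is higher than the pair-level rate because a dialogue counts as hallucinated if either of its two pairs is.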

What To Try In 7 Days

Label a small sample of your own user queries for hallucinations and compare the error patterns against AuthenHallu.

Add strict checks for numeric and time-related replies or route them to tools (calculators, calendars).

Combine automated detectors with human review in high-risk paths, focusing on recall.
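The routing idea for numeric and time-related queries can be sketched as a simple keyword-and-pattern check placed in front of the model. The patterns, route names, and the `route` function below are hypothetical and deliberately crude; a real deployment would likely need a classifier or much broader pattern coverage.

```python
import re

# Illustrative patterns only (assumptions, not from the paper).
NUMERIC_PATTERN = re.compile(
    r"\d|\b(?:calculate|sum|percent|average|how many|how much)\b",
    re.IGNORECASE,
)
TEMPORAL_PATTERN = re.compile(
    r"\b(?:today|tomorrow|yesterday|date|year|month|week|deadline|when)\b",
    re.IGNORECASE,
)


def route(query: str) -> str:
    """Pick a handling path for a user query before the LLM answers it."""
    if NUMERIC_PATTERN.search(query):
        return "calculator"  # numeric content: verify with a computation tool
    if TEMPORAL_PATTERN.search(query):
        return "calendar"    # temporal content: verify against a date source
    return "llm_only"        # no risky signal detected
```

For example, `route("What is 17 * 23?")` takes the calculator path, while a query with no digits or trigger words falls through to the model alone.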

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

LMSYS-Chat-1M (source dataset referenced in paper)
AuthenHallu (released via project GitHub)

Risks & Boundaries

Limitations

Manual labeling is difficult and shows only moderate inter-annotator agreement (Kappa=0.591).

Dataset is English-only and small (800 pairs), limiting coverage and statistical power.
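For readers who want to see what an agreement score like this measures, here is a minimal sketch of Fleiss' Kappa. It assumes the reported Kappa is Fleiss' Kappa (this page lists it among the context metrics) and that every item is rated by the same number of annotators. The function name and toy ratings are illustrative, not the paper's data.

```python
def fleiss_kappa(ratings: list[list[int]]) -> float:
    """Fleiss' Kappa.

    ratings[i][j] is the number of annotators who put item i in category j.
    Every row must sum to the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Share of all assignments that fell into each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of annotator pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    observed = sum(p_i) / n_items
    expected = sum(p * p for p in p_j)  # agreement expected by chance
    return (observed - expected) / (1 - expected)


# Toy example: 4 items, 3 annotators, categories = [hallucinated, not hallucinated].
toy = [[2, 1], [1, 2], [3, 0], [0, 3]]
kappa = fleiss_kappa(toy)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which is why a score near 0.59 is usually read as moderate.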

When Not To Use

Do not use AuthenHallu as the sole evidence for production readiness in high-stakes domains.

Do not assume results generalize to non-English usage.

Failure Modes

Annotation noise can lead to false negatives/positives in detector evaluation.

Topic imbalance may bias perceived detector strengths (overfitting to frequent topics).
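One way to see the topic-imbalance effect is to compare a detector's pooled (micro) recall with its per-topic (macro) average. The sketch below uses made-up topic names and records purely for illustration; it is not based on AuthenHallu's actual topic labels or results.

```python
from collections import defaultdict

Record = tuple[str, bool, bool]  # (topic, gold_is_hallucination, predicted)


def recall_by_topic(records: list[Record]) -> dict[str, float]:
    """Recall on hallucinated items, computed separately for each topic."""
    tp: dict[str, int] = defaultdict(int)
    fn: dict[str, int] = defaultdict(int)
    for topic, gold, pred in records:
        if gold:
            if pred:
                tp[topic] += 1
            else:
                fn[topic] += 1
    topics = set(tp) | set(fn)
    return {t: tp[t] / (tp[t] + fn[t]) for t in topics}


def micro_and_macro_recall(records: list[Record]) -> tuple[float, float]:
    """Pooled recall over all items vs. unweighted mean of per-topic recall."""
    per_topic = recall_by_topic(records)
    macro = sum(per_topic.values()) / len(per_topic)
    positives = [pred for _, gold, pred in records if gold]
    micro = sum(positives) / len(positives)
    return micro, macro


# Hypothetical data: a frequent topic detected well, a rare topic always missed.
toy_records = [("chat", True, True)] * 8 + [("math", True, False)] * 2
micro, macro = micro_and_macro_recall(toy_records)
```

In this toy case the pooled recall looks strong while the per-topic average exposes the weak topic, which is the bias the failure mode describes.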

Core Entities

Models

Mistral-7B-Instruct-v0.3, Gemma-3-27B-IT, Qwen-2.5-7B-Instruct, Qwen-3-32B, Llama-3.1-8B-Instruct, Llama-3.3-70B-Instruct

Metrics

precision, recall, F1-score
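As a reminder of how these three metrics relate for a binary hallucination detector, here is a small self-contained sketch. The function name and the example labels are illustrative, and the convention of returning 0.0 when a denominator is zero is an assumption rather than something the paper specifies.

```python
def precision_recall_f1(
    gold: list[bool], pred: list[bool]
) -> tuple[float, float, float]:
    """Precision, recall, and F1 with 'hallucinated' as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Toy labels: True means the pair contains a hallucination.
gold_labels = [True, True, True, False, False]
pred_labels = [True, False, True, True, False]
precision, recall, f1 = precision_recall_f1(gold_labels, pred_labels)
```

For production paths where missed hallucinations are costly, recall is the number to watch first, with precision governing how much human review load the detector creates.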

Datasets

AuthenHallu, LMSYS-Chat-1M

Benchmarks

HaluEval, DiaHalu, PHD, FELM, WildBench, WildHallucinations

Context Entities

Models

vicuna-13b, koala-13b, alpaca-13b, gpt-4, gpt-3.5-turbo, claude-1, claude2

Metrics

Fleiss' Kappa, pairwise F1

Datasets

LMSYS-Chat-1M

Benchmarks

HaluEval, WildHallucinations