Overview
AuthenHallu fills a clear gap: it is built from real human-LLM chats rather than induced or simulated examples, which is novel. The dataset is small (800 pairs) and English-only, which limits generalization. The experimental evidence in the paper's tables supports the main claims.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 7/7
Findings with evidence refs: 7/7
Results with explicit delta: 0/6
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 30%
Production readiness: 20%
Novelty: 70%
Why It Matters For Business
Real user conversations show hallucinations are frequent and concentrated in specific topics (numbers, dates). Off-the-shelf LLMs miss many errors. Businesses should not assume benchmark performance carries over to live usage.
Who Should Care
Summary TLDR
AuthenHallu is a new benchmark of 400 real LLM-human dialogues (800 query-response pairs) hand-labeled for hallucinations. In the dataset, 31.4% of pairs contain hallucinations, and fact-conflicting hallucinations are the most common type. Off-the-shelf LLMs used zero-shot as detectors reach F1 scores of roughly 50–64% and miss many faithfulness-related cases. The dataset and code are public.
Problem Statement
Existing hallucination benchmarks rely on induced or simulated examples that do not match how users actually interact with LLMs. That gap risks overestimating detector performance in real-world use.
Main Contribution
AuthenHallu: a hallucination detection benchmark built entirely from authentic LLM-human dialogues (400 dialogues, 800 pairs).
Statistical analysis of hallucination types and topic-specific rates in authentic interactions.
Key Findings
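Hallucinations are common in real interactions: 31.4% of query-response pairs and 40.8% of dialogues are labeled as hallucinated (Table 2).
Some topics, notably those involving numbers and dates, are far worse than average.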
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Hallucination prevalence (pairs) | 31.4% | — | — | AuthenHallu (800 pairs) | 251 hallucinated pairs out of 800 | Table 2; Section 3.2 |
| Hallucination prevalence (dialogues) | 40.8% | — | — | AuthenHallu (400 dialogues) | 163 hallucinated dialogues out of 400 | Table 2; Section 3.2 |
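The two prevalence figures follow directly from the counts in the table. As a rough sketch, the snippet below recomputes them and attaches a 95% Wilson score interval to show how much uncertainty a sample of this size carries. The interval is an addition for illustration, not something reported in the source, and it treats every pair as independent, which is an assumption (two pairs from the same dialogue are likely correlated).

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Counts taken from the Evidence column of the table.
counts = {"pairs": (251, 800), "dialogues": (163, 400)}

for unit, (hallucinated, total) in counts.items():
    rate = hallucinated / total
    low, high = wilson_interval(hallucinated, total)
    print(f"{unit}: {rate:.1%} (95% CI {low:.1%} to {high:.1%})")
```

Even at 800 pairs the interval spans several percentage points, which is worth keeping in mind when comparing against rates measured on your own traffic.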
What To Try In 7 Days
Label a small sample of your own user queries and responses for hallucinations, then compare the error patterns with those reported for AuthenHallu.
Add strict checks for numeric and time-related replies or route them to tools (calculators, calendars).
Combine automated detectors with human review in high-risk paths, focusing on recall.
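The suggestion to add strict checks for numeric and time-related replies can be prototyped with a simple router. The sketch below is only one possible starting point: the patterns, the function names `needs_strict_check` and `route`, and the `"verify"` / `"pass"` labels are all hypothetical, and a regex filter like this favors recall over precision.

```python
import re

# Hypothetical patterns; tune them for your own traffic.
# Any digit is treated as a numeric claim.
NUMERIC = re.compile(r"\d")
# Calendar words. "may" is left out on purpose because it would also
# match the modal verb and flag almost every reply.
TEMPORAL = re.compile(
    r"\b(?:january|february|march|april|june|july|august|september|"
    r"october|november|december|monday|tuesday|wednesday|thursday|friday|"
    r"saturday|sunday|today|tomorrow|yesterday)\b",
    re.IGNORECASE,
)


def needs_strict_check(reply: str) -> bool:
    """Return True when a reply contains digits or calendar words."""
    return bool(NUMERIC.search(reply) or TEMPORAL.search(reply))


def route(reply: str) -> str:
    """Send risky replies to a verification path, everything else straight through."""
    return "verify" if needs_strict_check(reply) else "pass"


if __name__ == "__main__":
    for reply in ["It costs 42 dollars.", "Paris is the capital of France."]:
        print(route(reply), "<-", reply)
```

Replies routed to `"verify"` could then go to a calculator, a calendar lookup, or a human reviewer, depending on the risk of the path.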
Risks & Boundaries
Limitations
Manual labeling is difficult, and inter-annotator agreement is only moderate (kappa = 0.591).
Dataset is English-only and small (800 pairs), limiting coverage and statistical power.
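To make the agreement figure concrete, here is a minimal sketch of how a kappa statistic is computed for two annotators. The source does not say which kappa variant was used; Cohen's kappa is assumed here, and the function name and toy labels are purely illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if both annotated at random with their own label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy labels, not taken from the dataset: "h" = hallucinated, "n" = not.
    annotator_1 = ["h", "h", "n", "n", "n", "n", "h", "n", "n", "n"]
    annotator_2 = ["h", "n", "n", "n", "n", "n", "h", "h", "n", "n"]
    print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.3f}")
```

Because kappa discounts agreement that would occur by chance, it can be noticeably lower than the raw percentage of matching labels.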
When Not To Use
Do not use AuthenHallu as the sole evidence for production readiness in high-stakes domains.
Do not assume results generalize to non-English usage.
Failure Modes
Annotation noise can lead to false negatives/positives in detector evaluation.
Topic imbalance may bias perceived detector strengths (overfitting to frequent topics).
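One way to guard against the topic-imbalance failure mode is to report F1 per topic and a macro average alongside the pooled score. The sketch below assumes each evaluation record is a `(topic, gold, predicted)` tuple with `True` meaning "hallucinated"; that record layout, the function names, and the topic names in the example are hypothetical and not the dataset's actual schema.

```python
from collections import defaultdict


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_topic_f1(records: list[tuple[str, bool, bool]]) -> dict[str, float]:
    """Compute detector F1 separately for each topic."""
    counts = defaultdict(lambda: [0, 0, 0])  # topic -> [tp, fp, fn]
    for topic, gold, pred in records:
        c = counts[topic]  # also registers topics that only have true negatives
        if gold and pred:
            c[0] += 1
        elif pred and not gold:
            c[1] += 1
        elif gold and not pred:
            c[2] += 1
    return {topic: f1(*c) for topic, c in counts.items()}


def macro_f1(records: list[tuple[str, bool, bool]]) -> float:
    """Unweighted mean of per-topic F1, so rare topics count as much as frequent ones."""
    scores = per_topic_f1(records)
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    # Toy records with made-up topics.
    records = [
        ("math", True, False),
        ("math", True, True),
        ("chat", True, True),
        ("chat", False, False),
    ]
    print(per_topic_f1(records))
    print(f"macro F1 = {macro_f1(records):.3f}")
```

A large gap between the pooled F1 and the macro average suggests that frequent topics are dominating the headline number.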

