LLM 'hallucinations' are narrative-rich confabulations that can improve coherence and may be useful

Overview

Decision Snapshot: Needs Validation

The paper shows consistent statistical associations across three public benchmarks using automatic metrics, but offers no human user studies and does not release tooling, so results are promising but preliminary.

Citations: 11

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 1/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Peiqi Sui, Eamon Duede, Sophie Wu, Richard Jean So

Links

Abstract / PDF

Why It Matters For Business

Hallucinations often produce more coherent, story-like text; that trait can be useful for product flows that prioritize readability, persuasion, or ideation, but it creates risk in truth-sensitive domains and needs human validation.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, Founder

Summary TLDR

The paper argues that many LLM 'hallucinations' are better described as confabulations: narrative-rich, coherent outputs that fill gaps with plausible details. Using an ELECTRA-large story detector across three dialog benchmarks (FaithDial, BEGIN, HaluEval), the authors show that hallucinated responses score higher on narrativity than factual responses and that narrativity predicts hallucination labels (logistic coeff=0.631, p<0.01). Narrativity also correlates with automated dialogue coherence (beta coeff=0.372, p<0.01). The authors propose reframing hallucinations as a usable resource, while warning that human studies and domain-specific safeguards are needed before adoption.

Problem Statement

Hallucinations in LLMs are usually treated as purely harmful. The paper asks whether these outputs instead express a narrative impulse (confabulation) that increases narrativity and coherence, and whether that property can be measured and potentially used rather than only suppressed.

Main Contribution

Operationalize narrativity as a scalar score using a fine-tuned ELECTRA-large story detector trained on an expert Reddit story dataset

Empirically show hallucinated dialog outputs have higher narrativity than truthful outputs across FaithDial, BEGIN, and HaluEval benchmarks
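
As a rough illustration of what a scalar narrativity score could look like, the sketch below turns a two-class detector's raw logits into a "story" probability with a softmax. The logit values and the class ordering (index 1 = story) are assumptions made for illustration; they are not taken from the paper, and the function name is hypothetical.

```python
import math

def narrativity_from_logits(logits, story_index=1):
    """Return the softmax probability of the 'story' class.

    Assumption: the detector emits one logit per class and
    `story_index` marks the story class (hypothetical ordering).
    """
    # Subtract the max logit for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    return exps[story_index] / sum(exps)

# Hypothetical logits for a single response: [non-story, story].
score = narrativity_from_logits([0.2, 0.9])
print(f"narrativity = {score:.3f}")
```

With two classes this reduces to a sigmoid of the logit difference, so the score always falls between 0 and 1 and can be compared across responses.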

Key Findings

Hallucinated dialog responses have higher mean narrativity than truthful responses on evaluated benchmarks

Numbers: FaithDial mean 0.620 vs truth 0.518 (Δ≈0.102); HaluEval 0.655 vs 0.638 (Δ≈0.017); BEGIN 0.658 vs 0.561 (Δ≈0.097)

Practical Use: You can detect higher narrative content in model outputs; consider scoring narrativity when tuning or filtering outputs, especially for creative or explanatory tasks

Evidence Ref: Table 2
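
The gaps quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the means reported in this summary; the dictionary and function names are illustrative.

```python
# Mean narrativity as reported in this summary: (hallucinated, truthful).
reported_means = {
    "FaithDial": (0.620, 0.518),
    "HaluEval": (0.655, 0.638),
    "BEGIN": (0.658, 0.561),
}

def narrativity_gaps(means):
    """Return the hallucinated-minus-truthful gap per benchmark, rounded to 3 places."""
    return {name: round(h - t, 3) for name, (h, t) in means.items()}

gaps = narrativity_gaps(reported_means)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.3f}")
```

The HaluEval gap is roughly one sixth of the FaithDial gap, so the effect is not uniform across benchmarks.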

Higher narrativity significantly predicts an output being labeled a hallucination

Numbers: Logistic regression coefficient for narrativity = 0.631 (std err 0.059), p < 0.01

Practical Use: Narrativity is a usable feature for classifiers or filters that flag likely hallucinations in dialog systems

Evidence Ref: Table 3
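
To make the coefficient easier to interpret, the sketch below converts it to an odds ratio and derives a Wald z-statistic and an approximate 95% confidence interval from the reported standard error. It assumes the usual normal approximation; how the narrativity predictor was scaled (raw or standardized) is not stated in this summary, so the meaning of "one unit" is an assumption.

```python
import math

coef = 0.631     # logistic regression coefficient reported for narrativity
std_err = 0.059  # reported standard error

# Odds ratio for a one-unit increase in the narrativity predictor.
odds_ratio = math.exp(coef)

# Wald z-statistic and an approximate 95% interval (normal approximation).
z_stat = coef / std_err
ci_low = math.exp(coef - 1.96 * std_err)
ci_high = math.exp(coef + 1.96 * std_err)

print(f"odds ratio = {odds_ratio:.2f}, z = {z_stat:.1f}, "
      f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

An interval on the odds ratio that stays above 1 is consistent with the reported p < 0.01, but it says nothing about how well narrativity alone would separate the two classes in practice.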

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Narrativity mean (hallucinated vs truth)	FaithDial 0.620 vs 0.518; HaluEval 0.655 vs 0.638; BEGIN 0.658 vs 0.561	Truthful responses	Δ≈0.102 / 0.017 / 0.097	FaithDial, HaluEval, BEGIN	Table 2 reports means and counts for narrativity by label	Table 2
Predictive power of narrativity for hallucination	Logistic regression coeff = 0.631 (std err 0.059)	No narrativity feature	—	FaithDial + BEGIN aggregated (43,842 observations)	Table 3 shows positive significant coefficient (p<0.01)	Table 3

What To Try In 7 Days

Score existing model outputs with a narrativity detector to profile narrative intensity

A/B test high-narrativity vs low-narrativity outputs on user satisfaction for explanatory UI copy

Add narrativity as a feature in hallucination detectors and monitor flagged outputs
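
The profiling and flagging steps above could be prototyped along the following lines. This is a sketch under stated assumptions: `profile_and_flag` and `toy_scorer` are hypothetical names, the cue-word scorer is only a stand-in for a trained story detector, and the 0.6 threshold is an arbitrary placeholder rather than a tuned value.

```python
from statistics import mean

def profile_and_flag(outputs, scorer, threshold=0.6):
    """Score each output and flag those at or above `threshold`.

    `scorer` is any callable mapping text to a score in [0, 1];
    the default threshold is a placeholder, not a tuned value.
    """
    scored = [(text, scorer(text)) for text in outputs]
    flagged = [text for text, score in scored if score >= threshold]
    return {
        "mean_narrativity": mean(score for _, score in scored),
        "flagged": flagged,
    }

def toy_scorer(text):
    """Hypothetical stand-in for a real story detector.

    Counts a few narrative cue words; use a trained model in practice.
    """
    cues = {"then", "suddenly", "after", "remembered", "once"}
    words = text.lower().split()
    hits = sum(1 for w in words if w.strip(".,!?") in cues)
    return min(1.0, hits / 3)

report = profile_and_flag(
    [
        "The capital of France is Paris.",
        "Once I visited Paris, then suddenly remembered the tower.",
    ],
    scorer=toy_scorer,
)
print(report)
```

Passing the scorer in as a callable keeps the profiling logic independent of whichever detector is eventually used, so the toy function can be swapped for a model-backed one without changing the rest.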

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Correlation, not causation: analyses show association between narrativity and coherence but do not prove narrativity causes coherence

No human-subject experiments: user benefits of narrative-rich confabulations are hypothesized but untested

When Not To Use

In truth-sensitive applications (medicine, law, finance) where factual accuracy is mandatory

When stakeholders require verifiable citations or provenance for assertions

Failure Modes

Confabulations that read well but are false, leading to persuasive misinformation

High narrativity masking factual errors and reducing detectability by users

Core Entities

Models

ELECTRA-large, RoBERTa-large (used in DEAM reference)

Metrics

Narrativity score (story-detection softmax), Coherence (DEAM)

Datasets

FaithDial, BEGIN, HaluEval, Wizard of Wikipedia (WoW) (source for FaithDial)

Benchmarks

FaithDial, BEGIN, HaluEval
