LLM 'hallucinations' are narrative-rich confabulations that can improve coherence and may be useful

June 6, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

The paper shows consistent statistical associations across three public benchmarks using automatic metrics, but offers no human user studies and does not release tooling, so results are promising but preliminary.

Citations: 11

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 1/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Peiqi Sui, Eamon Duede, Sophie Wu, Richard Jean So


Why It Matters For Business

Hallucinations often produce more coherent, story-like text; that trait can be useful for product flows that prioritize readability, persuasion, or ideation, but it creates risk in truth-sensitive domains and needs human validation.

Who Should Care

Summary TLDR

The paper argues that many LLM 'hallucinations' are better described as confabulations: narrative-rich, coherent outputs that fill gaps with plausible details. Using an ELECTRA-large story detector across three dialog benchmarks (FaithDial, BEGIN, HaluEval), the authors show that hallucinated responses score higher on narrativity than factual responses and that narrativity predicts hallucination labels (logistic regression coefficient = 0.631, p < 0.01). Narrativity also correlates with automated dialogue coherence (beta coefficient = 0.372, p < 0.01). The authors propose reframing hallucinations as a usable resource, while warning that human studies and domain-specific safeguards are needed before adoption.

Problem Statement

Hallucinations in LLMs are usually treated as purely harmful. The paper asks whether these outputs instead express a narrative impulse (confabulation) that increases narrativity and coherence, and whether that property can be measured and potentially used rather than only suppressed.

Main Contribution

Operationalize narrativity as a scalar score using a fine-tuned ELECTRA-large story detector trained on an expert Reddit story dataset

Empirically show hallucinated dialog outputs have higher narrativity than truthful outputs across FaithDial, BEGIN, and HaluEval benchmarks

Key Findings

Hallucinated dialog responses have higher mean narrativity than truthful responses on evaluated benchmarks

Numbers: FaithDial mean 0.620 vs truth 0.518 (Δ≈0.102); HaluEval 0.655 vs 0.638 (Δ≈0.017); BEGIN 0.658 vs 0.561 (Δ≈0.097)

Practical Use: You can detect higher narrative content in model outputs; consider scoring narrativity when tuning or filtering outputs, especially for creative or explanatory tasks

Evidence Ref: Table 2

Higher narrativity significantly predicts an output being labeled a hallucination

Numbers: Logistic regression coefficient for narrativity = 0.631 (std err 0.059), p < 0.01

Practical Use: Narrativity is a usable feature for classifiers or filters that flag likely hallucinations in dialog systems

Evidence Ref: Table 3
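To make the reported coefficient easier to read, the sketch below converts it to an odds ratio and derives a Wald z-statistic and an approximate 95% confidence interval from the reported standard error. The Wald interval is a standard normal-approximation calculation added here for illustration, not something taken from the paper. This summary also does not say whether the narrativity feature was standardized, so the "one-unit increase" reading of the odds ratio is an assumption.

```python
import math

# Values reported in the summary above (Table 3 of the paper).
coef = 0.631      # logistic regression coefficient for narrativity
std_err = 0.059   # reported standard error of that coefficient

# Odds ratio for a one-unit increase in the narrativity feature.
# Assumption: the scale of the feature (raw vs standardized) is unknown here.
odds_ratio = math.exp(coef)

# Wald z-statistic and an approximate 95% interval (normal approximation).
z = coef / std_err
ci_low = coef - 1.96 * std_err
ci_high = coef + 1.96 * std_err

print(f"odds ratio: {odds_ratio:.3f}")
print(f"Wald z: {z:.2f}")
print(f"95% CI for coefficient: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"95% CI for odds ratio: [{math.exp(ci_low):.3f}, {math.exp(ci_high):.3f}]")
```

The interval stays well above zero, which is consistent with the reported p < 0.01. It says nothing about how much predictive accuracy the feature adds.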

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Narrativity mean (hallucinated vs truth) | FaithDial 0.620 vs 0.518; HaluEval 0.655 vs 0.638; BEGIN 0.658 vs 0.561 | Truthful responses | Δ≈0.102 / 0.017 / 0.097 | FaithDial, HaluEval, BEGIN | Table 2 reports means and counts for narrativity by label | Table 2 |
| Predictive power of narrativity for hallucination | Logistic regression coeff = 0.631 (std err 0.059) | No narrativity feature | | FaithDial + BEGIN aggregated (43,842 observations) | Table 3 shows positive significant coefficient (p<0.01) | Table 3 |

What To Try In 7 Days

Score existing model outputs with a narrativity detector to profile narrative intensity

A/B test high-narrativity vs low-narrativity outputs on user satisfaction for explanatory UI copy

Add narrativity as a feature in hallucination detectors and monitor flagged outputs
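The profiling and flagging steps above can be sketched as follows, assuming narrativity scores have already been produced by some detector. The function names, the sample records, and the 0.60 threshold are hypothetical choices for illustration. None of them come from the paper, and the threshold would need tuning on your own data.

```python
from statistics import mean

def profile_by_label(records):
    """Return the mean narrativity score per label for (score, label) records."""
    groups = {}
    for score, label in records:
        groups.setdefault(label, []).append(score)
    return {label: mean(scores) for label, scores in groups.items()}

def flag_high_narrativity(records, threshold):
    """Return the records whose narrativity score is at or above the threshold."""
    return [r for r in records if r[0] >= threshold]

# Hypothetical scores for illustration only; not data from the paper.
sample = [
    (0.72, "hallucinated"),
    (0.64, "hallucinated"),
    (0.55, "truthful"),
    (0.49, "truthful"),
]

profile = profile_by_label(sample)
flagged = flag_high_narrativity(sample, threshold=0.60)  # threshold is an assumption
print(profile)
print(flagged)
```

A threshold flag like this is only a monitoring aid. Given the small gap reported on HaluEval, narrativity alone is unlikely to separate hallucinated from truthful outputs reliably.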

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Correlation, not causation: analyses show association between narrativity and coherence but do not prove narrativity causes coherence

No human-subject experiments: user benefits of narrative-rich confabulations are hypothesized but untested

When Not To Use

In truth-sensitive applications (medicine, law, finance) where factual accuracy is mandatory

When stakeholders require verifiable citations or provenance for assertions

Failure Modes

Confabulations that read well but are false, leading to persuasive misinformation

High narrativity masking factual errors and reducing detectability by users

Core Entities

Models

ELECTRA-large

RoBERTa-large (used in DEAM reference)

Metrics

Narrativity score (story-detection softmax)

Coherence (DEAM)
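As a rough illustration of a story-detection softmax score, the sketch below turns a classifier's raw logits into the probability assigned to the "story" class. It assumes a two-class detector with the story logit at index 1. The label order, the function names, and the example logits are all hypothetical, and the paper's actual detector is not reproduced here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def narrativity_score(logits, story_index=1):
    """Probability assigned to the 'story' class.

    Assumes the 'story' logit sits at story_index; the real label order
    of the paper's detector is not given in this summary.
    """
    return softmax(logits)[story_index]

# Hypothetical logits in (not-story, story) order, for illustration only.
score = narrativity_score([0.2, 0.9])
print(round(score, 3))
```

Because the score is a probability, it always lies between 0 and 1, which matches the range of the benchmark means reported above.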

Datasets

FaithDial

BEGIN

HaluEval

Wizard of Wikipedia (WoW) (source for FaithDial)

Benchmarks

FaithDial

BEGIN

HaluEval