LLM 'hallucinations' are narrative-rich confabulations that can improve coherence and may be useful

June 6, 2024 · 6 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

11

Authors

Peiqi Sui, Eamon Duede, Sophie Wu, Richard Jean So

Why It Matters For Business

Hallucinated outputs often read as more coherent, story-like text. That trait can be useful for product flows that prioritize readability, persuasion, or ideation, but it creates risk in truth-sensitive domains, where outputs need human validation.

Summary TLDR

The paper argues that many LLM 'hallucinations' are better described as confabulations: narrative-rich, coherent outputs that fill gaps with plausible details. Using a fine-tuned ELECTRA-large story detector on three dialog benchmarks (FaithDial, BEGIN, HaluEval), the authors show that hallucinated responses score higher on narrativity than truthful responses and that narrativity predicts hallucination labels (logistic regression coefficient = 0.631, p < 0.01). Narrativity also correlates with automated dialogue coherence (beta regression coefficient = 0.372, p < 0.01). The authors propose treating confabulation as a potentially usable resource, while cautioning that human studies and domain-specific safeguards are needed before adoption.

Problem Statement

Hallucinations in LLMs are usually treated as purely harmful. The paper asks whether these outputs instead express a narrative impulse (confabulation) that increases narrativity and coherence, and whether that property can be measured and potentially used rather than only suppressed.

Main Contribution

Operationalize narrativity as a scalar score using a fine-tuned ELECTRA-large story detector trained on an expert Reddit story dataset

Empirically show hallucinated dialog outputs have higher narrativity than truthful outputs across FaithDial, BEGIN, and HaluEval benchmarks

Demonstrate narrativity predicts hallucination labels (logistic regression) and correlates with dialogue coherence (beta regression)

Argue for a narrative-centered reframing of hallucination as 'confabulation' and outline human-evaluation and application directions
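To make the first contribution concrete, the following is a minimal sketch of how a two-class story detector's logits could be turned into a scalar narrativity score via softmax. The label order (non-story, story), the function names, and the example logits are assumptions for illustration; the paper's fine-tuned ELECTRA-large model is not loaded here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def narrativity_score(logits: Sequence[float], story_index: int = 1) -> float:
    """Return the softmax probability assigned to the 'story' class.

    Assumption: the detector emits one logit per class and the 'story'
    class sits at `story_index` (hypothetical ordering).
    """
    return softmax(logits)[story_index]


# Hypothetical logits for two responses, ordered (non-story, story).
terse_reply = [1.2, -0.4]
anecdotal_reply = [-0.3, 1.1]

print(round(narrativity_score(terse_reply), 3))
print(round(narrativity_score(anecdotal_reply), 3))
```

The score is bounded in [0, 1], which is what lets it serve both as a predictor in the logistic regression and as a per-response profile value.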

Key Findings

Hallucinated dialog responses have higher mean narrativity than truthful responses on evaluated benchmarks

Numbers: FaithDial mean 0.620 vs truth 0.518 (Δ≈0.102); HaluEval 0.655 vs 0.638 (Δ≈0.017); BEGIN 0.658 vs 0.561 (Δ≈0.097)

Higher narrativity significantly predicts an output being labeled a hallucination

Numbers: Logistic regression coefficient for narrativity = 0.631 (std err 0.059), p < 0.01

Narrativity is positively associated with automated dialogue coherence

Numbers: Beta regression coefficient = 0.372 (std err 0.029), p < 0.01
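One way to read the reported logistic coefficient is as an odds ratio. The sketch below converts the coefficient and its standard error into an odds ratio, an approximate 95% Wald interval, and a Wald z statistic. It assumes a normal approximation and a "per one-unit increase" interpretation; the summary does not say whether the narrativity predictor was standardized, so the size of one unit is an open assumption.

```python
import math
from typing import Tuple

# Values reported in the summary.
COEF = 0.631     # logistic regression coefficient for narrativity
STD_ERR = 0.059  # its standard error


def odds_ratio(coef: float) -> float:
    """Multiplicative change in odds per one-unit increase in the predictor."""
    return math.exp(coef)


def wald_interval(coef: float, std_err: float, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% interval for the coefficient (normal approximation)."""
    return coef - z * std_err, coef + z * std_err


low, high = wald_interval(COEF, STD_ERR)
print(f"odds ratio: {odds_ratio(COEF):.2f}")
print(f"95% interval on odds ratio: {odds_ratio(low):.2f} to {odds_ratio(high):.2f}")
print(f"Wald z: {COEF / STD_ERR:.1f}")
```

Under these assumptions the odds of a hallucination label rise by a factor of roughly 1.88 per unit of narrativity, and the interval stays above 1, which is consistent with the reported p < 0.01.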

Results

Narrativity mean (hallucinated vs truth)

Value: FaithDial 0.620 vs 0.518; HaluEval 0.655 vs 0.638; BEGIN 0.658 vs 0.561

Baseline: Truthful responses

Predictive power of narrativity for hallucination

Value: Logistic regression coeff = 0.631 (std err 0.059)

Baseline: No narrativity feature

Association between narrativity and coherence

Value: Beta regression coeff = 0.372 (std err 0.029)

Baseline: No narrativity predictor
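The per-benchmark gaps can be recomputed directly from the reported means. This short sketch uses only the numbers listed above; the function and variable names are illustrative.

```python
from typing import Dict, Tuple

# Mean narrativity per benchmark as reported: (hallucinated, truthful).
REPORTED_MEANS: Dict[str, Tuple[float, float]] = {
    "FaithDial": (0.620, 0.518),
    "HaluEval": (0.655, 0.638),
    "BEGIN": (0.658, 0.561),
}


def narrativity_gaps(means: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Hallucinated-minus-truthful gap per benchmark, rounded to 3 places."""
    return {name: round(h - t, 3) for name, (h, t) in means.items()}


gaps = narrativity_gaps(REPORTED_MEANS)
# Print benchmarks from largest to smallest gap.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    hallucinated, truthful = REPORTED_MEANS[name]
    print(f"{name:<10} {hallucinated:.3f} vs {truthful:.3f}  gap={gap:.3f}")
```

The gap is positive on all three benchmarks, but on HaluEval it is much smaller than on FaithDial and BEGIN, which is worth keeping in mind when generalizing the effect.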

Who Should Care

What To Try In 7 Days

Score existing model outputs with a narrativity detector to profile narrative intensity

A/B test high-narrativity vs low-narrativity outputs on user satisfaction for explanatory UI copy

Add narrativity as a feature in hallucination detectors and monitor flagged outputs
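The first two steps above could be prototyped along these lines: profile a batch of already-scored outputs, then split them at the median into high- and low-narrativity arms for an A/B test. The scores are assumed to come from some narrativity detector run beforehand; the sample texts, scores, and function names are hypothetical.

```python
from statistics import mean, median
from typing import Dict, List, Optional, Tuple

ScoredOutput = Tuple[str, float]  # (response text, narrativity score in [0, 1])


def profile(scored: List[ScoredOutput]) -> Dict[str, float]:
    """Summary statistics of narrativity across a batch of outputs."""
    scores = [score for _, score in scored]
    return {
        "min": min(scores),
        "median": median(scores),
        "mean": mean(scores),
        "max": max(scores),
    }


def split_arms(
    scored: List[ScoredOutput], cutoff: Optional[float] = None
) -> Dict[str, List[ScoredOutput]]:
    """Split outputs into 'high' (score > cutoff) and 'low' (score <= cutoff).

    If no cutoff is given, the batch median is used.
    """
    if cutoff is None:
        cutoff = median([score for _, score in scored])
    arms: Dict[str, List[ScoredOutput]] = {"high": [], "low": []}
    for item in scored:
        arms["high" if item[1] > cutoff else "low"].append(item)
    return arms


# Hypothetical, pre-scored UI copy variants.
batch = [
    ("Your export failed. Retry.", 0.31),
    ("We hit a snag while packing your files, so the export stopped.", 0.72),
    ("Export stopped because the disk filled up.", 0.55),
    ("The export ran out of room partway through and had to stop.", 0.64),
]

print(profile(batch))
arms = split_arms(batch)
print(len(arms["high"]), len(arms["low"]))
```

A median split keeps the two arms balanced in size; a fixed cutoff may be preferable once enough outputs have been profiled to choose one.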

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Correlation, not causation: analyses show association between narrativity and coherence but do not prove narrativity causes coherence
  • No human-subject experiments: user benefits of narrative-rich confabulations are hypothesized but untested
  • Narrativity detector is automatic and trained on Reddit stories; domain mismatch could bias scores
  • Findings come from dialog benchmarks and may not generalize to other tasks or to high-stakes domains

When Not To Use

  • In truth-sensitive applications (medicine, law, finance) where factual accuracy is mandatory
  • When stakeholders require verifiable citations or provenance for assertions
  • If regulatory or compliance constraints forbid plausible but unverified outputs

Failure Modes

  • Confabulations that read well but are false, leading to persuasive misinformation
  • High narrativity masking factual errors and reducing detectability by users
  • Automatic narrativity and coherence metrics misclassify technical or terse factual responses
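A simple review gate is one possible mitigation for these failure modes. The sketch below always routes drafts in the truth-sensitive domains listed under "When Not To Use" to a human, and routes other drafts when their narrativity score crosses a threshold. The threshold value, the `Draft` type, and the function name are hypothetical and not taken from the paper. Because automatic metrics may misclassify terse factual text, a low score should not be read as evidence of accuracy.

```python
from dataclasses import dataclass

# Domains named in the "When Not To Use" list.
TRUTH_SENSITIVE_DOMAINS = {"medicine", "law", "finance"}
# Hypothetical cutoff for illustration; it would need tuning on real data.
REVIEW_THRESHOLD = 0.6


@dataclass
class Draft:
    text: str
    domain: str
    narrativity: float  # score in [0, 1] from some narrativity detector


def needs_human_review(draft: Draft, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Return True when a draft should be checked by a person before release."""
    if draft.domain in TRUTH_SENSITIVE_DOMAINS:
        # Factual accuracy is mandatory here, regardless of narrativity.
        return True
    # Elsewhere, fluent story-like drafts are the ones most likely to
    # hide errors behind readable prose, so gate on the score.
    return draft.narrativity >= threshold


drafts = [
    Draft("Take two tablets daily.", "medicine", 0.20),
    Draft("Once upon a launch day, the feature shipped itself.", "marketing", 0.81),
    Draft("Release notes: fixed login bug.", "marketing", 0.25),
]

for draft in drafts:
    print(draft.domain, draft.narrativity, needs_human_review(draft))
```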

Core Entities

Models

  • ELECTRA-large
  • RoBERTa-large (used in DEAM reference)

Metrics

  • Narrativity score (story-detection softmax)
  • Coherence (DEAM)

Datasets

  • FaithDial
  • BEGIN
  • HaluEval
  • Wizard of Wikipedia (WoW) (source for FaithDial)

Benchmarks

  • FaithDial
  • BEGIN
  • HaluEval