Overview
The paper shows consistent statistical associations across three public benchmarks using automatic metrics, but offers no human user studies and does not release tooling, so results are promising but preliminary.
Citations: 11
Evidence Strength: 0.60
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 1/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 30%
Novelty: 60%
Why It Matters For Business
Hallucinations often produce more coherent, story-like text; that trait can be useful for product flows that prioritize readability, persuasion, or ideation, but it creates risk in truth-sensitive domains and needs human validation.
Who Should Care
Summary TLDR
The paper argues that many LLM "hallucinations" are better described as confabulations: narrative-rich, coherent outputs that fill gaps with plausible details. Using an ELECTRA-large story detector on three dialog benchmarks (FaithDial, BEGIN, HaluEval), the authors show that hallucinated responses score higher on narrativity than factual responses and that narrativity predicts hallucination labels (logistic coefficient = 0.631, p<0.01). Narrativity also correlates with automated dialogue coherence (beta coefficient = 0.372, p<0.01). The authors propose treating hallucinations as a usable resource, while warning that human studies and domain-specific safeguards are needed before adoption.
Problem Statement
Hallucinations in LLMs are usually treated as purely harmful. The paper asks whether these outputs instead express a narrative impulse (confabulation) that increases narrativity and coherence, and whether that property can be measured and potentially used rather than only suppressed.
Main Contribution
Operationalize narrativity as a scalar score using a fine-tuned ELECTRA-large story detector trained on an expert Reddit story dataset
Empirically show hallucinated dialog outputs have higher narrativity than truthful outputs across FaithDial, BEGIN, and HaluEval benchmarks
Key Findings
Hallucinated dialog responses have higher mean narrativity than truthful responses on evaluated benchmarks
Higher narrativity significantly predicts an output being labeled a hallucination
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Narrativity mean (hallucinated vs truth) | FaithDial 0.620 vs 0.518; HaluEval 0.655 vs 0.638; BEGIN 0.658 vs 0.561 | Truthful responses | Δ≈0.102 / 0.017 / 0.097 | FaithDial, HaluEval, BEGIN | Table 2 reports means and counts for narrativity by label | Table 2 |
| Predictive power of narrativity for hallucination | Logistic regression coeff = 0.631 (std err 0.059) | No narrativity feature | — | FaithDial + BEGIN aggregated (43,842 observations) | Table 3 shows positive significant coefficient (p<0.01) | Table 3 |
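
The figures in the table can be re-derived with a few lines of arithmetic. The sketch below recomputes the per-dataset deltas and converts the logistic coefficient into an odds ratio with an approximate Wald interval. The function names are illustrative, the normal approximation is an assumption, and the summary does not state how the narrativity feature was scaled in the regression, so the odds ratio applies per one unit of whatever scale the authors used.

```python
import math

# Means copied from the Results table (Table 2): (hallucinated, truthful).
NARRATIVITY_MEANS = {
    "FaithDial": (0.620, 0.518),
    "HaluEval": (0.655, 0.638),
    "BEGIN": (0.658, 0.561),
}

# Logistic regression coefficient and standard error from the Results table (Table 3).
COEFF = 0.631
STD_ERR = 0.059


def narrativity_deltas(means):
    """Return hallucinated-minus-truthful mean narrativity per dataset."""
    return {name: round(h - t, 3) for name, (h, t) in means.items()}


def odds_ratio_with_ci(coeff, std_err, z=1.96):
    """Convert a logit coefficient into an odds ratio with a Wald interval.

    Assumption: normal approximation; z=1.96 gives roughly a 95% interval.
    """
    return (
        math.exp(coeff),
        math.exp(coeff - z * std_err),
        math.exp(coeff + z * std_err),
    )


deltas = narrativity_deltas(NARRATIVITY_MEANS)
odds_ratio, ci_low, ci_high = odds_ratio_with_ci(COEFF, STD_ERR)
print(deltas)
print(f"odds ratio ~ {odds_ratio:.2f} (approx. 95% CI {ci_low:.2f} to {ci_high:.2f})")
```

The HaluEval gap is much smaller than the other two, which is worth keeping in mind when reading the aggregated regression, since that regression covers FaithDial and BEGIN only.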
What To Try In 7 Days
Score existing model outputs with a narrativity detector to profile narrative intensity
A/B test high-narrativity vs low-narrativity outputs on user satisfaction for explanatory UI copy
Add narrativity as a feature in hallucination detectors and monitor flagged outputs
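
A minimal sketch of the first and third items above, assuming a scorer that maps text to a narrativity score in [0, 1]. The `toy_narrativity_score` function is a hypothetical keyword-based stand-in, not the paper's ELECTRA-large detector, and the threshold value is an arbitrary illustration. In practice the scorer would be replaced with a real story-detector probability.

```python
from statistics import mean
from typing import Callable, Sequence


def toy_narrativity_score(text: str) -> float:
    """Hypothetical stand-in scorer: share of words that are story-like cues.

    NOT the paper's detector; swap in a real story-detector probability.
    """
    cues = {"then", "when", "after", "once", "suddenly", "remembered", "went", "said"}
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(w in cues for w in words) / len(words)


def profile_outputs(
    outputs: Sequence[str],
    scorer: Callable[[str], float],
    threshold: float,
) -> dict:
    """Score each output, report the mean, and flag outputs at or above threshold."""
    scored = [(text, scorer(text)) for text in outputs]
    flagged = [text for text, score in scored if score >= threshold]
    mean_score = mean(score for _, score in scored) if scored else 0.0
    return {"mean_score": mean_score, "flagged": flagged, "scores": scored}


outputs = [
    "The capital of France is Paris.",
    "Once I went there and then suddenly remembered the answer.",
]
# The threshold is an illustrative assumption; tune it on labeled data.
report = profile_outputs(outputs, toy_narrativity_score, threshold=0.2)
print(report["mean_score"], report["flagged"])
```

Flagged outputs are candidates for review rather than confirmed hallucinations, since the reported association is statistical and the HaluEval gap in particular is small.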
Risks & Boundaries
Limitations
Correlation, not causation: analyses show association between narrativity and coherence but do not prove narrativity causes coherence
No human-subject experiments: user benefits of narrative-rich confabulations are hypothesized but untested
When Not To Use
In truth-sensitive applications (medicine, law, finance) where factual accuracy is mandatory
When stakeholders require verifiable citations or provenance for assertions
Failure Modes
Confabulations that read well but are false, leading to persuasive misinformation
High narrativity masking factual errors and reducing detectability by users

