Overview
The survey compiles existing theory and scattered experiments suggesting that LLM hallucinations can be put to creative use. Reproducible empirical benchmarks and automated evaluators remain limited, so practical use requires domain-specific checks.
Citations: 10
Evidence Strength: 0.60
Confidence: 0.70
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 1/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/1
Reproducibility
Status: No open assets linked
Open source: No
At A Glance
Cost impact: 20%
Production readiness: 45%
Novelty: 55%
Why It Matters For Business
Hallucinations can be both a liability and a creative asset; companies should guard critical outputs while experimenting with hallucination-driven ideation in low-risk workflows.
Summary TLDR
This survey reviews LLM hallucinations and argues they are not only risks but can be harnessed for creativity. It summarizes hallucination taxonomies, detection and reduction techniques, and creativity definitions and metrics from cognitive science. The authors map methods into a two‑phase pipeline: divergent (generate creative hallucinations via training, prompts, multi‑agent and human interaction) and convergent (identify, filter, and evaluate useful hallucinations). They highlight existing benchmarks (HaluEval, TruthfulQA, Med‑HALT), theoretical work linking hallucination and creativity, and urgent needs: richer datasets, automated evaluators, and models that can balance creativity and affi
Problem Statement
Hallucinations make LLM outputs unreliable in high‑stakes settings. At the same time, hallucinations may enable creative discovery. We lack clear theory, measurements, and methods to keep harmful hallucinations out while preserving or leveraging creative ones.
Main Contribution
Reviews hallucination taxonomies, detection, and mitigation in LLMs.
Argues for a positive, creativity-oriented view of hallucination supported by historical and cognitive science analogies.
Key Findings
Hallucinations are usually split into factuality (wrong facts) and faithfulness (mismatch with instructions or context).
Existing detection methods and benchmarks target different needs: HaluEval, TruthfulQA, and Med‑HALT each test hallucination detection in a different domain or setting.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| AUT / divergent thinking comparisons | Humans > GPT-3 in AUT; judged on 1–5 scale by two human raters | human norms | — | AUT / Stevenson et al., 2022 | Stevenson et al., 2022 reported humans outperform GPT-3 on AUT evaluations | Section 4.3 |
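The table reports no explicit delta. As a minimal sketch of how one could be computed for an AUT-style comparison, the snippet below averages two raters' 1–5 scores per response and then compares group means. The aggregation rule (a simple mean of the two raters) and every rating value are assumptions made for illustration; none of the numbers come from the survey or from Stevenson et al., 2022.

```python
from statistics import mean
from typing import List, Tuple

# One response's ratings: (rater_a, rater_b), each on the 1-5 scale.
Rating = Tuple[int, int]


def response_score(rating: Rating) -> float:
    """Average the two raters' scores for a single response."""
    for value in rating:
        if not 1 <= value <= 5:
            raise ValueError(f"rating {value} is outside the 1-5 scale")
    rater_a, rater_b = rating
    return (rater_a + rater_b) / 2


def group_score(ratings: List[Rating]) -> float:
    """Mean per-response score across a group of responses."""
    return mean(response_score(r) for r in ratings)


# Placeholder ratings invented purely for illustration (NOT real data).
human_ratings: List[Rating] = [(4, 5), (3, 4), (4, 4)]
model_ratings: List[Rating] = [(3, 3), (2, 4), (3, 4)]

# Delta is model minus human baseline; negative means humans scored higher.
delta = group_score(model_ratings) - group_score(human_ratings)
print(f"human={group_score(human_ratings):.2f} "
      f"model={group_score(model_ratings):.2f} delta={delta:+.2f}")
```

A real evaluation would also report inter-rater agreement, since the Limitations section notes that human judgments are subjective.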
What To Try In 7 Days
Run an AUT/TTCT style prompt set on your model and compare outputs to human baselines.
Use retrieval or knowledge‑graph augmentation for factual tasks and allow freer generation in brainstorming contexts.
Separate creative runs (open prompts) from validated runs (retrieval + verification) in your pipeline.
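The separation suggested above can be sketched as two code paths behind a single router. This is an illustrative outline, not the survey's method: `generate`, `retrieve`, and `verify` are hypothetical callables standing in for your model, retriever, and fact checker, and the toy versions below exist only to make the sketch runnable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunResult:
    mode: str       # "creative" or "validated"
    text: str
    verified: bool  # True only when the verification step passed


def run_creative(prompt: str, generate: Callable[[str], str]) -> RunResult:
    """Open-ended run: no retrieval, and the output is never marked verified."""
    return RunResult("creative", generate(prompt), False)


def run_validated(
    prompt: str,
    generate: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    verify: Callable[[str, List[str]], bool],
) -> RunResult:
    """Grounded run: retrieve passages, generate with them, then verify."""
    passages = retrieve(prompt)
    grounded = "Context:\n" + "\n".join(passages) + "\n\nQuestion: " + prompt
    answer = generate(grounded)
    return RunResult("validated", answer, verify(answer, passages))


def route(
    prompt: str,
    factual: bool,
    generate: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    verify: Callable[[str, List[str]], bool],
) -> RunResult:
    """Send factual tasks down the validated path, everything else down the creative one."""
    if factual:
        return run_validated(prompt, generate, retrieve, verify)
    return run_creative(prompt, generate)


# Toy stand-ins (hypothetical) so the sketch runs without a real model.
def toy_generate(prompt: str) -> str:
    return "Paris" if "Context:" in prompt else "A brick could be a doorstop."


def toy_retrieve(prompt: str) -> List[str]:
    return ["Paris is the capital of France."]


def toy_verify(answer: str, passages: List[str]) -> bool:
    # Naive check: the answer must appear in a retrieved passage.
    return any(answer.lower() in p.lower() for p in passages)


factual_run = route("What is the capital of France?", True,
                    toy_generate, toy_retrieve, toy_verify)
creative_run = route("List unusual uses for a brick.", False,
                     toy_generate, toy_retrieve, toy_verify)
```

Keeping `verified` on every result lets downstream code refuse to surface creative output where only validated output is acceptable.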
Risks & Boundaries
Limitations
Survey synthesizes prior work but provides little new empirical evidence.
Human judge evaluations bring subjectivity and cultural bias.
When Not To Use
High‑stakes factual decision systems (medicine, law, finance) where hallucinations cause harm.
Systems that cannot add external verification or human oversight.
Failure Modes
Model generates plausible but false facts that mislead users.
Evaluation judges favor novelty over usefulness, promoting unsafe outputs.

