Survey: When LLM hallucinations become a source of creativity

February 2, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

The survey compiles existing theory and scattered experiments suggesting promise, but empirical, reproducible benchmarks and automated evaluators are limited; practical use requires domain checks.

Citations: 10

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 1/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/1

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 20%

Production readiness: 45%

Novelty: 55%

Authors

Xuhui Jiang, Yuxing Tian, Fengrui Hua, Chengjin Xu, Yuanzhuo Wang, Jian Guo

Why It Matters For Business

Hallucinations can be both a liability and a creative asset; companies should guard critical outputs while experimenting with hallucination-driven ideation in low-risk workflows.

Who Should Care

Summary TLDR

This survey reviews LLM hallucinations and argues they are not only risks but can be harnessed for creativity. It summarizes hallucination taxonomies, detection and reduction techniques, and creativity definitions and metrics from cognitive science. The authors map methods into a two‑phase pipeline: divergent (generate creative hallucinations via training, prompts, multi‑agent and human interaction) and convergent (identify, filter, and evaluate useful hallucinations). They highlight existing benchmarks (HaluEval, TruthfulQA, Med‑HALT), theoretical work linking hallucination and creativity, and urgent needs: richer datasets, automated evaluators, and models that can balance creativity and affi

Problem Statement

Hallucinations make LLM outputs unreliable in high‑stakes settings. At the same time, hallucinations may enable creative discovery. We lack clear theory, measurements, and methods to keep harmful hallucinations out while preserving or leveraging creative ones.

Main Contribution

Review of hallucination taxonomies, detection, and mitigation in LLMs.

Argument for a positive, creativity-oriented view of hallucination, supported by historical and cognitive science analogies.

Key Findings

Hallucinations are usually split into factuality (wrong facts) and faithfulness (mismatch with instructions or context).

Practical Use: Pick detection and mitigation methods based on the hallucination type: factuality needs external facts; faithfulness needs instruction/context alignment.

Evidence Ref: Section 2.1; Ye et al., 2023; Zhang et al., 2023b

Existing benchmarks target different needs: HaluEval, TruthfulQA, and Med‑HALT test hallucination behavior across domains.

Practical Use: Use these benchmarks where relevant rather than inventing ad‑hoc tests; choose domain‑specific suites for medicine or law.

Evidence Ref: Section 2.2; Li et al., 2023; Lin et al., 2022; Umapathi et al., 2023
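The practical advice in the first finding can be read as a dispatch on hallucination type. The sketch below is illustrative only: `HallucinationType`, `check_factuality`, `check_faithfulness`, `is_supported`, and the toy matching heuristics are assumptions made for this example, not methods from the survey. A real system would likely swap in retrieval against a knowledge source and a proper entailment or instruction-following check.

```python
from enum import Enum


class HallucinationType(Enum):
    FACTUALITY = "factuality"      # output contradicts external facts
    FAITHFULNESS = "faithfulness"  # output drifts from the given context


def check_factuality(claim: str, trusted_facts: set[str]) -> bool:
    """Toy stand-in for an external fact lookup: exact match in a trusted set."""
    return claim.strip().lower() in trusted_facts


def check_faithfulness(answer: str, context: str) -> bool:
    """Toy stand-in for context alignment: every answer word appears in the context."""
    context_words = set(context.lower().split())
    return all(word in context_words for word in answer.lower().split())


def is_supported(kind: HallucinationType, output: str, reference) -> bool:
    """Route the output to the check that matches the hallucination type."""
    if kind is HallucinationType.FACTUALITY:
        return check_factuality(output, reference)   # reference: set of facts
    return check_faithfulness(output, reference)     # reference: context string


facts = {"paris is the capital of france"}
print(is_supported(HallucinationType.FACTUALITY, "Paris is the capital of France", facts))
print(is_supported(HallucinationType.FAITHFULNESS, "the cat sat", "the cat sat on the mat"))
```

The point of the split is that the two checks need different reference material: a fact store for factuality, the prompt or source document for faithfulness.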

Results

Metric: AUT / divergent thinking comparisons
Value: Humans > GPT-3 in AUT; judged on a 1-5 scale by two human raters
Baseline: human norms
Delta: none reported
Split / Dataset: AUT / Stevenson et al., 2022
Evidence: Stevenson et al., 2022 reported humans outperform GPT-3 on AUT evaluations
Evidence Ref: Section 4.3
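For teams that want to run an AUT-style comparison in-house, the divergent-thinking metrics listed under Core Entities (fluency, flexibility, originality) can be approximated with simple counts. The sketch below is a hedged illustration, not the scoring protocol of Stevenson et al., 2022, which relied on human raters. The function name `score_aut`, the hand-assigned category labels, and the rarity-based originality proxy are all assumptions.

```python
from collections import Counter


def score_aut(responses, categories, reference_pool):
    """Score one participant's Alternative Uses Task answers.

    responses:      list of use ideas from one participant (human or model)
    categories:     dict mapping each lowercased response to a hand-assigned category
    reference_pool: list of responses from all participants, used to estimate rarity
    """
    # Normalize and de-duplicate while preserving order.
    unique = list(dict.fromkeys(r.strip().lower() for r in responses))
    pool_counts = Counter(r.strip().lower() for r in reference_pool)
    pool_size = max(len(reference_pool), 1)

    fluency = len(unique)                               # number of distinct ideas
    flexibility = len({categories[r] for r in unique})  # number of distinct categories
    # Assumed originality proxy: mean rarity of each idea within the pool.
    originality = (
        sum(1 - pool_counts[r] / pool_size for r in unique) / fluency
        if fluency else 0.0
    )
    return {"fluency": fluency, "flexibility": flexibility, "originality": originality}


pool = ["doorstop", "doorstop", "paperweight", "build a wall", "pen holder"]
cats = {"doorstop": "weight", "paperweight": "weight", "pen holder": "container"}
scores = score_aut(["Doorstop", "paperweight", "pen holder"], cats, pool)
print(scores)
```

Counts like these are cheap to compute but do not capture usefulness, so they complement human ratings rather than replace them.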

What To Try In 7 Days

Run an AUT/TTCT style prompt set on your model and compare outputs to human baselines.

Use retrieval or knowledge‑graph augmentation for factual tasks and allow freer generation in brainstorming contexts.

Separate creative runs (open prompts) from validated runs (retrieval + verification) in your pipeline.
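The steps above amount to routing requests into two lanes. The following is a minimal sketch under stated assumptions: `generate`, `retrieve`, and `verify` are hypothetical callables standing in for your model, retriever, and verifier, and the temperature values are placeholders, not recommendations from the survey.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunResult:
    text: str
    mode: str        # "creative" or "validated"
    verified: bool   # True only when the verifier accepted the output


def run_pipeline(
    prompt: str,
    creative: bool,
    generate: Callable[[str, float], str],     # hypothetical: (prompt, temperature) -> text
    retrieve: Callable[[str], List[str]],      # hypothetical: prompt -> supporting passages
    verify: Callable[[str, List[str]], bool],  # hypothetical: (text, passages) -> accepted?
) -> RunResult:
    if creative:
        # Brainstorming lane: open prompt, higher temperature, no factual guarantee.
        return RunResult(generate(prompt, 1.0), "creative", verified=False)

    # Validated lane: ground the prompt in retrieved passages, then verify the answer.
    passages = retrieve(prompt)
    grounded_prompt = "\n".join(passages + [prompt])
    text = generate(grounded_prompt, 0.0)
    return RunResult(text, "validated", verified=verify(text, passages))
```

Tagging each result with its mode lets downstream consumers refuse to treat brainstorming output as fact.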

Reproducibility

Code Available: No
Data Available: No
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

The survey synthesizes prior work but provides little new empirical evidence.

Human judge evaluations bring subjectivity and cultural bias.

When Not To Use

High‑stakes factual decision systems (medicine, law, finance) where hallucinations cause harm.

Systems that cannot add external verification or human oversight.

Failure Modes

Model generates plausible but false facts that mislead users.

Evaluation judges favor novelty over usefulness, promoting unsafe outputs.

Core Entities

Models

ChatGPT, LLaMA, GPT-3, GPT-3.5, GPT-4

Metrics

fluency, originality, flexibility, elaboration, uncertainty estimation

Datasets

Only Connect (used for creative problem tasks)

Benchmarks

HaluEval, TruthfulQA, Med-HALT

Context Entities

Models

Multi-agent debate setups

Metrics

self-reflection scoring, classifier-based hallucination detectors

Datasets

creative task collections adapted from cognitive tests

Benchmarks

domain-specific hallucination suites (medical, legal)