Overview
Production Readiness
0.45
Novelty Score
0.55
Cost Impact Score
0.2
Citation Count
10
Why It Matters For Business
Hallucinations can be both a liability and a creative asset; companies should guard critical outputs while experimenting with hallucination-driven ideation in low-risk workflows.
Summary TLDR
This survey reviews LLM hallucinations and argues they are not only risks but can be harnessed for creativity. It summarizes hallucination taxonomies, detection and reduction techniques, and creativity definitions and metrics from cognitive science. The authors map methods into a two‑phase pipeline: divergent (generate creative hallucinations via training, prompts, multi‑agent and human interaction) and convergent (identify, filter, and evaluate useful hallucinations). They highlight existing benchmarks (HaluEval, TruthfulQA, Med‑HALT), theoretical work linking hallucination and creativity, and urgent needs: richer datasets, automated evaluators, and models that can balance creativity and affi
Problem Statement
Hallucinations make LLM outputs unreliable in high‑stakes settings. At the same time, hallucinations may enable creative discovery. We lack clear theory, measurements, and methods to keep harmful hallucinations out while preserving or leveraging creative ones.
Main Contribution
Reviews hallucination taxonomies, detection, and mitigation in LLMs.
Argues for a positive, creativity-oriented view of hallucination supported by historical and cognitive science analogies.
Frames harnessing hallucination via divergent (generate) and convergent (evaluate/refine) phases and surveys related methods.
Summarizes evaluation approaches for LLM creativity and lists gaps: benchmarks, datasets, and automatic evaluation.
Key Findings
Hallucinations are usually split into factuality (wrong facts) and faithfulness (mismatch with instructions or context).
Existing hallucination benchmarks target different needs: HaluEval, TruthfulQA, and Med‑HALT each probe hallucination in different domains and task settings.
Studies that adapted human creativity tests show LLMs can generate creative items but still lag behind humans on tasks such as the Alternative Uses Test (AUT).
A two‑phase pipeline (divergent generation + convergent selection/evaluation) is a practical framework to harness hallucinations for creativity.
There is growing theoretical work linking hallucination and creativity, but empirical benchmarks and automated evaluators are sparse.
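The two-phase framing above can be sketched in a few lines. This is a minimal illustration, not the survey's implementation: the generator and scorer are hypothetical stand-ins (a real system would call a model and a verifier or human judge), and the temperature, threshold, and top_k values are assumptions chosen only for the example.

```python
import itertools
from typing import Callable, Iterable, List, Tuple

GenerateFn = Callable[[str, float], str]  # (prompt, temperature) -> text
ScoreFn = Callable[[str], float]          # text -> usefulness in [0, 1]


def divergent_phase(generate: GenerateFn, prompt: str,
                    n: int = 8, temperature: float = 1.2) -> List[str]:
    """Sample n candidates and keep the distinct, non-empty ones in order."""
    seen, candidates = set(), []
    for _ in range(n):
        text = generate(prompt, temperature).strip()
        if text and text not in seen:
            seen.add(text)
            candidates.append(text)
    return candidates


def convergent_phase(candidates: List[str], score: ScoreFn,
                     threshold: float = 0.5,
                     top_k: int = 3) -> List[Tuple[str, float]]:
    """Score candidates, drop those below threshold, return the best top_k."""
    scored = [(c, score(c)) for c in candidates]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:top_k]


def make_stub_generator(ideas: Iterable[str]) -> GenerateFn:
    """Hypothetical stand-in for a model call: cycles through canned ideas."""
    pool = itertools.cycle(list(ideas))
    return lambda prompt, temperature: next(pool)


def stub_score(text: str) -> float:
    """Toy usefulness proxy (assumption): longer ideas score higher, capped at 1."""
    return min(1.0, len(text.split()) / 3)
```

In practice the scorer is the hard part: the failure modes listed in this summary (judges that favor novelty over usefulness, self-assessment by the same model) all live in the convergent phase, so it is worth keeping that function swappable.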
Results
AUT / divergent thinking comparisons
Who Should Care
What To Try In 7 Days
Run an AUT/TTCT-style prompt set on your model and compare outputs to human baselines.
Use retrieval or knowledge‑graph augmentation for factual tasks and allow freer generation in brainstorming contexts.
Separate creative runs (open prompts) from validated runs (retrieval + verification) in your pipeline.
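One way to keep creative and validated runs apart is to pick a run configuration per task before any generation happens. The sketch below is an assumption-laden illustration: the RunConfig fields, the temperature values, and the tag names are hypothetical and would need to match your own pipeline.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class RunConfig:
    mode: str
    temperature: float          # assumed sampling temperature for this mode
    use_retrieval: bool         # ground answers in retrieved sources
    require_verification: bool  # block output until a checker or human signs off


# Hypothetical presets: open prompts for ideation, grounded and checked for facts.
CREATIVE = RunConfig("creative", temperature=1.1,
                     use_retrieval=False, require_verification=False)
VALIDATED = RunConfig("validated", temperature=0.2,
                      use_retrieval=True, require_verification=True)

# Domains this summary flags as high-stakes.
HIGH_STAKES_TAGS = {"medicine", "law", "finance"}


def choose_config(task_tags: Set[str], brainstorming: bool) -> RunConfig:
    """Default to the validated path; allow creative runs only for
    explicitly marked brainstorming tasks outside high-stakes domains."""
    if (task_tags & HIGH_STAKES_TAGS) or not brainstorming:
        return VALIDATED
    return CREATIVE
```

Defaulting to the validated path means a task that is mislabeled or untagged fails safe rather than free.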
Reproducibility
Open Source Status
- no
Risks & Boundaries
Limitations
- Survey synthesizes prior work but provides little new empirical evidence.
- Human judge evaluations bring subjectivity and cultural bias.
- Benchmarks are fragmented and domain limited.
- Automated creativity evaluation methods are underdeveloped.
When Not To Use
- High‑stakes factual decision systems (medicine, law, finance) where hallucinations cause harm.
- Systems that cannot add external verification or human oversight.
Failure Modes
- Model generates plausible but false facts that mislead users.
- Evaluation judges favor novelty over usefulness, promoting unsafe outputs.
- Automatic self‑assessment relies on the same model and misses systematic errors.
Core Entities
Models
- ChatGPT
- LLaMA
- GPT-3
- GPT-3.5
- GPT-4
Metrics
- fluency
- originality
- flexibility
- elaboration
- uncertainty estimation
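The four divergent-thinking metrics can be approximated automatically for an AUT-style response set. The following is a rough sketch under stated assumptions, not a validated scoring protocol: the category mapping and the reference pool of idea counts are hypothetical inputs you would have to supply (for example from human baseline responses), and elaboration is reduced here to mean word count.

```python
from typing import Dict, List, Mapping


def score_aut(responses: List[str],
              category_of: Mapping[str, str],
              pool_counts: Mapping[str, int],
              pool_size: int) -> Dict[str, float]:
    """Score one respondent's ideas.

    fluency     = number of distinct ideas
    flexibility = number of distinct categories those ideas fall into
    originality = mean rarity, 1 - (pool respondents giving the idea / pool_size)
    elaboration = mean word count per idea (a crude proxy, assumption)
    """
    ideas = [r.strip().lower() for r in responses if r.strip()]
    unique = list(dict.fromkeys(ideas))  # dedupe, keep first-seen order
    if not unique:
        return {"fluency": 0, "flexibility": 0,
                "originality": 0.0, "elaboration": 0.0}

    categories = {category_of.get(idea, "other") for idea in unique}
    rarity = [1 - pool_counts.get(idea, 0) / pool_size for idea in unique]
    words = [len(idea.split()) for idea in unique]

    return {
        "fluency": len(unique),
        "flexibility": len(categories),
        "originality": sum(rarity) / len(unique),
        "elaboration": sum(words) / len(unique),
    }
```

Exact string matching against the pool is the weakest link in this sketch; paraphrased ideas count as fully original, so real use would need normalization or semantic matching.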
Datasets
- Only Connect (used for creative problem-solving tasks)
Benchmarks
- HaluEval
- TruthfulQA
- Med-HALT
Context Entities
Models
- Multi-agent debate setups
Metrics
- self-reflection scoring
- classifier-based hallucination detectors
Datasets
- creative task collections adapted from cognitive tests
Benchmarks
- domain-specific hallucination suites (medical, legal)

