Survey: When LLM hallucinations become a source of creativity

February 2, 2024 · 6 min

Overview

Production Readiness

0.45

Novelty Score

0.55

Cost Impact Score

0.2

Citation Count

10

Authors

Xuhui Jiang, Yuxing Tian, Fengrui Hua, Chengjin Xu, Yuanzhuo Wang, Jian Guo

Why It Matters For Business

Hallucinations can be both a liability and a creative asset; companies should guard critical outputs while experimenting with hallucination-driven ideation in low-risk workflows.

Summary TLDR

This survey reviews LLM hallucinations and argues they are not only risks but can be harnessed for creativity. It summarizes hallucination taxonomies, detection and reduction techniques, and creativity definitions and metrics from cognitive science. The authors map methods into a two‑phase pipeline: divergent (generate creative hallucinations via training, prompts, multi‑agent and human interaction) and convergent (identify, filter, and evaluate useful hallucinations). They highlight existing benchmarks (HaluEval, TruthfulQA, Med‑HALT), theoretical work linking hallucination and creativity, and urgent needs: richer datasets, automated evaluators, and models that can balance creativity and affi

Problem Statement

Hallucinations make LLM outputs unreliable in high‑stakes settings. At the same time, hallucinations may enable creative discovery. We lack clear theory, measurements, and methods to keep harmful hallucinations out while preserving or leveraging creative ones.

Main Contribution

Reviews hallucination taxonomies, detection, and mitigation in LLMs.

Argues for a positive, creativity-oriented view of hallucination supported by historical and cognitive science analogies.

Frames harnessing hallucination via divergent (generate) and convergent (evaluate/refine) phases and surveys related methods.

Summarizes evaluation approaches for LLM creativity and lists gaps: benchmarks, datasets, and automatic evaluation.

Key Findings

Hallucinations are usually split into factuality (wrong facts) and faithfulness (mismatch with instructions or context).

Existing detection methods and benchmarks target different needs: HaluEval, TruthfulQA, and Med‑HALT each test hallucination detection in a different domain or setting.

Studies that adapted human creativity tests show LLMs can generate creative items but still lag behind humans on tasks like the Alternative Uses Task (AUT).

Numbers: AUT comparisons and human judges used; Stevenson et al., 2022

A two‑phase pipeline (divergent generation + convergent selection/evaluation) is a practical framework to harness hallucinations for creativity.

There is growing theoretical work linking hallucination and creativity, but empirical benchmarks and automated evaluators are sparse.
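
The two-phase framing above can be sketched in a few lines. This is a minimal illustration, not the survey's implementation: the names `divergent_phase`, `convergent_phase`, and `Candidate`, the usefulness floor of 0.5, and the toy generator and scores are all assumptions made up for the example. In practice the generator would be a model call and the scorers would be human raters or automated evaluators.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    text: str
    novelty: float     # 0..1, higher means more unusual (hypothetical scale)
    usefulness: float  # 0..1, higher means more practical (hypothetical scale)


def divergent_phase(prompt: str, generate: Callable[[str], List[str]]) -> List[str]:
    """Collect many unconstrained ideas; no filtering happens here."""
    return generate(prompt)


def convergent_phase(
    ideas: List[str],
    score_novelty: Callable[[str], float],
    score_usefulness: Callable[[str], float],
    min_usefulness: float = 0.5,  # assumed threshold, tune per use case
    top_k: int = 3,
) -> List[Candidate]:
    """Score each idea, drop those below the usefulness floor, rank by novelty."""
    scored = [Candidate(i, score_novelty(i), score_usefulness(i)) for i in ideas]
    kept = [c for c in scored if c.usefulness >= min_usefulness]
    return sorted(kept, key=lambda c: c.novelty, reverse=True)[:top_k]


# Toy stand-ins so the sketch runs without a model; all values are invented.
def toy_generate(prompt: str) -> List[str]:
    return ["doorstop", "garden edging", "antigravity engine"]


TOY_NOVELTY = {"doorstop": 0.1, "garden edging": 0.6, "antigravity engine": 0.9}
TOY_USEFULNESS = {"doorstop": 0.9, "garden edging": 0.7, "antigravity engine": 0.1}

ideas = divergent_phase("List unusual uses for a brick.", toy_generate)
selected = convergent_phase(
    ideas,
    score_novelty=lambda s: TOY_NOVELTY[s],
    score_usefulness=lambda s: TOY_USEFULNESS[s],
)
```

Filtering on usefulness before ranking by novelty is one possible design choice; it keeps the most novel but impractical idea from winning by default.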

Results

AUT / divergent thinking comparisons

Value: Humans > GPT-3 in AUT; judged on a 1–5 scale by two human raters

Baseline: human norms
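
A comparison of this kind, with two raters scoring each response on a 1–5 scale, could be aggregated roughly as follows. This is a sketch under assumptions: the function name `mean_rating` is hypothetical, and the ratings below are invented placeholders, not data from Stevenson et al. or the survey.

```python
from statistics import mean
from typing import List, Tuple


def mean_rating(ratings: List[Tuple[int, int]]) -> float:
    """Average the two raters per response, then average across responses."""
    for a, b in ratings:
        if not (1 <= a <= 5 and 1 <= b <= 5):
            raise ValueError("ratings must be on the 1-5 scale")
    return mean((a + b) / 2 for a, b in ratings)


# Placeholder ratings (rater_a, rater_b) per response; made up for illustration.
human_ratings = [(4, 5), (3, 4), (4, 4)]
model_ratings = [(3, 3), (2, 4), (3, 3)]

# A positive gap means the human responses were rated higher on average.
gap = mean_rating(human_ratings) - mean_rating(model_ratings)
```

With only two raters, it is also worth checking how often they disagree before trusting the averaged score.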

What To Try In 7 Days

Run an AUT/TTCT-style prompt set (Alternative Uses Task, Torrance Tests of Creative Thinking) on your model and compare outputs to human baselines.

Use retrieval or knowledge‑graph augmentation for factual tasks and allow freer generation in brainstorming contexts.

Separate creative runs (open prompts) from validated runs (retrieval + verification) in your pipeline.
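
One way to keep creative and validated runs apart is to route each request to an explicit run configuration. The sketch below is illustrative only: `RunConfig`, `config_for`, the task names, and the temperature values are assumptions for the example, not settings recommended by the survey.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    temperature: float          # sampling temperature (assumed values below)
    use_retrieval: bool         # ground answers in retrieved documents
    require_verification: bool  # block output until a check or reviewer passes it


# Open-ended ideation: freer sampling, no grounding step.
CREATIVE = RunConfig(temperature=1.0, use_retrieval=False, require_verification=False)

# Factual work: conservative sampling plus retrieval and verification.
VALIDATED = RunConfig(temperature=0.2, use_retrieval=True, require_verification=True)

# Hypothetical allow-list of low-risk task types.
CREATIVE_TASKS = {"brainstorm", "naming", "story"}


def config_for(task_type: str) -> RunConfig:
    """Default to the validated path; only allow-listed tasks run creative."""
    return CREATIVE if task_type in CREATIVE_TASKS else VALIDATED
```

Defaulting unknown task types to the validated path means a new or mislabeled workflow fails safe rather than running with unchecked generation.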

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Survey synthesizes prior work but provides little new empirical evidence.
  • Human judge evaluations bring subjectivity and cultural bias.
  • Benchmarks are fragmented and domain limited.
  • Automated creativity evaluation methods are underdeveloped.

When Not To Use

  • High‑stakes factual decision systems (medicine, law, finance) where hallucinations cause harm.
  • Systems that cannot add external verification or human oversight.

Failure Modes

  • Model generates plausible but false facts that mislead users.
  • Evaluation judges favor novelty over usefulness, promoting unsafe outputs.
  • Automatic self‑assessment relies on the same model and misses systematic errors.

Core Entities

Models

  • ChatGPT
  • LLaMA
  • GPT-3
  • GPT-3.5
  • GPT-4

Metrics

  • fluency
  • originality
  • flexibility
  • elaboration
  • uncertainty estimation
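
The first four metrics come from divergent-thinking tests and can be approximated with simple counts. The sketch below is a rough illustration under assumptions: the function name `ttct_style_scores`, the category map, and the reference pool are hypothetical, and published scoring manuals define these metrics in more detail. Here fluency is the number of distinct responses, flexibility the number of distinct categories, originality the average rarity against a reference pool, and elaboration the average word count.

```python
from typing import Dict, List


def ttct_style_scores(
    responses: List[str],
    category_of: Dict[str, str],   # hypothetical response -> category map
    pool_counts: Dict[str, int],   # how often each response appears in a reference pool
    pool_size: int,                # number of participants in that pool
) -> Dict[str, float]:
    # Normalize and de-duplicate while preserving order.
    unique = list(dict.fromkeys(r.strip().lower() for r in responses if r.strip()))
    if not unique:
        return {"fluency": 0, "flexibility": 0, "originality": 0.0, "elaboration": 0.0}

    fluency = len(unique)
    flexibility = len({category_of.get(r, "other") for r in unique})
    # Rarity: 1.0 if nobody in the pool gave this response, lower if many did.
    originality = sum(1 - pool_counts.get(r, 0) / pool_size for r in unique) / fluency
    elaboration = sum(len(r.split()) for r in unique) / fluency
    return {
        "fluency": fluency,
        "flexibility": flexibility,
        "originality": originality,
        "elaboration": elaboration,
    }


# Invented example data for "uses for a brick".
scores = ttct_style_scores(
    responses=["Doorstop", "paperweight", "doorstop", "grind into pigment for paint"],
    category_of={
        "doorstop": "weight",
        "paperweight": "weight",
        "grind into pigment for paint": "material",
    },
    pool_counts={"doorstop": 40, "paperweight": 10},
    pool_size=100,
)
```

Exact string matching is a crude stand-in for the semantic grouping that human scorers do, so counts like these are best treated as a first pass.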

Datasets

  • Only Connect (used for creative problem tasks)

Benchmarks

  • HaluEval
  • TruthfulQA
  • Med-HALT

Context Entities

Models

  • Multi-agent debate setups

Metrics

  • self-reflection scoring
  • classifier-based hallucination detectors

Datasets

  • creative task collections adapted from cognitive tests

Benchmarks

  • domain-specific hallucination suites (medical, legal)