A short warning reduces how believable LLM 'hallucinations' feel, but it does not stop people from liking or sharing them.

April 4, 2024 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.5

Cost Impact Score

0.2

Citation Count

11

Authors

Mahjabin Nahar, Haeseung Seo, Eun-Ju Lee, Aiping Xiong, Dongwon Lee

Links

Abstract / PDF

Why It Matters For Business

A short warning label reduces how believable AI-generated false claims feel and increases negative feedback. Use warnings to improve user flagging and training signals without hurting trust in accurate outputs.

Summary TLDR

This human-subjects study (N=419) tested whether a short warning label improves people's ability to spot LLM 'hallucinations' (fabricated or unverifiable claims). Participants saw genuine, minor-hallucination, and major-hallucination answers (generated from TruthfulQA via GPT-3.5). A single warning reduced perceived accuracy and increased dislikes for hallucinations, improved detection rates slightly, but did not meaningfully reduce likes or shares. Minor hallucinations were the hardest to spot. Practical takeaway: simple UI warnings help readers notice errors but are not enough to stop engagement or propagation.

Problem Statement

LLMs sometimes produce incorrect or fabricated text ('hallucinations'). We do not know how well untrained users can detect different severity levels of hallucination and whether a short warning label helps or causes blind skepticism.

Main Contribution

Design and run a controlled human experiment (N=419) comparing genuine, minor, and major hallucinated answers from GPT-3.5 using TruthfulQA prompts.

Measure perceived accuracy and engagement (like, dislike, share) under two conditions: with or without a short warning tag.

Show that warnings lower perceived accuracy and increase dislike for hallucinations, but have little impact on likes or shares; minor hallucinations are most deceptive.

Key Findings

A short warning lowered perceived accuracy for hallucinated answers but not for genuine answers.

Numbers: Perceived accuracy (CON = control, WARN = warning): minor CON 3.27 → WARN 3.13; major CON 2.56 → WARN 2.30; genuine CON 3.97 → WARN 4.00

People reliably rank answers by truthfulness: genuine > minor hallucination > major hallucination.

Numbers: Mean perceived accuracy: genuine 3.99, minor 3.21, major 2.43; detection rates: genuine 72.28%, minor 28.56%, major 52.94%

Warning increased dislikes but did not reduce likes or shares for hallucinated content.

Numbers: Dislike rate CON 0.266 → WARN 0.308; likes and shares show no significant change (p > 0.05)

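Human detection of hallucinations is limited: without a warning, minor hallucinations were detected below the 40% chance level and major hallucinations only about half the time.

Numbers: Control detection: minor 25.28% (below the 40% chance level), major 48.39%; with warning: minor 31.76%, major 57.39%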

Results

Perceived accuracy

Value: genuine 3.99, minor 3.21, major 2.43 (scale 1-5)

Detection rate

Value: genuine 72.28%, minor 28.56%, major 52.94%

Baseline: chance level defined as 40%

Effect of warning on detection

Value: minor CON 25.28% → WARN 31.76%; major CON 48.39% → WARN 57.39%
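To make the size of the warning effect concrete, the following minimal sketch computes the percentage-point and relative gains in detection from the rates reported above, and checks each warned rate against the stated 40% chance level. The function and variable names are illustrative and do not come from the paper; only the four rates and the chance level are taken from this summary.

```python
# Detection rates (%) copied from the figures reported in this summary.
DETECTION = {
    "minor": {"CON": 25.28, "WARN": 31.76},
    "major": {"CON": 48.39, "WARN": 57.39},
}
CHANCE_LEVEL = 40.0  # baseline stated in the summary


def warning_effect(con: float, warn: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in %) from CON to WARN."""
    abs_gain = warn - con
    rel_gain = abs_gain / con * 100
    return round(abs_gain, 2), round(rel_gain, 1)


# Illustrative names: per-severity gains and whether the warned rate beats chance.
effects = {sev: warning_effect(r["CON"], r["WARN"]) for sev, r in DETECTION.items()}
above_chance = {sev: r["WARN"] > CHANCE_LEVEL for sev, r in DETECTION.items()}

for sev in DETECTION:
    abs_gain, rel_gain = effects[sev]
    print(f"{sev}: +{abs_gain} points ({rel_gain}% relative), "
          f"above chance with warning: {above_chance[sev]}")
```

Read this way, the warning adds roughly 6.5 points for minor and 9 points for major hallucinations, yet minor detection stays under the 40% chance level even with the warning shown.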

Engagement rates (mean proportions)

Value: Like: genuine 0.71, minor 0.52, major 0.34; Dislike: genuine 0.10, minor 0.28, major 0.48; Share: genuine 0.15, minor 0.

Who Should Care

What To Try In 7 Days

Add a short, visible warning on AI answers: 'Responses may contain inaccurate information.'

Track 'dislike' clicks as a low-cost signal to feed RLHF or model monitoring pipelines.

A/B test warning vs no-warning on a small live cohort and measure dislike, share, and support tickets.
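One way to read out the A/B test suggested above is a two-proportion z-test on the dislike rate. This is a minimal sketch using only the Python standard library. It is a simple stand-in for a product experiment, not the analysis used in the study (which reports ANOVA). The function name and the cohort counts are hypothetical: the counts are chosen to mirror the reported dislike rates of 0.266 and 0.308, with an assumed 1,000 rated answers per arm.

```python
import math


def two_proportion_z_test(x_a: int, n_a: int, x_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided z-test for a difference in proportions (pooled standard error).

    x_a / n_a: events and trials in arm A (e.g. no warning).
    x_b / n_b: events and trials in arm B (e.g. warning shown).
    Returns (z statistic, two-sided p-value).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one trial")
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical cohort: 1,000 rated answers per arm (assumed, not from the paper).
z, p = two_proportion_z_test(x_a=266, n_a=1000, x_b=308, n_b=1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The same function can be applied to share counts or support-ticket counts per arm. With small live cohorts the test has little power, so a non-significant result on shares should not be read as proof that the warning has no effect.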

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Participants were US-based Prolific workers and may be more tech-savvy than the general population.
  • Stimuli were generated using GPT-3.5-Turbo and game-style prompts; results may differ with other LLMs or generation methods.
  • Study used a Q/A format (TruthfulQA) and focused on two handcrafted hallucination severity levels; other formats or finer hallucination types were not tested.

When Not To Use

  • Do not assume warnings stop sharing or liking on social platforms.
  • Do not generalize detection rates to other LLMs or non-English audiences without retesting.

Failure Modes

  • Warnings trigger only mild skepticism and may not change sharing behavior.
  • Minor hallucinations can pass as truthful and evade both users and simple warning-based defenses.
  • In some populations, a warning could make users overcautious and reduce their overall preference for content.

Core Entities

Models

  • GPT-3.5-Turbo
  • GPT-3 (used for entailment checks)

Metrics

  • Accuracy
  • Like/Dislike/Share rates

Datasets

  • TruthfulQA (selected 54 questions)

Benchmarks

  • TruthfulQA

Context Entities

Models

  • References to GPT-4/GPT-3 family in discussion

Metrics

  • ANOVA; η²p effect sizes

Datasets

  • Mentioned benchmarks: HaluEval, FADE