A short warning reduces how believable LLM 'hallucinations' feel, but it does not stop people from liking or sharing them.

April 4, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The experiment has a robust sample (N=419) and clear quantitative effects, but findings are limited to GPT-3.5 outputs, TruthfulQA-style Q/A format, and US Prolific participants.

Citations: 11

Evidence Strength: 0.90

Confidence: 0.90

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 30%

Novelty: 50%

Authors

Mahjabin Nahar, Haeseung Seo, Eun-Ju Lee, Aiping Xiong, Dongwon Lee


Why It Matters For Business

A short warning label reduces how believable AI-generated false claims feel and increases negative feedback. Use warnings to improve user flagging and training signals without hurting trust in accurate outputs.

Who Should Care

Summary TLDR

This human-subjects study (N=419) tested whether a short warning label improves people's ability to spot LLM 'hallucinations' (fabricated or unverifiable claims). Participants saw genuine, minor-hallucination, and major-hallucination answers (generated from TruthfulQA via GPT-3.5). A single warning reduced perceived accuracy and increased dislikes for hallucinations, improved detection rates slightly, but did not meaningfully reduce likes or shares. Minor hallucinations were the hardest to spot. Practical takeaway: simple UI warnings help readers notice errors but are not enough to stop engagement or propagation.

Problem Statement

LLMs sometimes produce incorrect or fabricated text ('hallucinations'). We do not know how well untrained users can detect different severity levels of hallucination and whether a short warning label helps or causes blind skepticism.

Main Contribution

Design and run a controlled human experiment (N=419) comparing genuine, minor, and major hallucinated answers from GPT-3.5 using TruthfulQA prompts.

Measure perceived accuracy and engagement (like, dislike, share) under two conditions: with or without a short warning tag.

Key Findings

A short warning lowered perceived accuracy for hallucinated answers but not for genuine answers.

Numbers: Perceived accuracy: minor CON 3.27 → WARN 3.13; major CON 2.56 → WARN 2.30; genuine CON 3.97 → WARN 4.00

Practical Use: Add brief warning text in UIs to help users spot incorrect LLM replies while avoiding reduced trust in correct answers.

Evidence Ref: Table 3; Finding 1; Fig.3(a)
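The size of the warning effect can be read off the reported means by subtracting the control (CON) value from the warning (WARN) value for each answer type. The sketch below does only that arithmetic; the dictionary layout and the `warning_delta` helper are illustrative names, not part of the study, and a raw difference in means says nothing about statistical significance on its own.

```python
# Mean perceived accuracy (1-5 scale) copied from the Numbers line above.
# CON = control (no warning), WARN = warning label shown.
reported_means = {
    "genuine": {"CON": 3.97, "WARN": 4.00},
    "minor": {"CON": 3.27, "WARN": 3.13},
    "major": {"CON": 2.56, "WARN": 2.30},
}


def warning_delta(means):
    """Return WARN minus CON per answer type, rounded to two decimals."""
    return {
        answer_type: round(cond["WARN"] - cond["CON"], 2)
        for answer_type, cond in means.items()
    }


deltas = warning_delta(reported_means)
for answer_type, delta in deltas.items():
    # Negative values mean the warning lowered perceived accuracy.
    print(f"{answer_type}: {delta:+.2f}")
```

On these figures the drop is largest for major hallucinations, smaller for minor ones, and close to zero for genuine answers.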

People reliably rank answers by truthfulness: genuine > minor hallucination > major hallucination.

Numbers: Mean perceived accuracy: genuine 3.99, minor 3.21, major 2.43; detection rates: genuine 72.28%, minor 28.56%, major 52.94%

Practical Use: Expect minor fabrications to slip past users more often; focus detection and UX on subtle (minor) errors.

Evidence Ref: Table 3; Finding 2; Fig.3(a)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | genuine 3.99, minor 3.21, major 2.43 (scale 1-5) | (none) | genuine > minor > major | All participants, collapsed | Figure 3(a); Table 3 | Table 3
Accuracy | genuine 72.28%, minor 28.56%, major 52.94% | Chance level defined as 40% | Minor below chance; major above chance | Control and warning combined | Table 3; Discussion | Table 3
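To make the "below chance / above chance" reading concrete, the detection rates can be compared with the 40% chance level stated in the table. This is a minimal sketch of that subtraction only; `margin_over_chance` is a hypothetical helper name, and the result is a descriptive margin, not a significance test.

```python
# Detection rates (percent) and chance level as given in the Results table.
CHANCE_LEVEL = 40.0
detection_rates = {"genuine": 72.28, "minor": 28.56, "major": 52.94}


def margin_over_chance(rates, chance=CHANCE_LEVEL):
    """Return each rate minus the chance level, in percentage points."""
    return {
        answer_type: round(rate - chance, 2)
        for answer_type, rate in rates.items()
    }


margins = margin_over_chance(detection_rates)
for answer_type, margin in margins.items():
    side = "above" if margin > 0 else "below"
    print(f"{answer_type}: {abs(margin):.2f} points {side} chance")
```

Only the minor-hallucination rate falls under the chance level, which is consistent with minor hallucinations being the hardest to spot.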

What To Try In 7 Days

Add a short, visible warning on AI answers: 'Responses may contain inaccurate information.'

Track 'dislike' clicks as a low-cost signal to feed RLHF or model monitoring pipelines.

A/B test warning vs no-warning on a small live cohort and measure dislike, share, and support tickets.
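One way to evaluate the A/B test suggested above is a two-proportion z-test on dislike rates, sketched here with the Python standard library only. The function name and the example counts are hypothetical placeholders, not data from the study, and the test assumes independent users and cohorts large enough for the normal approximation.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    hits_a / n_a: dislike clicks and impressions in the no-warning arm.
    hits_b / n_b: dislike clicks and impressions in the warning arm.
    Returns (z, p_value); z > 0 means the warning arm has the higher rate.
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("Standard error is zero; rates are all 0% or 100%.")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for illustration only.
z, p_value = two_proportion_z_test(hits_a=120, n_a=1000, hits_b=150, n_b=1000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

The same function can be reused for share rates or support-ticket rates by swapping in the corresponding counts.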

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Participants were US-based Prolific workers and may be more tech-savvy than the general population.

Stimuli were generated using GPT-3.5-Turbo and game-style prompts; results may differ with other LLMs or generation methods.

When Not To Use

Do not assume warnings stop sharing or liking on social platforms.

Do not generalize detection rates to other LLMs or non-English audiences without retesting.

Failure Modes

Warnings trigger only mild skepticism and may not change sharing behavior.

Minor hallucinations can pass as truthful and evade both users and simple warning-based defenses.

Core Entities

Models

GPT-3.5-Turbo

GPT-3 (used for entailment checks)

Metrics

Accuracy

Like/Dislike/Share rates

Datasets

TruthfulQA (selected 54 questions)

Benchmarks

TruthfulQA

Context Entities

Models

References to GPT-4/GPT-3 family in discussion

Metrics

ANOVA; η²p effect sizes

Datasets

Mentioned benchmarks: HaluEval, FADE