Overview
Production Readiness
0.3
Novelty Score
0.5
Cost Impact Score
0.2
Citation Count
11
Why It Matters For Business
A short warning label reduces how believable AI-generated false claims feel and increases negative feedback. Use warnings to improve user flagging and training signals without hurting trust in accurate outputs.
Summary TLDR
This human-subjects study (N=419) tested whether a short warning label improves people's ability to spot LLM 'hallucinations' (fabricated or unverifiable claims). Participants saw genuine, minor-hallucination, and major-hallucination answers (generated from TruthfulQA via GPT-3.5). A single warning reduced perceived accuracy and increased dislikes for hallucinations, improved detection rates slightly, but did not meaningfully reduce likes or shares. Minor hallucinations were the hardest to spot. Practical takeaway: simple UI warnings help readers notice errors but are not enough to stop engagement or propagation.
Problem Statement
LLMs sometimes produce incorrect or fabricated text ('hallucinations'). It is unclear how well untrained users can detect hallucinations of different severity, and whether a short warning label helps or instead causes blind skepticism.
Main Contribution
Design and run a controlled human experiment (N=419) comparing genuine, minor, and major hallucinated answers from GPT-3.5 using TruthfulQA prompts.
Measure perceived accuracy and engagement (like, dislike, share) under two conditions: with and without a short warning tag.
Show that warnings lower perceived accuracy and increase dislike for hallucinations, but have little impact on likes or shares; minor hallucinations are most deceptive.
Key Findings
A short warning lowered perceived accuracy for hallucinated answers but not for genuine answers.
People reliably rank answers by truthfulness: genuine > minor hallucination > major hallucination.
Warning increased dislikes but did not reduce likes or shares for hallucinated content.
Human detection of hallucinations is limited and often below practical levels.
Results
Accuracy
Effect of warning on detection
Engagement rates (mean proportions)
Who Should Care
What To Try In 7 Days
Add a short, visible warning on AI answers: 'Responses may contain inaccurate information.'
Track 'dislike' clicks as a low-cost signal to feed RLHF or model monitoring pipelines.
A/B test warning vs no-warning on a small live cohort and measure dislike, share, and support tickets.
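The A/B comparison above could be analyzed with a two-proportion z-test on dislike rates. The following is a minimal sketch under stated assumptions, not the study's own analysis (the study reports ANOVA results). The function name `two_proportion_z_test` and the example counts are hypothetical and chosen only for illustration.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for H0: rate_a == rate_b, using the pooled proportion.

    hits_a / n_a: dislike count and impressions in the no-warning cohort.
    hits_b / n_b: dislike count and impressions in the warning cohort.
    Returns (z, p_value); z > 0 means the warning cohort has the higher rate.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("cohort sizes must be positive")
    rate_a = hits_a / n_a
    rate_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Both cohorts are all-zero or all-one: no detectable difference.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts, not data from the study.
    z, p = two_proportion_z_test(hits_a=40, n_a=1000, hits_b=65, n_b=1000)
    print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

The same function can be reused for share rates and support-ticket rates. With a small live cohort, the normal approximation may be weak when counts are low, so an exact test may be a safer choice in that case.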
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Participants were US-based Prolific workers and may be more tech-savvy than the general population.
- Stimuli were generated using GPT-3.5-Turbo and game-style prompts; results may differ with other LLMs or generation methods.
- Study used a Q/A format (TruthfulQA) and focused on two handcrafted hallucination severity levels; other formats or finer hallucination types were not tested.
When Not To Use
- Do not assume warnings stop sharing or liking on social platforms.
- Do not generalize detection rates to other LLMs or non-English audiences without retesting.
Failure Modes
- Warnings trigger only mild skepticism and may not change sharing behavior.
- Minor hallucinations can pass as truthful and evade both users and simple warning-based defenses.
- In some populations, a warning could make users overcautious and reduce overall preference for content, including accurate content.
Core Entities
Models
- GPT-3.5-Turbo
- GPT-3 (used for entailment checks)
Metrics
- Accuracy
- Like/Dislike/Share rates
Datasets
- TruthfulQA (selected 54 questions)
Benchmarks
- TruthfulQA
Context Entities
Models
- References to GPT-4/GPT-3 family in discussion
Metrics
- ANOVA; η²p effect sizes
Datasets
- Mentioned benchmarks: HaluEval, FADE

