Overview
The experiment has a robust sample (N=419) and clear quantitative effects, but findings are limited to GPT-3.5 outputs, TruthfulQA-style Q/A format, and US Prolific participants.
Citations: 11
Evidence Strength: 0.90
Confidence: 0.90
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 20%
Production readiness: 30%
Novelty: 50%
Why It Matters For Business
A short warning label reduces how believable AI-generated false claims feel and increases negative feedback. Use warnings to improve user flagging and training signals without hurting trust in accurate outputs.
Summary TLDR
This human-subjects study (N=419) tested whether a short warning label improves people's ability to spot LLM 'hallucinations' (fabricated or unverifiable claims). Participants saw genuine, minor-hallucination, and major-hallucination answers (generated from TruthfulQA via GPT-3.5). A single warning reduced perceived accuracy and increased dislikes for hallucinations, improved detection rates slightly, but did not meaningfully reduce likes or shares. Minor hallucinations were the hardest to spot. Practical takeaway: simple UI warnings help readers notice errors but are not enough to stop engagement or propagation.
Problem Statement
LLMs sometimes produce incorrect or fabricated text ('hallucinations'). It is unclear how well untrained users can detect hallucinations of different severity, and whether a short warning label helps them or instead induces blanket skepticism toward all answers.
Main Contribution
Design and run a controlled human experiment (N=419) comparing genuine, minor, and major hallucinated answers from GPT-3.5 using TruthfulQA prompts.
Measure perceived accuracy and engagement (like, dislike, share) under two conditions: with and without a short warning tag.
Key Findings
A short warning lowered perceived accuracy for hallucinated answers but not for genuine answers.
People reliably rank answers by truthfulness: genuine > minor hallucination > major hallucination.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Perceived accuracy (mean rating, scale 1-5) | genuine 3.99, minor 3.21, major 2.43 | n/a | genuine > minor > major | All participants, collapsed | Figure 3(a); Table 3 | Table 3 |
| Detection accuracy (% correct) | genuine 72.28%, minor 28.56%, major 52.94% | Chance level defined as 40% | Minor below chance; genuine and major above chance | Control and warning combined | Table 3; Discussion | Table 3 |
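The comparison against chance in the second row is simple arithmetic, sketched below. It assumes the three percentages are detection rates and uses the 40% chance level from the table; the function name `margin_over_chance` is illustrative and not from the study.

```python
# Detection rates (percent) and chance level as reported in the results table.
CHANCE_LEVEL = 40.0
DETECTION_RATES = {"genuine": 72.28, "minor": 28.56, "major": 52.94}


def margin_over_chance(rates, chance=CHANCE_LEVEL):
    """Return percentage-point distance from chance for each answer type.

    Positive values are above chance, negative values are below chance.
    """
    return {kind: round(rate - chance, 2) for kind, rate in rates.items()}


if __name__ == "__main__":
    for kind, margin in margin_over_chance(DETECTION_RATES).items():
        side = "above" if margin > 0 else "below"
        print(f"{kind}: {abs(margin):.2f} points {side} chance")
```

Under these assumptions, only minor hallucinations fall below the chance level, which matches the claim that they are the hardest to spot.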
What To Try In 7 Days
Add a short, visible warning on AI answers: 'Responses may contain inaccurate information.'
Track 'dislike' clicks as a low-cost signal to feed RLHF or model monitoring pipelines.
A/B test warning vs no-warning on a small live cohort and measure dislike, share, and support tickets.
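One way to evaluate the A/B test above is a two-proportion z-test on dislike rates, sketched here with the Python standard library only. The `Cohort` structure, its field names, and the example counts are hypothetical placeholders and not data from the study. The test also assumes independent observations, which may not hold if the same user contributes many answers.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Cohort:
    """Counts for one arm of the A/B test (hypothetical structure)."""
    answers_shown: int
    dislikes: int

    @property
    def dislike_rate(self) -> float:
        return self.dislikes / self.answers_shown


def two_proportion_z_test(control: Cohort, warning: Cohort) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for warning minus control dislike rate."""
    n1, n2 = control.answers_shown, warning.answers_shown
    # Pooled rate under the null hypothesis that both arms share one rate.
    pooled = (control.dislikes + warning.dislikes) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so no detectable difference
    z = (warning.dislike_rate - control.dislike_rate) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Placeholder counts for illustration only.
    control = Cohort(answers_shown=1000, dislikes=100)
    warning = Cohort(answers_shown=1000, dislikes=150)
    z, p = two_proportion_z_test(control, warning)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

The same function can be reused for share or support-ticket rates by swapping in the corresponding counts. A positive z means the warning arm had the higher rate.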
Risks & Boundaries
Limitations
Participants were US-based Prolific workers and may be more tech-savvy than the general population.
Stimuli were generated using GPT-3.5-Turbo and game-style prompts; results may differ with other LLMs or generation methods.
When Not To Use
Do not assume warnings stop sharing or liking on social platforms.
Do not generalize detection rates to other LLMs or non-English audiences without retesting.
Failure Modes
Warnings trigger only mild skepticism and may not change sharing behavior.
Minor hallucinations can pass as truthful and evade both users and simple warning-based defenses.

