A short warning reduces how believable LLM 'hallucinations' feel, but it does not stop people from liking or sharing them.

Overview

Decision Snapshot: Needs Validation

The experiment has a robust sample (N=419) and clear quantitative effects, but findings are limited to GPT-3.5 outputs, TruthfulQA-style Q/A format, and US Prolific participants.

Citations: 11

Evidence Strength: 0.90

Confidence: 0.90

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 30%

Novelty: 50%

Authors

Mahjabin Nahar, Haeseung Seo, Eun-Ju Lee, Aiping Xiong, Dongwon Lee

Why It Matters For Business

A short warning label reduces how believable AI-generated false claims feel and increases negative feedback. Use warnings to improve user flagging and training signals without hurting trust in accurate outputs.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

This human-subjects study (N=419) tested whether a short warning label improves people's ability to spot LLM 'hallucinations' (fabricated or unverifiable claims). Participants saw genuine, minor-hallucination, and major-hallucination answers (generated from TruthfulQA via GPT-3.5). A single warning reduced perceived accuracy and increased dislikes for hallucinations, improved detection rates slightly, but did not meaningfully reduce likes or shares. Minor hallucinations were the hardest to spot. Practical takeaway: simple UI warnings help readers notice errors but are not enough to stop engagement or propagation.

Problem Statement

LLMs sometimes produce incorrect or fabricated text ('hallucinations'). We do not know how well untrained users can detect different severity levels of hallucination and whether a short warning label helps or causes blind skepticism.

Main Contribution

Design and run a controlled human experiment (N=419) comparing genuine, minor, and major hallucinated answers from GPT-3.5 using TruthfulQA prompts.

Measure perceived accuracy and engagement (like, dislike, share) under two conditions: with or without a short warning tag.

Key Findings

A short warning lowered perceived accuracy for hallucinated answers but not for genuine answers.

Numbers: Perceived accuracy (CON = control, WARN = warning): minor CON 3.27 → WARN 3.13; major CON 2.56 → WARN 2.30; genuine CON 3.97 → WARN 4.00

Practical Use: Add brief warning text in UIs to help users spot incorrect LLM replies while avoiding reduced trust in correct answers.

Evidence Ref: Table 3; Finding 1; Fig.3(a)
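The size of the warning effect can be read off the reported means directly. The sketch below is only arithmetic on the figures quoted above; the variable and function names are illustrative and not from the paper.

```python
# Mean perceived accuracy (1-5 scale) as quoted in the Numbers line above.
# CON = control (no warning), WARN = warning condition.
reported_means = {
    "genuine": {"CON": 3.97, "WARN": 4.00},
    "minor": {"CON": 3.27, "WARN": 3.13},
    "major": {"CON": 2.56, "WARN": 2.30},
}


def warning_deltas(means):
    """Return WARN minus CON for each answer type, rounded to 2 decimals."""
    return {
        answer_type: round(cond["WARN"] - cond["CON"], 2)
        for answer_type, cond in means.items()
    }


deltas = warning_deltas(reported_means)
for answer_type, delta in deltas.items():
    print(f"{answer_type}: {delta:+.2f}")
```

The shift is largest for major hallucinations and close to zero for genuine answers, which is the pattern the finding describes. These are differences in means only; significance is reported in the paper's own analysis, not here.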

People reliably rank answers by truthfulness: genuine > minor hallucination > major hallucination.

Numbers: Mean perceived accuracy: genuine 3.99, minor 3.21, major 2.43; detection rates: genuine 72.28%, minor 28.56%, major 52.94%

Practical Use: Expect minor fabrications to slip past users more often; focus detection and UX on subtle (minor) errors.

Evidence Ref: Table 3; Finding 2; Fig.3(a)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perceived accuracy (mean, 1-5 scale)	genuine 3.99, minor 3.21, major 2.43	n/a	genuine > minor > major	All participants, collapsed	Figure 3(a); Table 3	Table 3
Detection rate	genuine 72.28%, minor 28.56%, major 52.94%	Chance level defined as 40%	Minor below chance; major above chance	Control and warning combined	Table 3; Discussion	Table 3
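To make the chance comparison concrete, the sketch below subtracts the 40% chance level stated in the table from each reported detection rate. It uses only numbers from this page; the names are illustrative.

```python
CHANCE_LEVEL = 40.0  # percent, as stated in the Results table

# Detection rates (percent), control and warning combined.
detection_rates = {"genuine": 72.28, "minor": 28.56, "major": 52.94}


def gap_to_chance(rates, chance=CHANCE_LEVEL):
    """Return each rate minus the chance level, in percentage points."""
    return {kind: round(rate - chance, 2) for kind, rate in rates.items()}


gaps = gap_to_chance(detection_rates)
below_chance = [kind for kind, gap in gaps.items() if gap < 0]
print(gaps)
print("Below chance:", below_chance)
```

Only minor hallucinations fall below the chance level, which is why the page singles them out as the hardest case.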

What To Try In 7 Days

Add a short, visible warning on AI answers: 'Responses may contain inaccurate information.'

Track 'dislike' clicks as a low-cost signal to feed RLHF or model monitoring pipelines.

A/B test warning vs no-warning on a small live cohort and measure dislike, share, and support tickets.
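One way to read out the A/B test above is a two-proportion z-test on dislike rates. This is a minimal sketch under stated assumptions: the function name and the example counts are hypothetical (they are not data from the study), the test is not the analysis the paper used (it reports ANOVA), and it assumes each impression is independent.

```python
import math


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference in proportions (pooled variance).

    Group A is the no-warning cohort, group B the warning cohort.
    Returns (z, p_value); z > 0 means B has the higher rate.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("group sizes must be positive")
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so no detectable difference
    z = (p_b - p_a) / se
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for illustration only: dislikes out of answers shown.
z, p = two_proportion_z_test(events_a=40, n_a=1000, events_b=60, n_b=1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The same function can be reused for share rates and support tickets. Given the study's result, a rise in dislikes without a matching drop in shares would be consistent with what the authors observed.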

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Data URLs

https://github.com/MahjabinNahar/fakes-of-varying-shades-surveymaterials

Risks & Boundaries

Limitations

Participants were US-based Prolific workers and may be more tech-savvy than the general population.

Stimuli were generated using GPT-3.5-Turbo and game-style prompts; results may differ with other LLMs or generation methods.

When Not To Use

Do not assume warnings stop sharing or liking on social platforms.

Do not generalize detection rates to other LLMs or non-English audiences without retesting.

Failure Modes

Warnings trigger only mild skepticism and may not change sharing behavior.

Minor hallucinations can pass as truthful and evade both users and simple warning-based defenses.

Core Entities

Models

GPT-3.5-Turbo; GPT-3 (used for entailment checks)

Metrics

Accuracy; Like/Dislike/Share rates

Datasets

TruthfulQA (selected 54 questions)

Benchmarks

TruthfulQA

Context Entities

Models

References to GPT-4/GPT-3 family in discussion

Metrics

ANOVA; η²p effect sizes

Datasets

Mentioned benchmarks: HaluEval, FADE
