Overview
A well-powered human study supports the main claims: LLM explanations speed up verification but induce dangerous over-reliance; grounding and contrastive prompts help but do not fully beat retrieval.
Citations: 6
Evidence Strength: 0.80
Confidence: 0.88
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 7/7
Findings with evidence refs: 7/7
Results with explicit delta: 7/8
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 30%
Why It Matters For Business
LLM explanations let teams verify claims much faster but can mislead people when wrong; for important decisions, prioritize retrieval-grounded workflows or add checks to avoid over-reliance.
Summary TLDR
A human study (1,500 annotations, 80 workers) compares ChatGPT explanations and Wikipedia retrieval for fact-checking hard claims. Explanations and retrieved passages yield similar human accuracy (~74% and ~73%, against a 59% no-help baseline), but explanations are much faster to use (~1.0 min vs ~2.5 min). Users over-rely heavily on LLM explanations: when the explanation is wrong, human accuracy falls to 35%. Contrastive explanations reduce that over-reliance (raising accuracy to 56% on those cases) but do not beat retrieval overall. Grounding explanations on retrieved passages improves model accuracy (59.5% → 78%).
Problem Statement
People use LLMs and search results to check claims. We need to know which tool helps humans verify facts more accurately and whether LLM explanations help or hurt real users.
Main Contribution
Large human study comparing ChatGPT free-text explanations vs top-10 Wikipedia passages for fact verification on adversarial claims.
Measured speed and accuracy trade-offs: explanations speed decisions but encourage over-reliance when wrong.
Key Findings
ChatGPT explanations and retrieved Wikipedia passages both improve human accuracy over no help.
Reading LLM explanations is ~2.5× faster than reading retrieved passages.
Results
| Metric | Value | Baseline (no help) | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Human accuracy (ChatGPT explanation) | 0.74 ±0.09 | 0.59 ±0.12 | +0.15 | All evaluated claims (200 sampled from FoolMeTwice) | Sec 5 (Fig 2) | Sec 5 |
| Human accuracy (Wikipedia retrieval) | 0.73 ±0.12 | 0.59 ±0.12 | +0.14 | All evaluated claims (200 sampled from FoolMeTwice) | Sec 5 (Fig 2) | Sec 5 |
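The deltas and the speed ratio follow directly from the reported figures. A minimal sketch of that arithmetic is below; it assumes the first table row is the ChatGPT-explanation condition and the second is the retrieval condition (matching the order used in the summary), and the variable names are illustrative only.

```python
# Figures copied from the summary and the results table.
# Assumption: row 1 = ChatGPT explanation, row 2 = Wikipedia retrieval.
baseline = 0.59
conditions = {
    "ChatGPT explanation": 0.74,
    "Wikipedia retrieval": 0.73,
}


def delta(accuracy: float, base: float) -> float:
    """Absolute accuracy gain over the no-help baseline, rounded to 2 places."""
    return round(accuracy - base, 2)


deltas = {name: delta(acc, baseline) for name, acc in conditions.items()}

# Approximate decision times in minutes, as reported in the summary.
minutes = {"explanation": 1.0, "retrieval": 2.5}
speedup = minutes["retrieval"] / minutes["explanation"]

for name, d in deltas.items():
    print(f"{name}: {d:+.2f} over baseline")
print(f"Explanations are about {speedup:.1f}x faster than retrieved passages")
```

The two conditions differ by only one accuracy point, which is well inside the reported ± ranges, so the practical difference between them is speed rather than accuracy.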
What To Try In 7 Days
Ground LLM explanations on retrieved passages before showing them to users.
Use retrieval (top passages) as default for high-stakes verification workflows.
Pilot contrastive prompts (support + refute) for triage where users can inspect both sides.
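As a starting point for the first and third items, the sketch below builds a retrieval-grounded prompt and a pair of contrastive (support and refute) prompts. It only constructs strings and calls no model API. The function names and the prompt wording are hypothetical; they are not the prompts used in the study.

```python
from typing import Dict, List


def grounded_prompt(claim: str, passages: List[str]) -> str:
    """Build a prompt that restricts the explanation to numbered passages.

    Hypothetical wording, for illustration only.
    """
    evidence = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Using only the evidence below, explain whether the claim is true or "
        "false. Cite passage numbers, and say so if the evidence is "
        "insufficient.\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Claim: {claim}"
    )


def contrastive_prompts(claim: str) -> Dict[str, str]:
    """Build one prompt arguing for the claim and one arguing against it."""
    return {
        "support": f"Explain why the following claim could be true.\n\nClaim: {claim}",
        "refute": f"Explain why the following claim could be false.\n\nClaim: {claim}",
    }


if __name__ == "__main__":
    claim = "Example claim to verify."
    print(grounded_prompt(claim, ["First retrieved passage.", "Second retrieved passage."]))
    for side, prompt in contrastive_prompts(claim).items():
        print(f"--- {side} ---\n{prompt}")
```

Showing both contrastive outputs side by side, rather than a single verdict, is what gives the reader something to weigh instead of something to copy.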
Risks & Boundaries
Limitations
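Limited participant pool: workers recruited on Prolific, 16 annotators per condition; results may not generalize to expert fact-checkers.
Single model checkpoint (GPT-3.5-turbo-0613), accessed as a time-limited API snapshot; results may not carry over to other models or later versions.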
When Not To Use
High-stakes verification where mistaken LLM explanations could cause harm
Workflows with low retrieval recall (missing evidence in top-10)
Failure Modes
Users adopt LLM answers verbatim even when explanations are factually wrong
LLM generates convincing but incorrect supporting/refuting rationales (hallucinations)
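One way to monitor the first failure mode in a pilot is to split human accuracy by whether the LLM explanation was itself correct, which is the comparison behind the over-reliance finding. The sketch below assumes each annotation records a gold label, the verdict stated by the explanation, and the human decision; the `Annotation` type and the toy rows are hypothetical and are not study data.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Annotation:
    gold: bool           # true label of the claim
    model_verdict: bool  # verdict stated in the LLM explanation
    human_verdict: bool  # what the annotator decided


def accuracy_by_explanation_correctness(
    rows: Iterable[Annotation],
) -> Dict[str, Optional[float]]:
    """Human accuracy, split by whether the explanation's verdict was right.

    Returns None for a bucket that has no annotations.
    """
    buckets: Dict[str, List[bool]] = {
        "explanation_correct": [],
        "explanation_wrong": [],
    }
    for r in rows:
        key = "explanation_correct" if r.model_verdict == r.gold else "explanation_wrong"
        buckets[key].append(r.human_verdict == r.gold)
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}


# Toy rows for illustration only (not study data).
toy = [
    Annotation(gold=True, model_verdict=True, human_verdict=True),
    Annotation(gold=False, model_verdict=False, human_verdict=False),
    Annotation(gold=True, model_verdict=False, human_verdict=False),  # followed a wrong explanation
    Annotation(gold=False, model_verdict=True, human_verdict=False),  # resisted a wrong explanation
]
result = accuracy_by_explanation_correctness(toy)
print(result)
```

A large gap between the two buckets is the signal to watch: it means users are tracking the model's verdict rather than the evidence.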

