Overview
Production Readiness
0.6
Novelty Score
0.3
Cost Impact Score
0.4
Citation Count
6
Why It Matters For Business
LLM explanations let teams verify claims much faster but can mislead people when wrong; for important decisions, prioritize retrieval-grounded workflows or add checks to avoid over-reliance.
Summary TLDR
A human study (1,500 annotations, 80 workers) compares ChatGPT explanations and Wikipedia retrieval for fact-checking hard claims. Explanations and retrieved passages yield similar human accuracy (~74% and ~73% respectively, against a 59% baseline with no help), but explanations are much faster to use (~1.0 min vs ~2.5 min per claim). Users heavily over-rely on LLM explanations: when the explanation is wrong, human accuracy falls to 35%. Contrastive explanations reduce that over-reliance (raising accuracy to 56% on those cases) but do not beat retrieval overall. Grounding explanations on retrieved passages improves model accuracy (59.5% → 78%).
Problem Statement
People use LLMs and search results to check claims. We need to know which tool helps humans verify facts more accurately and whether LLM explanations help or hurt real users.
Main Contribution
Large human study comparing ChatGPT free-text explanations vs top-10 Wikipedia passages for fact verification on adversarial claims.
Measured speed and accuracy trade-offs: explanations speed decisions but encourage over-reliance when wrong.
Introduced and evaluated contrastive explanations (support + refute) and a retrieval+explanation setup.
Showed that grounding LLM prompts in retrieved passages boosts explanation accuracy (59.5% → 78%).
Analyzed retrieval recall effects, confidence calibration, and user rationales to explain failure modes.
Key Findings
ChatGPT explanations and retrieved Wikipedia passages both improve human accuracy over no help.
Reading LLM explanations is ~2.5× faster than reading retrieved passages.
Users over-rely on LLM explanations when those explanations are wrong, causing large accuracy drops.
Contrastive explanations reduce over-reliance on wrong explanations but can lower accuracy in cases where the single, non-contrastive explanation is correct.
Grounding ChatGPT on retrieved passages substantially increases its answer accuracy.
Retrieval quality strongly affects both explanation and human accuracy.
Combining retrieval and explanation provides no clear complementary benefit over retrieval alone and increases time.
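The over-reliance finding above comes down to splitting human accuracy by whether the explanation shown was itself correct. A minimal sketch of that breakdown follows. The record fields (`explanation_correct`, `human_correct`) and the toy data are hypothetical and invented for illustration; they are not the study's data or code.

```python
from collections import defaultdict


def accuracy_by_explanation_correctness(records):
    """Return human accuracy split by whether the shown explanation was correct.

    Each record is a hypothetical dict with two boolean keys:
      'explanation_correct': the explanation's verdict matched the gold label
      'human_correct':       the annotator's verdict matched the gold label
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        key = "explanation_correct" if r["explanation_correct"] else "explanation_wrong"
        totals[key] += 1
        hits[key] += int(r["human_correct"])
    return {k: hits[k] / totals[k] for k in totals}


# Toy records, invented for illustration only (not from the study).
toy = [
    {"explanation_correct": True, "human_correct": True},
    {"explanation_correct": True, "human_correct": True},
    {"explanation_correct": False, "human_correct": False},
    {"explanation_correct": False, "human_correct": True},
]
result = accuracy_by_explanation_correctness(toy)
print(result)
```

A large gap between the two buckets is the signature of over-reliance: users follow the explanation whether or not it is right.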
Results
Accuracy
Time per claim
Effect of contrastive explanation on wrong-explanation cases
Retriever top-10 full-recall
Who Should Care
What To Try In 7 Days
Ground LLM explanations on retrieved passages before showing them to users.
Use retrieval (top passages) as default for high-stakes verification workflows.
Pilot contrastive prompts (support + refute) for triage where users can inspect both sides.
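As a starting point for the first and third items above, the two prompt styles might be assembled as below. This is a sketch under stated assumptions: the function names and prompt wording are hypothetical and are not the prompts used in the study. The code only builds strings, so it can be passed to whichever LLM client a team already uses.

```python
def grounded_prompt(claim, passages):
    """Build a prompt asking the model to judge a claim using only the given passages.

    Hypothetical wording; not the study's actual prompt.
    """
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Use only the evidence passages below to decide whether the claim is "
        "true or false, then explain your verdict.\n\n"
        f"Evidence:\n{numbered}\n\n"
        f"Claim: {claim}\n"
        "Verdict and explanation:"
    )


def contrastive_prompts(claim):
    """Return two prompts: one asking for a supporting case, one for a refuting case."""
    template = "Claim: {claim}\nExplain why this claim might be {stance}."
    return {
        "support": template.format(claim=claim, stance="true"),
        "refute": template.format(claim=claim, stance="false"),
    }


# Example usage with placeholder text.
prompt = grounded_prompt("X is Y.", ["first passage", "second passage"])
pair = contrastive_prompts("X is Y.")
print(prompt)
print(pair["support"])
print(pair["refute"])
```

Showing both outputs of `contrastive_prompts` side by side is what lets a user weigh the supporting and refuting cases rather than accept a single verdict.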
Agent Features
Tool Use
- retrieval-grounded prompting
Collaboration
- human-in-the-loop verification
Reproducibility
Data Urls
- FoolMeTwice (Eisenschlos et al., 2021); Wikipedia snapshots used for retrieval
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Small participant pool recruited on Prolific (16 annotators per condition); results may not generalize to expert fact-checkers.
- Single model checkpoint (GPT-3.5-turbo-0613) and time-limited API snapshot.
- Static explanations only; no personalization or adaptive prompting tested.
- Study uses a specific adversarial dataset (FoolMeTwice), which emphasizes hard claims.
When Not To Use
- High-stakes verification where mistaken LLM explanations could cause harm
- Workflows with low retrieval recall (missing evidence in top-10)
- Replacing raw source inspection with LLM prose without grounding
Failure Modes
- Users adopt LLM answers verbatim even when explanations are factually wrong
- LLM generates convincing but incorrect supporting/refuting rationales (hallucinations)
- Contrastive views can still be misleading if both sides contain plausible errors
- Combining retrieval and explanations can slow users without improving accuracy
Core Entities
Models
- gpt-3.5-turbo (GPT-3.5-turbo-0613)
Metrics
- Accuracy
- time per claim
- retrieval full-recall
- user confidence
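The retrieval full-recall metric listed above can be computed roughly as follows. This sketch assumes "full recall" means that every gold evidence passage for a claim appears in the retriever's top-k results. That definition, the field names (`retrieved`, `gold`), and the toy data are assumptions made for illustration.

```python
def full_recall_rate(examples, k=10):
    """Fraction of claims whose gold evidence lies entirely within the top-k retrieved passages.

    Each example is a hypothetical dict:
      'retrieved': ranked list of passage ids returned by the retriever
      'gold':      collection of passage ids that contain the gold evidence
    """
    if not examples:
        return 0.0
    covered = sum(
        1 for ex in examples if set(ex["gold"]) <= set(ex["retrieved"][:k])
    )
    return covered / len(examples)


# Toy examples, invented for illustration only.
examples = [
    {"retrieved": ["p1", "p2", "p3"], "gold": ["p1", "p3"]},  # all gold evidence retrieved
    {"retrieved": ["p4", "p5"], "gold": ["p4", "p9"]},        # p9 is missing
]
rate = full_recall_rate(examples)
print(rate)
```

Tracking this rate alongside accuracy helps separate failures caused by missing evidence from failures caused by how the evidence was read.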
Datasets
- FoolMeTwice
- Wikipedia (retrieved passages)
Context Entities
Models
- GPT-3.5 family (chat completions)
Metrics
- Accuracy
- time and confidence calibration
Datasets
- FoolMeTwice (adversarial claims)
- Wikipedia passages used for grounding

