LLM explanations speed up fact-checking but cause dangerous over-reliance when they are wrong

October 19, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.3

Cost Impact Score

0.4

Citation Count

6

Authors

Chenglei Si, Navita Goyal, Sherry Tongshuang Wu, Chen Zhao, Shi Feng, Hal Daumé, Jordan Boyd-Graber

Links

Abstract / PDF

Why It Matters For Business

LLM explanations let teams verify claims much faster but can mislead people when wrong; for important decisions, prioritize retrieval-grounded workflows or add checks to avoid over-reliance.

Summary TLDR

A human study (1,500 annotations, 80 workers) compares ChatGPT explanations and Wikipedia retrieval for fact-checking hard claims. Explanations and retrieved passages yield similar human accuracy (~74% and 73%, against a 59% baseline with no help), but explanations are much faster (~1.0 min vs ~2.5 min per claim). Users heavily over-rely on LLM explanations when those explanations are wrong (human accuracy falls to 35%). Contrastive explanations reduce that over-reliance (raising accuracy to 56% on those cases) but do not beat retrieval overall. Grounding explanations on retrieved passages improves model accuracy (59.5% → 78%).

Problem Statement

People use LLMs and search results to check claims. We need to know which tool helps humans verify facts more accurately and whether LLM explanations help or hurt real users.

Main Contribution

Large human study comparing ChatGPT free-text explanations vs top-10 Wikipedia passages for fact verification on adversarial claims.

Measured speed and accuracy trade-offs: explanations speed decisions but encourage over-reliance when wrong.

Introduced and evaluated contrastive explanations (support + refute) and a retrieval+explanation setup.

Showed that grounding LLM prompts on retrieved passages boosts explanation accuracy (59.5% → 78%).

Analyzed retrieval recall effects, confidence calibration, and user rationales to explain failure modes.

Key Findings

ChatGPT explanations and retrieved Wikipedia passages both improve human accuracy over no help.

Numbers: Explanation 74% ±0.09 vs Retrieval 73% ±0.12 vs Baseline 59% ±0.12

Reading LLM explanations is ~2.5× faster than reading retrieved passages.

Numbers: Explanation 1.01 ±0.45 min vs Retrieval 2.53 ±1.07 min per claim

Users over-rely on LLM explanations when those explanations are wrong, causing large accuracy drops.

Numbers: When explanation wrong: human accuracy 35% ±0.22 vs Retrieval 54% ±0.26 and Baseline 49% ±0.24

Contrastive explanations reduce over-reliance on wrong explanations but can lower accuracy when the non-contrastive explanation is correct.

Numbers: If non-contrastive wrong: accuracy 35% → 56%; if non-contrastive correct: 87% → 73%
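
The trade-off in this finding can be made concrete with a little arithmetic. The sketch below is not a calculation from the paper: it assumes the four per-case human accuracies quoted above stay fixed regardless of how often the explanation is right, and asks at what share of correct explanations the two formats break even. The function and variable names are illustrative only.

```python
# Per-case human accuracies quoted in the finding above.
SINGLE = {"correct": 0.87, "wrong": 0.35}       # non-contrastive explanation
CONTRASTIVE = {"correct": 0.73, "wrong": 0.56}  # support + refute explanation


def expected_accuracy(p_correct, per_case):
    """Expected human accuracy when a share p_correct of explanations is correct.

    Assumption (not from the paper): per-case accuracies do not depend on p_correct.
    """
    return p_correct * per_case["correct"] + (1 - p_correct) * per_case["wrong"]


def break_even_share():
    """Share of correct explanations at which both formats give equal accuracy."""
    gain_when_wrong = CONTRASTIVE["wrong"] - SINGLE["wrong"]
    loss_when_correct = SINGLE["correct"] - CONTRASTIVE["correct"]
    return gain_when_wrong / (gain_when_wrong + loss_when_correct)


if __name__ == "__main__":
    print(f"break-even share of correct explanations: {break_even_share():.2f}")
    for p in (0.5, 0.8):
        single = expected_accuracy(p, SINGLE)
        contrastive = expected_accuracy(p, CONTRASTIVE)
        print(f"p={p}: single={single:.3f}, contrastive={contrastive:.3f}")
```

Under these assumptions, contrastive explanations only come out ahead when the underlying explanations are correct less often than the break-even share, which is consistent with the finding that they help on wrong-explanation cases but do not win overall.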

Grounding ChatGPT on retrieved passages substantially increases its answer accuracy.

Numbers: Ungrounded explanation 59.5% → Grounded 78.0% accuracy

Retrieval quality strongly affects both explanation and human accuracy.

Numbers: Top-10 retrieval reaches full recall (contains all evidence) for 81.5% of claims; explanation accuracy 80.4% (r=1) vs 67.6% (r=0)
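
As a rough illustration of the full-recall metric, the sketch below treats r as a binary indicator: 1 when every gold evidence passage appears in the top-k retrieved passages, 0 otherwise. That reading of r, the function names, and the toy passage IDs are assumptions for illustration, not details taken from the paper.

```python
def full_recall(retrieved_ids, gold_ids):
    """Return 1 if all gold evidence passages are among the retrieved ones, else 0."""
    return int(set(gold_ids) <= set(retrieved_ids))


def full_recall_rate(examples, k=10):
    """Fraction of claims whose top-k retrieved passages cover all gold evidence."""
    hits = [full_recall(ex["retrieved"][:k], ex["gold"]) for ex in examples]
    return sum(hits) / len(hits)


# Toy data with made-up passage IDs.
examples = [
    {"retrieved": ["p1", "p2", "p3"], "gold": ["p2"]},        # all evidence found
    {"retrieved": ["p4", "p5"], "gold": ["p5", "p9"]},        # "p9" is missing
]

if __name__ == "__main__":
    print(full_recall_rate(examples))
```

Splitting claims by this indicator is one way to reproduce the kind of r=1 vs r=0 comparison reported above on your own retrieval stack.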

Combining retrieval and explanation provides no clear complementary benefit over retrieval alone and increases time.

Numbers: Retrieval time 2.5 ±1.1 min vs Retrieval+Explanation 2.7 ±1.0 min; no significant accuracy gain

Results

Accuracy (human, with LLM explanation)

Value: 0.74 ±0.09

Baseline: 0.59 ±0.12

Accuracy (human, with retrieval)

Value: 0.73 ±0.12

Baseline: 0.59 ±0.12

Time per claim

Value: Explanation 1.01 ±0.45 min; Retrieval 2.53 ±1.07 min

Accuracy (ChatGPT, grounded on retrieved passages)

Value: 0.78

Baseline: Ungrounded 0.595

Accuracy (human, when the explanation is wrong)

Value: 0.35 ±0.22

Baseline: No help 0.49 ±0.24; Retrieval 0.54 ±0.26

Effect of contrastive explanation on wrong-explanation cases

Value: 0.56 ±0.24

Baseline: Non-contrastive wrong 0.35 ±0.22

Retriever top-10 full-recall

Value: 0.815

Accuracy (explanation, by retrieval recall)

Value: 80.4% (r=1) vs 67.6% (r=0)

Who Should Care

What To Try In 7 Days

Ground LLM explanations on retrieved passages before showing them to users.

Use retrieval (top passages) as default for high-stakes verification workflows.

Pilot contrastive prompts (support + refute) for triage where users can inspect both sides.
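
A minimal sketch of what the first and third items could look like in practice. The prompt wording and the function names `grounded_prompt` and `contrastive_prompts` are hypothetical; they are not the prompts used in the study, and the resulting strings would still need to be sent to whichever chat model you use.

```python
def grounded_prompt(claim, passages):
    """Build a prompt that asks the model to judge a claim using only retrieved passages."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Use only the evidence below to judge the claim.\n\n"
        f"Evidence:\n{numbered}\n\n"
        f"Claim: {claim}\n"
        "Answer True or False, then explain, citing passage numbers."
    )


def contrastive_prompts(claim):
    """Build a pair of prompts: one arguing the claim is true, one arguing it is false."""
    return {
        "support": f"Claim: {claim}\nGive the strongest argument that this claim is true.",
        "refute": f"Claim: {claim}\nGive the strongest argument that this claim is false.",
    }


if __name__ == "__main__":
    # Illustrative claim and passages, not taken from the dataset.
    claim = "The Eiffel Tower was completed in 1889."
    passages = [
        "The Eiffel Tower was built from 1887 to 1889.",
        "It was the entrance arch to the 1889 World's Fair.",
    ]
    print(grounded_prompt(claim, passages))
    for side, prompt in contrastive_prompts(claim).items():
        print(f"--- {side} ---\n{prompt}")
```

Showing the numbered passages alongside the model's answer lets reviewers check cited evidence directly instead of trusting the prose alone.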

Agent Features

Tool Use

  • retrieval-grounded prompting

Collaboration

  • human-in-the-loop verification

Reproducibility

Data URLs

  • FoolMeTwice (Eisenschlos et al., 2021); Wikipedia snapshots used for retrieval

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Limited participant pool (Prolific) and 16 annotators per condition; may not generalize to experts.
  • Single model checkpoint (GPT-3.5-turbo-0613) and time-limited API snapshot.
  • Static explanations only; no personalization or adaptive prompting tested.
  • Study uses a specific adversarial dataset (FoolMeTwice), which emphasizes hard claims.

When Not To Use

  • High-stakes verification where mistaken LLM explanations could cause harm
  • Workflows with low retrieval recall (missing evidence in top-10)
  • Replacing raw source inspection with LLM prose without grounding

Failure Modes

  • Users adopt LLM answers verbatim even when explanations are factually wrong
  • LLM generates convincing but incorrect supporting/refuting rationales (hallucinations)
  • Contrastive views can still be misleading if both sides contain plausible errors
  • Combining retrieval and explanations can slow users without improving accuracy

Core Entities

Models

  • gpt-3.5-turbo (GPT-3.5-turbo-0613)

Metrics

  • Accuracy
  • time per claim
  • retrieval full-recall
  • user confidence

Datasets

  • FoolMeTwice
  • Wikipedia (retrieved passages)

Context Entities

Models

  • GPT-3.5 family (chat completions)

Metrics

  • Accuracy
  • time and confidence calibration

Datasets

  • FoolMeTwice (adversarial claims)
  • Wikipedia passages used for grounding