LLM agents can help fact-checking but need context, translations, and human oversight

October 20, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

2

Authors

Dorian Quelle, Alexandre Bovet

Why It Matters For Business

LLM agents can speed up fact-check workflows and triage clear misinformation, but they remain fallible on nuanced and multilingual claims, so human oversight is required.

Summary TLDR

The authors evaluate GPT-3.5 and GPT-4 as fact-checking agents that search the web (via Google) and explain their reasoning. Contextual evidence (search results) boosts accuracy, GPT-4 outperforms GPT-3.5, and translating non-English claims into English usually improves results. The models handle clearly false claims well but struggle with ambiguous "half-true" or graded labels. Data leakage from training data is a risk, but these tests showed no drop in performance on claims dated after the training cutoff. The agent design records queries and cites sources so humans can audit verdicts.

Problem Statement

Manual fact-checking is slow and cannot scale. The paper asks: can LLMs (GPT-3.5, GPT-4) act as fact-checking agents that retrieve web evidence, explain their reasoning, and reliably match human verdicts across languages?

Main Contribution

Built an LLM agent, using the ReAct framework and LangChain, that iteratively queries Google, gathers result previews, and cites its sources.

Systematic comparison of GPT-3.5 vs GPT-4 on PolitiFact (3,000 sampled claims) and a multilingual Data Commons dump, with and without web context.

Showed translation to English often raises accuracy for non-English claims and examined time trends to probe data-leakage concerns.

Key Findings

Providing web context improves fact-check accuracy.

Numbers: Context raises accuracy to >80% for clear cases; no-context 63–75% avg

GPT-4 outperforms GPT-3.5 for fact-checking.

Numbers: Context boosts true-category accuracy for GPT-4 by ~10.2 percentage points on average

Translating non-English claims to English usually helps.

Numbers: Example: Chinese accuracy 43.75% -> 64.86% after translation (+21.11 pp)

Models struggle with ambiguous/granular verdicts.

Numbers: Very low exact-match scores for intermediate labels (e.g., 'half-true') in Table 1
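The gap between exact-match and coarse scoring can be illustrated with a short sketch. The label names, the `COARSE` mapping (including the choice to exclude 'half-true' from coarse scoring), and the sample data are assumptions made for illustration; this is not the paper's evaluation code.

```python
# Hypothetical mapping from a six-point verdict scale to a coarse verdict.
# Excluding "half-true" from coarse scoring is an assumption, not the paper's rule.
COARSE = {
    "true": "true",
    "mostly-true": "true",
    "half-true": None,
    "mostly-false": "false",
    "false": "false",
    "pants-fire": "false",
}


def exact_accuracy(gold, pred):
    """Share of claims where the predicted label equals the gold label exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def coarse_accuracy(gold, pred):
    """Share of non-ambiguous claims where the coarse verdicts agree."""
    pairs = [(COARSE[g], COARSE[p]) for g, p in zip(gold, pred)]
    scored = [(g, p) for g, p in pairs if g is not None]
    if not scored:
        return 0.0
    return sum(g == p for g, p in scored) / len(scored)


# Made-up labels: the model is directionally right but rarely exact.
gold = ["false", "mostly-false", "half-true", "true"]
pred = ["pants-fire", "false", "mostly-true", "true"]

print(exact_accuracy(gold, pred))   # 0.25
print(coarse_accuracy(gold, pred))  # 1.0
```

In this toy data the model lands on the right side of the scale every time it is scored, yet matches the exact label only once, which is the pattern the finding describes.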

Results

Accuracy

Value: 63–75% (average across models/datasets as reported)

Accuracy

Value: 80–89% (for non-ambiguous verdicts with context)

Baseline: 63–75% (no-context)

Translation impact (example)

Value: Chinese: 43.75% -> 64.86% (+21.11 pp)

Baseline: original-language accuracy

Accuracy

Value: ~+10.19 percentage points on average

Baseline: GPT-3.5 (context)

Who Should Care

What To Try In 7 Days

Prototype a workflow where an LLM agent fetches web previews and returns a coarse true/false verdict for triage.

Add a translation-to-English step for non-English inputs before sending them to the model, and measure the accuracy lift.

Log the model's queries and cited domains so editors can quickly audit and correct verdicts.
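One way to make the logging step concrete is a small audit record per claim. This is a minimal sketch: the `AuditRecord` class, its field names, and the example URLs are hypothetical and not taken from the paper.

```python
import json
from dataclasses import dataclass, field, asdict
from urllib.parse import urlparse


@dataclass
class AuditRecord:
    """Hypothetical per-claim log entry for editors to review."""
    claim: str
    verdict: str
    queries: list = field(default_factory=list)
    cited_urls: list = field(default_factory=list)

    def cited_domains(self):
        # Reduce full URLs to unique, sorted host names for quick scanning.
        domains = {urlparse(url).netloc.lower() for url in self.cited_urls}
        return sorted(d for d in domains if d)

    def to_json(self):
        payload = asdict(self)
        payload["cited_domains"] = self.cited_domains()
        return json.dumps(payload, ensure_ascii=False)


record = AuditRecord(
    claim="Example claim text",
    verdict="false",
    queries=["example claim fact check"],
    cited_urls=[
        "https://example.org/a",
        "https://example.org/b",
        "https://news.example.com/x",
    ],
)
print(record.to_json())
```

Writing one JSON line per claim keeps the log easy to filter by verdict or domain when an editor wants to spot-check the agent.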

Agent Features

Memory

  • short-term: accumulates search results per claim
  • no persistent long-term memory reported

Planning

  • iterative web search (up to 3 iterations)
  • decide to stop search and return verdict
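The planning behavior above can be sketched as a bounded loop. This is an illustration only: `run_agent`, the `search` and `decide` callables, and the "unverified" fallback are hypothetical stand-ins, not the paper's LangChain implementation; only the three-iteration cap comes from the summary.

```python
from typing import Callable, List, Optional, Tuple

MAX_ITERATIONS = 3  # the summary reports up to 3 search iterations


def run_agent(
    claim: str,
    search: Callable[[str], List[str]],
    decide: Callable[[str, List[str]], Tuple[Optional[str], str]],
):
    """Search, accumulate evidence, and stop once `decide` returns a verdict.

    `decide` returns (verdict, next_query); a verdict of None means
    "keep searching with next_query".
    """
    evidence: List[str] = []
    queries: List[str] = []
    query = claim
    for _ in range(MAX_ITERATIONS):
        queries.append(query)
        evidence.extend(search(query))
        verdict, next_query = decide(claim, evidence)
        if verdict is not None:
            return verdict, queries, evidence
        query = next_query
    # Assumed fallback when the budget runs out without a verdict.
    return "unverified", queries, evidence


# Stub tools so the sketch runs without network or model access.
def fake_search(query: str) -> List[str]:
    return [f"preview for: {query}"]


def fake_decide(claim: str, evidence: List[str]):
    if len(evidence) >= 2:
        return "false", ""
    return None, claim + " fact check"


verdict, queries, evidence = run_agent(
    "The moon is made of cheese", fake_search, fake_decide
)
print(verdict, queries)
```

Returning the queries and evidence alongside the verdict is what lets a human reviewer retrace how the agent reached its answer.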

Tool Use

  • Google Search API
  • BM25 for local distillation
  • LangChain for orchestration
  • ReAct for reasoning+action
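For readers unfamiliar with BM25, the sketch below ranks text snippets against a query using the standard Okapi BM25 formula with a non-negative IDF variant. How the paper applies BM25 is not detailed here; the whitespace tokenizer, the parameter defaults (`k1=1.5`, `b=0.75`), and the sample snippets are assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def bm25_rank(query, documents, k1=1.5, b=0.75):
    """Return document indices ordered from most to least relevant."""
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


snippets = [
    "large studies find vaccines do not cause autism",
    "the weather today is sunny",
    "stock markets fell sharply",
]
ranking = bm25_rank("vaccines autism studies", snippets)
print(ranking[0])  # 0
```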

Frameworks

  • ReAct
  • LangChain

Is Agentic

true

Architectures

  • Transformer LLM (GPT family)

Collaboration

  • returns reasoning and cited domains so humans can audit

Optimization Features

Token Efficiency

  • used a 16k-token context window and filtered site previews to fit it
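Fitting previews into a fixed context budget can be sketched as a greedy filter. The `fit_previews` function is hypothetical, it assumes the previews arrive sorted by relevance, and it approximates tokens by whitespace-separated words; a real system would use the model's own tokenizer.

```python
def fit_previews(previews, max_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily keep previews, in order, while the running total fits the budget."""
    kept, used = [], 0
    for preview in previews:
        cost = count_tokens(preview)
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; a shorter one may still fit
        kept.append(preview)
        used += cost
    return kept, used


previews = ["a b c", "d e f g h", "i j"]
kept, used = fit_previews(previews, max_tokens=5)
print(kept, used)  # ['a b c', 'i j'] 5
```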

Reproducibility

Data Urls

  • PolitiFact dataset (Misra, Politifact Fact Check Dataset 2022)
  • Data Commons fact-check dump (Data Commons)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Risk of data leakage: training data may contain prior fact-checks.
  • Performance varies widely by language and by how verdicts are labeled.
  • Poor at fine-grained categories like 'half-true' or 'mostly-false'.
  • Relies on Google previews; full-page content was too large for context window.
  • No public code or end-to-end deployment recipes provided.

When Not To Use

  • For fully automated high-stakes decisions without human review.
  • When verdict nuance matters (legal, medical, or finely graded claims).
  • In low-resource languages without reliable translation support.

Failure Modes

  • Gives incorrect fine-grained verdicts while sounding confident.
  • Biased or incomplete search results lead to wrong conclusions.
  • Fails on ambiguous claims that need deeper evidence synthesis.
  • May repeat prior fact-check labels due to training data memorization.

Core Entities

Models

  • GPT-3.5
  • GPT-4

Metrics

  • Accuracy
  • F1

Datasets

  • PolitiFact (sampled 3,000 claims)
  • Data Commons multilingual fact-check dump

Context Entities

Models

  • GPT-3.5
  • GPT-4

Metrics

  • Accuracy
  • F1 over languages

Datasets

  • PolitiFact
  • Data Commons fact-check dump