LLM agents can help fact-checking but need context, translations, and human oversight

October 20, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper provides clear experiments on public datasets and documents the agent's queries and citations, but results depend on Google search quality and translation; expect moderate engineering effort to deploy safely.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Dorian Quelle, Alexandre Bovet

Links

Abstract / PDF / Data

Why It Matters For Business

LLM agents can speed up fact-check workflows and triage clear misinformation, but they remain fallible on nuanced and multilingual claims, so human oversight is required.

Who Should Care

Summary TLDR

The authors evaluate GPT-3.5 and GPT-4 as fact-checking agents that search the web (via Google) and explain their reasoning. Contextual evidence (search results) boosts accuracy, GPT-4 outperforms GPT-3.5, and translating non-English claims into English usually improves results. Models handle clear false claims well but struggle with ambiguous "half-true" or graded labels. Data leakage from training data is a risk but did not cause a drop in post-cutoff performance in these tests. The agent design records queries and cites sources so humans can audit verdicts.

Problem Statement

Manual fact-checking is slow and cannot scale. The paper asks: can LLMs (GPT-3.5, GPT-4) act as fact-checking agents that retrieve web evidence, explain their reasoning, and reliably match human verdicts across languages?

Main Contribution

Built an LLM agent, using the ReAct framework and LangChain, that iteratively queries Google, gathers result previews, and cites its sources.

Systematic comparison of GPT-3.5 vs GPT-4 on PolitiFact (3,000 sampled claims) and a multilingual Data Commons dump, with and without web context.

Key Findings

Providing web context improves fact-check accuracy.

Numbers: Context raises accuracy to >80% for clear cases; no-context 63-75% avg

Practical Use: Always give LLMs retrieved evidence when fact-checking; expect substantially better results on clear true/false claims.

Evidence Ref: Abstract; Conclusion; Results (PolitiFact & multilingual tables)

GPT-4 outperforms GPT-3.5 for fact-checking.

Numbers: Context boosts true-category accuracy for GPT-4 by ~10.2 percentage points on average

Practical Use: Prefer GPT-4 (or newer) when accuracy matters; GPT-3.5 is cheaper but less reliable on nuanced cases.

Evidence Ref: Section 3.1 (Table 1; text noting 10.19 pp gain)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 63-75% (average across models/datasets as reported) | | | PolitiFact & Data Commons (no-context) | Conclusion: 'without contextual information, GPT-3.5 and GPT-4 demonstrate good performance of 63-75% accuracy on average' | Conclusion |
| Accuracy | 80-89% (for non-ambiguous verdicts with context) | 63-75% (no-context) | +~15-25 pp (varies by case) | PolitiFact (context condition) | Conclusion: 'improves to above 80% and 89% for non-ambiguous verdicts when context is incorporated' | Conclusion; Section 3.1 |

What To Try In 7 Days

Prototype a workflow where an LLM agent fetches web previews and returns a coarse true/false verdict for triage.

Add translation-to-English step for non-English inputs before sending to the model and measure accuracy lift.

Log the model's queries and cited domains so editors can quickly audit and correct verdicts.
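The three steps above could be wired together roughly as follows. This is a minimal sketch, not the paper's code: `translate` and `check_claim` are hypothetical callables standing in for whatever translation service and LLM agent you use, and the `AuditRecord` field names are illustrative assumptions.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Callable, List, Tuple
from urllib.parse import urlparse


@dataclass
class AuditRecord:
    """One auditable triage decision (field names are illustrative)."""
    original_claim: str
    english_claim: str
    verdict: str  # coarse label, e.g. "true", "false", or "unsure"
    queries: List[str] = field(default_factory=list)
    cited_domains: List[str] = field(default_factory=list)


def domains_of(urls: List[str]) -> List[str]:
    """Reduce cited URLs to unique domains, keeping first-seen order."""
    seen: List[str] = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host and host not in seen:
            seen.append(host)
    return seen


def triage(
    claim: str,
    language: str,
    translate: Callable[[str, str], str],
    check_claim: Callable[[str], Tuple[str, List[str], List[str]]],
) -> AuditRecord:
    # Translate non-English claims to English before calling the model.
    english = claim if language == "en" else translate(claim, language)
    # check_claim is assumed to return (verdict, queries issued, cited URLs).
    verdict, queries, urls = check_claim(english)
    return AuditRecord(claim, english, verdict, list(queries), domains_of(urls))


# Stubs so the sketch runs without any external service.
def fake_translate(text: str, source_language: str) -> str:
    return "The Earth is flat."


def fake_check_claim(english_claim: str):
    urls = [
        "https://www.nasa.gov/earth",
        "https://nasa.gov/faq",
        "https://en.wikipedia.org/wiki/Earth",
    ]
    return "false", ["is the earth flat"], urls


record = triage("La Tierra es plana.", "es", fake_translate, fake_check_claim)
print(json.dumps(asdict(record), ensure_ascii=False))
```

Writing one such record per claim (for example as a line of JSON) gives editors the original wording, the translated wording, the queries, and the cited domains in a single place, and makes it straightforward to compare accuracy with and without the translation step.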

Agent Features

Memory
short-term: accumulates search results per claim; no persistent long-term memory reported
Planning
iterative web search (up to 3 iterations); decide to stop search and return verdict
Tool Use
Google Search API; BM25 for local distillation; LangChain for orchestration; ReAct for reasoning+action
Frameworks
ReAct, LangChain
Is Agentic

Yes

Architectures
Transformer LLM (GPT family)
Collaboration
returns reasoning and cited domains so humans can audit
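The planning behaviour listed above (search, read previews, then either search again or stop, for at most three rounds) can be sketched as a bounded loop. This is not the authors' LangChain implementation: `search` and `decide` are hypothetical callables standing in for the Google Search API and the LLM, and the `Step` structure is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MAX_ITERATIONS = 3  # the summary reports up to 3 search iterations


@dataclass
class Step:
    """What the model wants next: another query, or a final verdict."""
    query: Optional[str] = None
    verdict: Optional[str] = None


def run_agent(
    claim: str,
    search: Callable[[str], List[str]],
    decide: Callable[[str, List[str], bool], Step],
    max_iterations: int = MAX_ITERATIONS,
) -> Tuple[str, List[str], List[str]]:
    """Return (verdict, queries issued, evidence previews gathered)."""
    queries: List[str] = []
    evidence: List[str] = []  # short-term memory, scoped to this one claim
    for _ in range(max_iterations):
        step = decide(claim, evidence, False)
        if step.verdict is not None:  # the model chose to stop searching
            return step.verdict, queries, evidence
        if step.query is None:  # neither a query nor a verdict was given
            break
        queries.append(step.query)
        evidence.extend(search(step.query))
    # Search budget used up: ask for a verdict on what has been gathered.
    final = decide(claim, evidence, True)
    return (final.verdict or "unsure"), queries, evidence


# Stubs so the sketch runs offline.
def fake_search(query: str) -> List[str]:
    return [f"preview for: {query}"]


def fake_decide(claim: str, evidence: List[str], must_answer: bool) -> Step:
    if must_answer or len(evidence) >= 2:
        return Step(verdict="false")
    return Step(query=f"{claim} fact check {len(evidence) + 1}")


verdict, queries, evidence = run_agent("Claim X", fake_search, fake_decide)
print(verdict, queries)
```

Returning the queries and evidence alongside the verdict is what makes the collaboration feature possible: a human reviewer can see exactly what the agent searched for before it reached its conclusion.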

Optimization Features

Token Efficiency
used 16k context window and filtered site previews to fit context
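One plausible reading of "BM25 for local distillation" combined with filtering to fit the context window is: rank the retrieved passages against the claim with BM25 and keep the best ones until a size budget is reached. The sketch below assumes that reading. The regex tokenizer, the word-count budget (a rough proxy for tokens), and the parameters `k1=1.5` and `b=0.75` are common defaults chosen for illustration, not values reported in the paper.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of each document against the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant (the "+ 1" inside the log).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def select_passages(claim: str, passages: List[str], word_budget: int) -> List[str]:
    """Keep the highest-scoring passages that fit within word_budget."""
    scores = bm25_scores(claim, passages)
    ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    kept, used = [], 0
    for i in ranked:
        size = len(passages[i].split())
        if scores[i] > 0 and used + size <= word_budget:
            kept.append(passages[i])
            used += size
    return kept


passages = [
    "The city council approved the new budget on Tuesday.",
    "Vaccines do not cause autism, according to large cohort studies.",
    "A recipe for sourdough bread with a long fermentation.",
]
claim = "Vaccines cause autism"
kept = select_passages(claim, passages, word_budget=12)
print(kept)
```

Passages that share no terms with the claim score zero and are dropped, so only lexically relevant text competes for space in the prompt. A real deployment would count model tokens rather than words.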

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

PolitiFact dataset (Misra, Politifact Fact Check Dataset 2022); Data Commons fact-check dump (Data Commons)

Risks & Boundaries

Limitations

Risk of data leakage: training data may contain prior fact-checks.

Performance varies widely by language and by how verdicts are labeled.
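Because performance varies by language and by verdict label, a single aggregate accuracy can hide weak slices. A minimal sketch of a per-slice breakdown follows; the rows are made-up toy data for illustration only, not results from the paper, and the field names are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by(rows: List[Dict[str, str]], field: str) -> Dict[str, float]:
    """Accuracy of predicted vs gold labels, grouped by the given field."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row[field]] += 1
        hits[row[field]] += int(row["gold"] == row["predicted"])
    return {key: hits[key] / totals[key] for key in totals}


# Toy rows (invented for this example, not taken from the paper).
toy_rows = [
    {"language": "en", "gold": "false", "predicted": "false"},
    {"language": "en", "gold": "half-true", "predicted": "true"},
    {"language": "es", "gold": "false", "predicted": "false"},
    {"language": "es", "gold": "true", "predicted": "false"},
    {"language": "es", "gold": "true", "predicted": "true"},
]

by_language = accuracy_by(toy_rows, "language")
by_label = accuracy_by(toy_rows, "gold")
print(by_language)
print(by_label)
```

Reporting both breakdowns side by side makes it easier to spot whether errors cluster in a particular language or in graded labels such as "half-true".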

When Not To Use

For fully automated high-stakes decisions without human review.

When verdict nuance matters (legal, medical, or finely graded claims).

Failure Modes

Gives incorrect fine-grained verdicts while sounding confident.

Biased or incomplete search results lead to wrong conclusions.

Core Entities

Models

GPT-3.5, GPT-4

Metrics

Accuracy, F1

Datasets

PolitiFact (sampled 3,000 claims); Data Commons multilingual fact-check dump

Context Entities

Models

GPT-3.5, GPT-4

Metrics

Accuracy, F1 over languages

Datasets

PolitiFact; Data Commons fact-check dump