LLM agents can help fact-checking but need context, translations, and human oversight

Overview

Decision Snapshot: Needs Validation

The paper reports clear experiments on public datasets and documents the agent's queries and citations, but its results depend on Google search quality and on translation. Expect moderate engineering effort to deploy safely.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Dorian Quelle, Alexandre Bovet

Links

Abstract / PDF / Data

Why It Matters For Business

LLM agents can speed up fact-check workflows and triage clear misinformation, but they remain fallible on nuanced and multilingual claims, so human oversight is required.

Who Should Care

Product Manager, CEO, ML Engineer, Founder, Data Scientist

Summary TLDR

The authors evaluate GPT-3.5 and GPT-4 as fact-checking agents that search the web (via Google) and explain their reasoning. Contextual evidence (search results) boosts accuracy, GPT-4 outperforms GPT-3.5, and translating non-English claims into English usually improves results. Models handle clear false claims well but struggle with ambiguous "half-true" or graded labels. Leakage of prior fact-checks into the training data is a risk, but performance did not drop on post-cutoff claims in these tests. The agent design records queries and cites sources so humans can audit verdicts.

Problem Statement

Manual fact-checking is slow and cannot scale. The paper asks: can large LLMs (GPT-3.5, GPT-4) act as fact-checking agents that retrieve web evidence, explain reasoning, and match human verdicts reliably across languages?

Main Contribution

Built an LLM agent with the ReAct framework and LangChain that iteratively queries Google, gathers previews, and cites sources.

Systematic comparison of GPT-3.5 vs GPT-4 on PolitiFact (3,000 sampled claims) and a multilingual Data Commons dump, with and without web context.

Key Findings

Providing web context improves fact-check accuracy.

Numbers: Context raises accuracy to >80% for clear cases; no-context accuracy averages 63–75%

Practical Use: Always give LLMs retrieved evidence when fact-checking; expect substantially better results on clear true/false claims.

Evidence Ref: Abstract; Conclusion; Results (PolitiFact & multilingual tables)

GPT-4 outperforms GPT-3.5 for fact-checking.

Numbers: Context boosts true-category accuracy for GPT-4 by ~10.2 percentage points on average

Practical Use: Prefer GPT-4 (or newer) when accuracy matters; GPT-3.5 is cheaper but less reliable on nuanced cases.

Evidence Ref: Section 3.1 (Table 1; text noting 10.19 pp gain)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	63–75% (average across models/datasets as reported)	—	—	PolitiFact & Data Commons (no-context)	Conclusion: 'without contextual information, GPT-3.5 and GPT-4 demonstrate good performance of 63-75% accuracy on average'	Conclusion
Accuracy	80–89% (for non-ambiguous verdicts with context)	63–75% (no-context)	+~15–25 pp (varies by case)	PolitiFact (context condition)	Conclusion: 'improves to above 80% and 89% for non-ambiguous verdicts when context is incorporated'	Conclusion; Section 3.1
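The deltas in the table are differences between accuracies, expressed in percentage points. A minimal sketch of that arithmetic follows. The function names, the coarse label mapping, and the toy labels are assumptions made up for illustration; they are not the paper's evaluation code or data.

```python
# Hypothetical mapping from graded labels to coarse verdicts (an assumption for
# illustration). "half-true" maps to None, i.e. it is treated as ambiguous.
COARSE = {
    "true": "true",
    "mostly-true": "true",
    "half-true": None,
    "mostly-false": "false",
    "false": "false",
    "pants-fire": "false",
}


def accuracy(predicted, gold):
    """Fraction of positions where predicted equals gold."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


def coarse_accuracy(predicted, gold):
    """Accuracy on coarse verdicts, skipping claims whose gold label is ambiguous."""
    pairs = [
        (COARSE.get(p), COARSE[g])
        for p, g in zip(predicted, gold)
        if COARSE[g] is not None
    ]
    return accuracy([p for p, _ in pairs], [g for _, g in pairs])


def delta_pp(with_context, without_context):
    """Difference between two accuracies, in percentage points."""
    return round((with_context - without_context) * 100, 2)


# Toy labels, made up purely to exercise the functions.
gold = ["false", "true", "half-true", "pants-fire", "mostly-true"]
no_context = ["false", "half-true", "true", "false", "false"]
with_context = ["false", "mostly-true", "true", "false", "true"]

base = coarse_accuracy(no_context, gold)
ctx = coarse_accuracy(with_context, gold)
print(f"no context: {base:.2f}, with context: {ctx:.2f}, delta: {delta_pp(ctx, base)} pp")
```

Reporting the delta in percentage points rather than as a relative change keeps it directly comparable with the ranges quoted in the table.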

What To Try In 7 Days

Prototype a workflow where an LLM agent fetches web previews and returns a coarse true/false verdict for triage.

Add translation-to-English step for non-English inputs before sending to the model and measure accuracy lift.

Log the model's queries and cited domains so editors can quickly audit and correct verdicts.
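The logging step could look roughly like the sketch below, which writes one JSON line per fact-check run. The `AuditRecord` structure and its field names are hypothetical and chosen for illustration; only the Python standard library is used.

```python
import json
from dataclasses import dataclass, field, asdict
from urllib.parse import urlparse


@dataclass
class AuditRecord:
    """One fact-check run, stored so an editor can review it later (hypothetical schema)."""
    claim: str
    verdict: str
    queries: list = field(default_factory=list)
    cited_urls: list = field(default_factory=list)

    def cited_domains(self):
        """Unique hostnames of the cited URLs, in first-seen order."""
        seen = []
        for url in self.cited_urls:
            host = urlparse(url).netloc.lower()
            if host and host not in seen:
                seen.append(host)
        return seen

    def to_json_line(self):
        """Serialize the record, including derived domains, as one JSON line."""
        payload = asdict(self)
        payload["cited_domains"] = self.cited_domains()
        return json.dumps(payload, ensure_ascii=False)


# Placeholder values; a real run would fill these from the agent's trace.
record = AuditRecord(
    claim="Example claim text",
    verdict="false",
    queries=["example claim fact check"],
    cited_urls=[
        "https://example.org/a",
        "https://example.org/b",
        "https://example.com/c",
    ],
)
print(record.to_json_line())
```

Storing the raw URLs alongside the derived domains lets an editor scan sources quickly and still open the exact page the agent cited.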

Agent Features

Memory

short-term: accumulates search results per claim

no persistent long-term memory reported

Planning

iterative web search (up to 3 iterations)

decide to stop search and return verdict
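The planning behaviour listed above (search, then decide whether to stop, capped at three iterations) can be sketched as a plain loop. This is a minimal sketch with injected callables, not the authors' LangChain/ReAct code. `search`, `decide`, `run_agent`, and the `Step` type are hypothetical names introduced for illustration, and the stand-in functions replace the real search tool and the LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_ITERATIONS = 3  # iteration cap as listed in this summary


@dataclass
class Step:
    """What the model wants next: another query, or a final verdict."""
    query: Optional[str] = None
    verdict: Optional[str] = None


def run_agent(
    claim: str,
    search: Callable[[str], List[str]],
    decide: Callable[[str, List[str]], Step],
    max_iterations: int = MAX_ITERATIONS,
):
    """Alternate between deciding and searching until a verdict or the cap is reached.

    Returns (verdict, queries, evidence); verdict is None if the decide
    function never commits to one.
    """
    queries: List[str] = []
    evidence: List[str] = []
    for _ in range(max_iterations):
        step = decide(claim, evidence)
        if step.verdict is not None:
            return step.verdict, queries, evidence
        if step.query is None:
            break  # nothing more to search for and no verdict
        queries.append(step.query)
        evidence.extend(search(step.query))
    # One final decision using everything gathered so far.
    final = decide(claim, evidence)
    return final.verdict, queries, evidence


# Stand-ins for the search tool and the LLM, so the loop can be run offline.
def fake_search(query):
    return [f"preview for: {query}"]


def fake_decide(claim, evidence):
    if len(evidence) < 2:
        return Step(query=f"{claim} evidence {len(evidence) + 1}")
    return Step(verdict="false")


verdict, queries, evidence = run_agent("example claim", fake_search, fake_decide)
print(verdict, queries)
```

Returning the query list together with the verdict keeps the trace available for human audit, in line with the collaboration feature described in this summary.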

Tool Use

Google Search API

BM25 for local distillation

LangChain for orchestration

ReAct for reasoning+action
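BM25 is listed above as the tool for distilling retrieved text locally. A minimal, self-contained sketch of Okapi BM25 scoring follows. The tokenizer, the parameter values `k1=1.5` and `b=0.75`, the helper names, and the sample passages are assumptions for illustration; the configuration actually used in the paper is not given in this summary.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, passages, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per passage for the given query."""
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = (sum(len(d) for d in docs) / n_docs) or 1.0
    # Document frequency: number of passages containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            norm = tf + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_passages(query, passages, limit=2):
    """Passages sorted by descending BM25 score, truncated to `limit`."""
    ranked = sorted(
        zip(bm25_scores(query, passages), passages),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [p for _, p in ranked[:limit]]


# Made-up passages standing in for retrieved page text.
passages = [
    "The senator voted against the tax bill in 2019.",
    "Local weather forecast for the weekend.",
    "Tax bill passed the senate after a close vote.",
]
query = "senator tax bill vote"
scores = bm25_scores(query, passages)
print(top_passages(query, passages))
```

Because BM25 is purely lexical, it runs locally without extra model calls, which is one plausible reason to use it for trimming retrieved text before prompting.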

Frameworks

ReAct, LangChain

Is Agentic

Yes

Architectures

Transformer LLM (GPT family)

Collaboration

returns reasoning and cited domains so humans can audit

Optimization Features

Token Efficiency

used a 16k-token context window and filtered site previews to fit it
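Fitting filtered previews into a fixed context window can be sketched as greedy packing under a token budget. The sketch below assumes the previews arrive already ranked and approximates token counts with a whitespace split; a real deployment would count tokens with the model's own tokenizer. The 16,000 default mirrors the 16k window mentioned above, while the reserve size and function names are arbitrary assumptions.

```python
def approx_tokens(text):
    """Crude token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())


def pack_previews(previews, context_window=16_000, reserved=2_000):
    """Keep ranked previews, in order, that fit in the budget left after the reserve.

    `reserved` stands for tokens set aside for the prompt and the answer.
    Returns (kept_previews, tokens_used).
    """
    budget = context_window - reserved
    if budget <= 0:
        raise ValueError("reserved tokens must be smaller than the context window")
    kept, used = [], 0
    for preview in previews:
        cost = approx_tokens(preview)
        if used + cost > budget:
            continue  # skip previews that do not fit; a shorter one later may still fit
        kept.append(preview)
        used += cost
    return kept, used


# Tiny made-up example with a small window so the effect is visible.
previews = ["a b c", "d e f g h", "i j"]
kept, used = pack_previews(previews, context_window=10, reserved=4)
print(kept, used)
```

Skipping an oversized preview instead of stopping at it lets shorter, lower-ranked previews still use the remaining budget; stopping at the first overflow would be the stricter alternative.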

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

PolitiFact dataset (Misra, Politifact Fact Check Dataset 2022)

Data Commons fact-check dump (Data Commons)

Risks & Boundaries

Limitations

Risk of data leakage: training data may contain prior fact-checks.

Performance varies widely by language and by how verdicts are labeled.

When Not To Use

For fully automated high-stakes decisions without human review.

When verdict nuance matters (legal, medical, or finely graded claims).

Failure Modes

Gives incorrect fine-grained verdicts while sounding confident.

Biased or incomplete search results lead to wrong conclusions.

Core Entities

Models

GPT-3.5, GPT-4

Metrics

Accuracy, F1

Datasets

PolitiFact (sampled 3,000 claims)

Data Commons multilingual fact-check dump

Context Entities

Models

GPT-3.5, GPT-4

Metrics

Accuracy, F1 over languages

Datasets

PolitiFact, Data Commons fact-check dump

