Overview
The paper reports clear experiments on public datasets and documents the agent's queries and citations, but results depend on Google search quality and on translation; expect moderate engineering effort to deploy safely.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
LLM agents can speed up fact-checking workflows and triage clear misinformation, but they remain fallible on nuanced and multilingual claims, so human oversight is required.
Who Should Care
Summary TLDR
The authors evaluate GPT-3.5 and GPT-4 as fact-checking agents that search the web (via Google) and explain their reasoning. Contextual evidence (search results) boosts accuracy, GPT-4 outperforms GPT-3.5, and translating non-English claims into English usually improves results. The models handle clearly false claims well but struggle with ambiguous labels such as "half-true" and other graded verdicts. Leakage of prior fact-checks into training data is a risk, but performance did not drop on post-cutoff claims in these tests. The agent design records queries and cites sources so humans can audit verdicts.
Problem Statement
Manual fact-checking is slow and cannot scale. The paper asks: can LLMs (GPT-3.5, GPT-4) act as fact-checking agents that retrieve web evidence, explain their reasoning, and match human verdicts reliably across languages?
Main Contribution
Built an LLM agent, using the ReAct framework and LangChain, that iteratively queries Google, gathers result previews, and cites its sources.
Systematically compared GPT-3.5 and GPT-4 on PolitiFact (3,000 sampled claims) and a multilingual Data Commons dump, with and without web context.
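
The iterative query-then-decide loop described above can be sketched in plain Python. This is a minimal illustration of a ReAct-style loop under stated assumptions, not the authors' implementation and not LangChain's API: `run_agent`, `Preview`, `Trace`, `fake_search`, and `fake_decide` are hypothetical names, and the search and model calls are replaced by stubs so the example runs offline.

```python
from dataclasses import dataclass, field


@dataclass
class Preview:
    """One search-result preview (hypothetical structure)."""
    url: str
    snippet: str


@dataclass
class Trace:
    """Everything an editor would need to audit a verdict."""
    queries: list = field(default_factory=list)
    sources: list = field(default_factory=list)
    verdict: str = "unverified"


def run_agent(claim, decide, search, max_steps=3):
    """Alternate between a model decision (reason) and a search call (act)."""
    trace = Trace()
    evidence = []
    for _ in range(max_steps):
        action, payload = decide(claim, evidence)
        if action == "search":
            # Record the query before acting so the trace stays auditable.
            trace.queries.append(payload)
            results = search(payload)
            evidence.extend(results)
            trace.sources.extend(p.url for p in results)
        else:
            # Any other action is treated as the final verdict.
            trace.verdict = payload
            break
    return trace


def fake_search(query):
    """Stub standing in for a real web search call."""
    return [Preview("https://example.org/a", f"preview for {query}")]


def fake_decide(claim, evidence):
    """Stub standing in for the LLM: search once, then answer."""
    if not evidence:
        return ("search", claim)
    return ("final", "false")


trace = run_agent("The moon is made of cheese", fake_decide, fake_search)
```

If the step budget runs out before the model commits to an answer, the verdict stays `"unverified"`, which gives a natural hook for handing the claim to a human.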
Key Findings
Providing web context improves fact-check accuracy.
GPT-4 outperforms GPT-3.5 for fact-checking.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 63–75% (average across models/datasets as reported) | — | — | PolitiFact & Data Commons (no-context) | Conclusion: 'without contextual information, GPT-3.5 and GPT-4 demonstrate good performance of 63-75% accuracy on average' | Conclusion |
| Accuracy | 80–89% (for non-ambiguous verdicts with context) | 63–75% (no-context) | +~15–25 pp (varies by case) | PolitiFact (context condition) | Conclusion: 'improves to above 80% and 89% for non-ambiguous verdicts when context is incorporated' | Conclusion; Section 3.1 |
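
The Delta column is a difference in percentage points between two conditions. As a rough sketch of how such a figure might be computed on your own data, the snippet below compares accuracy with and without context on the same labeled claims. The helper names `accuracy` and `delta_pp` are hypothetical, and the four-claim lists are made-up toy data, not results from the paper.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the reference labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def delta_pp(acc_with, acc_without):
    """Difference between two accuracies, in percentage points."""
    return round((acc_with - acc_without) * 100, 1)


# Toy data for illustration only.
labels = ["false", "true", "false", "false"]
no_context = ["false", "false", "true", "false"]    # 2 of 4 correct
with_context = ["false", "true", "true", "false"]   # 3 of 4 correct

lift = delta_pp(accuracy(with_context, labels), accuracy(no_context, labels))
```

Both conditions should be scored on the same claims and the same label set; comparing a no-context score over all verdicts against a with-context score over only non-ambiguous verdicts would overstate the lift.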
What To Try In 7 Days
Prototype a workflow where an LLM agent fetches web previews and returns a coarse true/false verdict for triage.
Add translation-to-English step for non-English inputs before sending to the model and measure accuracy lift.
Log the model's queries and cited domains so editors can quickly audit and correct verdicts.
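
The three steps above could be wired together roughly as follows. This is a sketch under stated assumptions: `triage`, `stub_translate`, and `stub_fact_check` are hypothetical names, the translation and fact-checking calls are stubs standing in for real services, and the result format (a dict with `verdict`, `queries`, and `sources`) is an assumption rather than anything specified in the paper.

```python
import json
from urllib.parse import urlparse


def triage(claim, lang, translate, fact_check):
    """Translate if needed, fact-check, and build an auditable record."""
    english = claim if lang == "en" else translate(claim, lang)
    result = fact_check(english)  # assumed keys: verdict, queries, sources
    # Keep only coarse verdicts; anything graded goes to a human editor.
    coarse = result["verdict"] if result["verdict"] in ("true", "false") else "needs_review"
    return {
        "original_claim": claim,
        "checked_text": english,
        "verdict": coarse,
        "queries": list(result["queries"]),
        "cited_domains": sorted({urlparse(u).netloc for u in result["sources"]}),
    }


def stub_translate(text, lang):
    """Stub standing in for a real translation service."""
    return "The Earth is flat"


def stub_fact_check(text):
    """Stub standing in for the LLM agent."""
    return {
        "verdict": "false",
        "queries": ["is the earth flat"],
        "sources": ["https://www.nasa.gov/earth", "https://www.nasa.gov/faq"],
    }


record = triage("La Tierra es plana", "es", stub_translate, stub_fact_check)
audit_line = json.dumps(record, ensure_ascii=False)  # one JSON line per claim
```

Writing one JSON line per claim keeps the log easy to search, and storing both the original and the translated text lets editors spot verdicts that went wrong because of the translation step.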
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Risks & Boundaries
Limitations
Risk of data leakage: training data may contain prior fact-checks.
Performance varies widely by language and by how verdicts are labeled.
When Not To Use
For fully automated high-stakes decisions without human review.
When verdict nuance matters (legal, medical, or finely graded claims).
Failure Modes
Gives incorrect fine-grained verdicts while sounding confident.
Biased or incomplete search results lead to wrong conclusions.

