Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
LLM agents can speed up fact-checking workflows and triage clear misinformation, but they remain fallible on nuanced and multilingual claims, so human oversight is required.
Summary TLDR
The authors evaluate GPT-3.5 and GPT-4 as fact-checking agents that search the web (via Google) and explain their reasoning. Contextual evidence (search results) boosts accuracy, GPT-4 outperforms GPT-3.5, and translating non-English claims into English usually improves results. The models handle clearly false claims well but struggle with ambiguous "half-true" or other graded labels. Leakage of earlier fact-checks into the training data is a risk, but accuracy did not drop on claims dated after the training cutoff in these tests. The agent records its queries and cites sources so humans can audit verdicts.
Problem Statement
Manual fact-checking is slow and cannot scale. The paper asks: can LLMs (GPT-3.5, GPT-4) act as fact-checking agents that retrieve web evidence, explain their reasoning, and match human verdicts reliably across languages?
Main Contribution
Built an LLM agent, using the ReAct framework and LangChain, that iteratively queries Google, gathers result previews, and cites its sources.
Compared GPT-3.5 and GPT-4 systematically on PolitiFact (3,000 sampled claims) and a multilingual Data Commons dump, with and without web context.
Showed translation to English often raises accuracy for non-English claims and examined time trends to probe data-leakage concerns.
Key Findings
Providing web context improves fact-check accuracy.
GPT-4 outperforms GPT-3.5 for fact-checking.
Translating non-English claims to English usually helps.
Models struggle with ambiguous/granular verdicts.
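One way to work around the weakness on graded verdicts is to collapse fine-grained labels into coarse ones and set aside the ambiguous middle. The sketch below is an illustration only: the mapping and the choice to treat "half-true" as ambiguous are assumptions, not the authors' evaluation protocol, and `coarsen` is a hypothetical helper name.

```python
from typing import Optional

# Assumed mapping from PolitiFact-style graded labels to coarse labels.
COARSE = {
    "true": "true",
    "mostly-true": "true",
    "mostly-false": "false",
    "false": "false",
    "pants-fire": "false",
}

# Labels treated as too ambiguous to collapse (an assumption for illustration).
AMBIGUOUS = {"half-true"}


def coarsen(label: str) -> Optional[str]:
    """Return "true" or "false", or None when the label is ambiguous or unknown."""
    key = label.strip().lower().replace(" ", "-")
    if key in AMBIGUOUS:
        return None
    return COARSE.get(key)
```

Claims for which `coarsen` returns None would go to a human reviewer rather than being scored automatically.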
Results
Metric: Accuracy
Translation impact (example), metric: Accuracy
Who Should Care
What To Try In 7 Days
Prototype a workflow where an LLM agent fetches web previews and returns a coarse true/false verdict for triage.
Add a translation-to-English step for non-English inputs before sending them to the model, and measure the accuracy lift.
Log the model's queries and cited domains so editors can quickly audit and correct verdicts.
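The three steps above could be prototyped roughly as follows. This is a sketch under stated assumptions: `translate` and `check` are hypothetical callables standing in for a translation service and the LLM agent, and the result dictionary shape (`verdict`, `queries`, `urls`) is invented for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List
from urllib.parse import urlparse


@dataclass
class AuditRecord:
    original_claim: str
    english_claim: str
    queries: List[str]
    cited_domains: List[str]
    verdict: str  # coarse label, e.g. "true" or "false"


def domains(urls: List[str]) -> List[str]:
    """Unique hostnames in first-seen order."""
    seen: List[str] = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host and host not in seen:
            seen.append(host)
    return seen


def triage(
    claim: str,
    translate: Callable[[str], str],  # hypothetical translation step
    check: Callable[[str], Dict],     # hypothetical agent call
) -> AuditRecord:
    """Translate, fact-check, and keep what an editor needs for an audit."""
    english = translate(claim)
    result = check(english)  # assumed keys: "verdict", "queries", "urls"
    return AuditRecord(
        original_claim=claim,
        english_claim=english,
        queries=list(result["queries"]),
        cited_domains=domains(result["urls"]),
        verdict=result["verdict"],
    )


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Share of predictions that match the human verdicts."""
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def to_log_line(record: AuditRecord) -> str:
    """Serialize one record as a JSON line for the audit log."""
    return json.dumps(asdict(record), ensure_ascii=False)
```

To estimate the translation lift, run `triage` once with an identity function as `translate` and once with the real translation step, then compare `accuracy` on the same labeled claims.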
Agent Features
Memory
- short-term: accumulates search results per claim
- no persistent long-term memory reported
Planning
- iterative web search (up to 3 iterations)
- decides when to stop searching and return a verdict
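The memory and planning behavior listed above can be sketched as a bounded loop. This is not the authors' LangChain/ReAct code: `search` and `decide` are hypothetical callables standing in for the search tool and the LLM step, and the "unverified" fallback label is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

MAX_ITERATIONS = 3  # search cap stated in the summary


@dataclass
class AgentRun:
    claim: str
    queries: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)  # short-term memory for this claim
    verdict: Optional[str] = None


def run_agent(
    claim: str,
    search: Callable[[str], List[str]],
    decide: Callable[[str, List[str]], Tuple[str, str]],
) -> AgentRun:
    """decide returns ("search", query) or ("verdict", label)."""
    run = AgentRun(claim)
    for _ in range(MAX_ITERATIONS):
        action, value = decide(claim, run.evidence)
        if action == "verdict":
            run.verdict = value
            return run
        run.queries.append(value)
        run.evidence.extend(search(value))
    # Search budget used up: ask once more for a verdict on the evidence so far.
    action, value = decide(claim, run.evidence)
    run.verdict = value if action == "verdict" else "unverified"  # assumed fallback
    return run
```

Keeping `queries` and `evidence` on the run object is what makes the verdict auditable afterwards.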
Tool Use
- Google Search API
- BM25 for local distillation
- LangChain for orchestration
- ReAct for reasoning+action
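The summary does not say how BM25 is applied, so the sketch below assumes it ranks retrieved text snippets against the claim and keeps the best few. It is a plain-Python Okapi BM25 with whitespace tokenization and common default parameters (`k1=1.5`, `b=0.75`), all of which are assumptions rather than the paper's settings.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Simplistic tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: number of documents containing each term.
    df: Counter = Counter()
    for d in tokenized:
        df.update(set(d))

    query_terms = tokenize(query)
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query: str, docs: List[str], k: int = 3) -> List[str]:
    """Return the k highest-scoring documents, best first."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]
```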
Frameworks
- ReAct
- LangChain
Is Agentic
true
Architectures
- Transformer LLM (GPT family)
Collaboration
- returns reasoning and cited domains so humans can audit
Optimization Features
Token Efficiency
- uses a 16k-token context window and filters site previews to fit within it
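Fitting previews into a fixed window can be sketched as greedy packing under a token budget. The token estimate here (about 4 characters per token) is a rough assumption, and `pack_previews` is a hypothetical helper, not the authors' filter.

```python
from typing import List


def approx_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token for English text.
    return max(1, len(text) // 4)


def pack_previews(previews: List[str], budget_tokens: int) -> List[str]:
    """Keep previews in their given (ranked) order while they fit the budget."""
    kept: List[str] = []
    used = 0
    for preview in previews:
        cost = approx_tokens(preview)
        if used + cost > budget_tokens:
            continue  # too large for the remaining budget; a shorter one may still fit
        kept.append(preview)
        used += cost
    return kept
```

In practice the budget would be the context size minus whatever is reserved for the prompt and the model's answer; a real tokenizer would give a more accurate count than this heuristic.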
Reproducibility
Data Urls
- PolitiFact dataset (Misra, Politifact Fact Check Dataset 2022)
- Data Commons fact-check dump (Data Commons)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Risk of data leakage: training data may contain prior fact-checks.
- Performance varies widely by language and by how verdicts are labeled.
- Poor at fine-grained categories like 'half-true' or 'mostly-false'.
- Relies on Google previews; full-page content was too large for the context window.
- No public code or end-to-end deployment recipes provided.
When Not To Use
- For fully automated high-stakes decisions without human review.
- When verdict nuance matters (legal, medical, or finely graded claims).
- In low-resource languages without reliable translation support.
Failure Modes
- Gives incorrect fine-grained verdicts while sounding confident.
- Biased or incomplete search results lead to wrong conclusions.
- Fails on ambiguous claims that need deeper evidence synthesis.
- May repeat prior fact-check labels due to training data memorization.
Core Entities
Models
- GPT-3.5
- GPT-4
Metrics
- Accuracy
- F1
Datasets
- PolitiFact (sampled 3,000 claims)
- Data Commons multilingual fact-check dump
Context Entities
Models
- GPT-3.5
- GPT-4
Metrics
- Accuracy
- F1 over languages
Datasets
- PolitiFact
- Data Commons fact-check dump

