Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
WKGFC lets products make more accurate, explainable fact checks by combining compact KG evidence with targeted web search and lightweight policy tuning, improving verification accuracy without retraining large LLMs.
Summary TLDR
This paper presents WKGFC, an agentic fact-checking system that treats evidence gathering as a sequential decision problem. It first retrieves a compact subgraph from an open KG (Wikidata) using an LLM-guided expand-and-prune beam search. When the KG evidence is insufficient, it triggers coarse-to-fine web retrieval and converts the selected passages into KG triplets. The agent improves its retrieval behavior by storing episode critiques and optimizing its prompt with TextGrad while the base LLM stays frozen. On mixed benchmarks (Wikipedia, web, and gold-evidence), WKGFC reports an overall balanced accuracy of 74.3%, roughly 5.4 points above the best baseline (FIRE).
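The headline number is a balanced accuracy. As a minimal sketch, assuming the standard definition (the mean of per-class recall, which for binary labels is the average of the true-positive and true-negative rates), it can be computed as below. The labels and predictions are toy values, not results from the paper, and `balanced_accuracy` is an illustrative helper name.

```python
from collections import Counter

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; for binary labels this is (TPR + TNR) / 2."""
    support = Counter(y_true)  # number of gold examples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [correct[label] / n for label, n in support.items()]
    return sum(recalls) / len(recalls)

# Toy data only (assumption): three SUPPORTED claims, one REFUTED claim.
gold = ["SUPPORTED", "SUPPORTED", "SUPPORTED", "REFUTED"]
pred = ["SUPPORTED", "SUPPORTED", "REFUTED", "REFUTED"]
score = balanced_accuracy(gold, pred)  # recall 2/3 and 1/1, averaged
```

Unlike plain accuracy, this metric does not reward a system for always predicting the majority class, which matters when benchmark label distributions are skewed.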
Problem Statement
Text-only retrieval often misses multi-hop factual links and returns passages that are semantically similar but not factually relevant. Pure KG methods give precise relations but lack coverage for open-world claims. Existing systems also lack an adaptive procedure for deciding when to expand the KG and when to search the web.
Main Contribution
Formulates fact-checking as a POMDP in which an agent adaptively chooses between KG expansion, web search, and issuing a verdict.
A KG-first expand-and-prune retrieval pipeline: seed entities from claims, SPARQL expansion, LLM-guided pruning (beam search).
A web module that does coarse BM25 retrieval, LLM filtering, and converts selected passages into KG triplets to fuse with the KG.
Self-reflection + prompt optimization (TextGrad) to improve retrieval/stop policies without updating LLM weights.
Extensive evaluation showing consistent gains across Wikipedia, web-sourced, and gold-evidence benchmarks.
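The sequential decision loop described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the action names follow the tool list in this summary, but `State`, `run_agent`, the toy policy, and the toy tools are all hypothetical stand-ins (in the real system the policy and tools would be LLM- and SPARQL-backed). The forced verdict after `max_steps` mirrors the "exceeding maximum steps" failure mode.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]

@dataclass
class State:
    claim: str
    kg: List[Triple] = field(default_factory=list)   # KG evidence so far
    web: List[Triple] = field(default_factory=list)  # web-derived triplets

def run_agent(claim, policy, tools, max_steps=4):
    """Loop: the policy picks an action until it issues a verdict or runs out of steps."""
    state = State(claim)
    state.kg.extend(tools["initKGRetrieval"](state))
    trace = []
    for _ in range(max_steps):
        action = policy(state)
        if action not in ("expandKG", "webSearch", "verdict"):
            raise ValueError(f"unknown action: {action}")
        trace.append(action)
        if action == "verdict":
            break
        target = state.kg if action == "expandKG" else state.web
        target.extend(tools[action](state))
    else:
        trace.append("verdict")  # step budget exhausted: forced to decide
    return tools["verdict"](state), trace

# Toy stand-ins (assumptions) so the loop can run without an LLM or network.
CLAIM = "Douglas Adams wrote The Hitchhiker's Guide to the Galaxy"
toy_tools = {
    "initKGRetrieval": lambda s: [("Douglas Adams", "occupation", "writer")],
    "expandKG": lambda s: [("Douglas Adams", "notable work",
                            "The Hitchhiker's Guide to the Galaxy")],
    "webSearch": lambda s: [],
    # Crude toy judge: supported if any evidence tail appears in the claim.
    "verdict": lambda s: "SUPPORTED" if any(t[2] in s.claim for t in s.kg + s.web)
               else "REFUTED",
}
toy_policy = lambda s: "expandKG" if len(s.kg) < 2 else "verdict"

label, trace = run_agent(CLAIM, toy_policy, toy_tools)
```

Keeping the policy as a separate callable reflects the design in the summary: retrieval behavior can be tuned (for example through the prompt) without touching the tools or the underlying model.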
Key Findings
WKGFC improves overall balanced accuracy compared to strong baselines.
Strong gains on Wikipedia single-hop verification.
Agentic adaptation helps open-web claims where evidence is scattered.
All modules contribute; removing web or agent lowers accuracy.
Search breadth vs. depth trade-off: larger beam sizes and more hops help multi-hop tasks but raise cost.
Results
Accuracy
Ablation: KG-only vs full
Who Should Care
What To Try In 7 Days
Prototype a KG-first retrieval step using Wikidata SPARQL and spaCy entity linking on a small claim set.
Add a coarse BM25 web fetch plus an LLM filter to accept/reject passages.
Store decision traces and manually refine the prompt that decides when to expand or stop retrieval before automating prompt tuning.
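For the second step, a coarse BM25 pass followed by an accept/reject filter can be prototyped without any dependencies. The sketch below implements the common Okapi BM25 scoring formula with whitespace tokenization; the `accept` callback is a hypothetical stand-in for the LLM filter, and the passages are toy data. Function names and default parameters (`k1=1.5`, `b=0.75`) are assumptions, not values from the paper.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def bm25_rank(query, passages, k1=1.5, b=0.75):
    """Score every passage against the query with Okapi BM25, best first."""
    if not passages:
        return []
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scored = []
    for passage, doc in zip(passages, docs):
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, passage))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def coarse_to_fine(query, passages, accept, top_k=2):
    """Coarse: keep the BM25 top-k. Fine: let a filter accept or reject each one."""
    ranked = bm25_rank(query, passages)[:top_k]
    return [p for score, p in ranked if score > 0 and accept(query, p)]

# Toy passages and a keyword rule standing in for the LLM filter (assumptions).
passages = [
    "Douglas Adams wrote the Hitchhiker's Guide",
    "The weather in Paris is mild in spring",
    "Adams was born in Cambridge in 1952",
]
kept = coarse_to_fine("Douglas Adams born Cambridge", passages,
                      accept=lambda q, p: "born" in p.lower())
```

In a real prototype the `accept` callback would wrap an LLM call that judges whether the passage bears on the claim; keeping it as a parameter makes it easy to log its decisions alongside the agent's traces.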
Agent Features
Memory
- experience buffer of trajectories and structured self-critiques
- prompt as policy parameter (trainable)
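One way to picture the memory component is a bounded buffer of episodes whose critiques feed a prompt revision step. The sketch below is illustrative only: it does not use the TextGrad library, and `Episode`, `ExperienceBuffer`, and `build_revision_request` are hypothetical names. It shows the idea that the policy prompt is the only thing being updated, with the revision itself delegated to an LLM call that is left out here.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Episode:
    claim: str
    actions: List[str]   # e.g. ["expandKG", "webSearch", "verdict"]
    correct: bool
    critique: str        # structured self-critique written after the episode

@dataclass
class ExperienceBuffer:
    capacity: int = 100  # assumed to be at least 1
    episodes: List[Episode] = field(default_factory=list)

    def add(self, episode: Episode) -> None:
        self.episodes.append(episode)
        self.episodes = self.episodes[-self.capacity:]  # drop the oldest

    def failure_critiques(self, limit: int = 3) -> List[str]:
        failed = [e.critique for e in self.episodes if not e.correct]
        return failed[-limit:]

def build_revision_request(policy_prompt: str, buffer: ExperienceBuffer) -> str:
    """Assemble the text an optimizer LLM would see; the LLM call is omitted."""
    feedback = "\n".join(f"- {c}" for c in buffer.failure_critiques())
    return (
        "Current retrieval policy prompt:\n"
        f"{policy_prompt}\n\n"
        "Critiques from recent failed episodes:\n"
        f"{feedback}\n\n"
        "Rewrite the policy prompt to address these critiques."
    )

buffer = ExperienceBuffer(capacity=2)
buffer.add(Episode("claim A", ["verdict"], False, "Stopped before checking the KG."))
buffer.add(Episode("claim B", ["expandKG", "verdict"], True, "Evidence was sufficient."))
buffer.add(Episode("claim C", ["expandKG", "verdict"], False, "Should have searched the web."))
request = build_revision_request("Decide the next retrieval action.", buffer)
```

Because only the prompt string changes between iterations, the base model's weights stay untouched, which is the property the summary highlights.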
Planning
- sequential retrieval actions (initKGRetrieval, expandKG, webSearch, verdict)
- beam search expansion with LLM pruning
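Expand-and-prune beam search can be illustrated on a small in-memory graph. In this sketch, `neighbors` stands in for a SPARQL expansion and `score` stands in for the LLM's relevance judgment; both, along with the toy graph and the hand-assigned relevance values, are assumptions made for illustration.

```python
def expand_and_prune(seeds, neighbors, score, beam_width=2, max_hops=2):
    """Grow a subgraph hop by hop, keeping only the top-scoring triples each hop."""
    frontier = list(seeds)
    visited = set(seeds)
    subgraph = []
    for _ in range(max_hops):
        # Expand: collect every outgoing triple from the current frontier.
        candidates = [(h, r, t) for h in frontier for r, t in neighbors(h)]
        if not candidates:
            break
        # Prune: keep the beam_width most relevant triples.
        kept = sorted(candidates, key=score, reverse=True)[:beam_width]
        subgraph.extend(kept)
        # The next frontier is the set of newly reached entities.
        frontier = [t for _, _, t in kept if t not in visited]
        visited.update(frontier)
    return subgraph

# Toy graph (assumption); a real system would query Wikidata instead.
TOY_KG = {
    "Douglas Adams": [("notable work", "Hitchhiker's Guide"),
                      ("place of birth", "Cambridge"),
                      ("occupation", "writer")],
    "Cambridge": [("country", "United Kingdom"), ("twinned with", "Heidelberg")],
    "Hitchhiker's Guide": [("genre", "science fiction")],
}
# Hand-written relevance for the claim "Douglas Adams was born in the UK",
# standing in for LLM-guided pruning.
RELEVANCE = {"place of birth": 0.9, "country": 0.8, "notable work": 0.2}

path = expand_and_prune(
    seeds=["Douglas Adams"],
    neighbors=lambda entity: TOY_KG.get(entity, []),
    score=lambda triple: RELEVANCE.get(triple[1], 0.1),
    beam_width=1,
    max_hops=2,
)
```

With a beam width of 1 and two hops, the search keeps the birthplace edge and then the country edge, which is the two-hop chain the toy claim needs. Widening the beam or adding hops recovers more context at the price of more expansions and more pruning decisions.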
Tool Use
- initKGRetrieval(claim)
- expandKG(claim,currentKG,topicEntities)
- webSearch(query,currentKG)
- verdict(claim,G,K_web)
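As a rough illustration of what an `expandKG`-style tool might send to Wikidata, the helper below builds a one-hop SPARQL query for a given entity ID. The helper name, the result limit, and the choice to keep only direct ("truthy") properties are assumptions; the summary does not specify the actual queries. The `wd:` prefix is predefined on the Wikidata Query Service, so no `PREFIX` line is included. The function only builds the query string and performs no network call.

```python
import re

QID_PATTERN = re.compile(r"^Q[1-9][0-9]*$")

def one_hop_query(qid: str, limit: int = 50) -> str:
    """Return a SPARQL query for the outgoing direct statements of one entity."""
    if not QID_PATTERN.match(qid):
        # Validate before interpolating, so arbitrary text cannot reach the query.
        raise ValueError(f"not a Wikidata entity ID: {qid!r}")
    return (
        "SELECT ?p ?o WHERE {\n"
        f"  wd:{qid} ?p ?o .\n"
        '  FILTER(STRSTARTS(STR(?p), "http://www.wikidata.org/prop/direct/"))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )

query = one_hop_query("Q42", limit=20)  # Q42 is Douglas Adams on Wikidata
```

The `LIMIT` clause is one simple guard against high-degree entities flooding the candidate set before pruning gets a chance to run.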
Frameworks
- LLM-enabled retrieval control
- prompt-level optimization (TextGrad)
Is Agentic
true
Architectures
- POMDP / LLM agent
- expand-and-prune beam search over KG
Collaboration
- single agent decision loop (paper notes future multi-agent extension)
Optimization Features
Token Efficiency
- pruning and stopping policy reduce extra LLM calls compared to blind multi-round retrieval
System Optimization
- expand-and-prune KG traversal to control graph growth
Training Optimization
- prompt-level optimization using TextGrad over an experience buffer (no LLM weight updates)
Inference Optimization
- adaptive stopping decision reduces unnecessary retrieval calls
- beam-pruning limits KG size
Reproducibility
Data Urls
- FEVER
- HOVER
- LIAR-New
- AveriTeC
- SummEval
- AggreFact-CNN
- PubHealth
Data Available
Open Source Status
- no
Risks & Boundaries
Limitations
- Relies on KG coverage: missing facts in Wikidata cause many errors on non‑wiki claims.
- Web retrieval can be noisy; web evidence is treated as lower‑precision expansions.
- Computational cost: SPARQL queries and many LLM calls scale with beam size and hop depth.
- Evaluation uses a curated protocol (binary labels, removed NEI), which may hide edge cases.
When Not To Use
- When latency or API cost forbids multiple LLM calls and SPARQL queries.
- When the domain has no suitable KG coverage and web sources are also sparse.
- When strict real-time constraints demand a single-shot judge without retrieval.
Failure Modes
- Insufficient KG coverage: agent must trigger web search but evidence still missing.
- Exceeding maximum steps: agent exhausts retrieval without confidence and is forced to guess.
- Over-confidence: agent stops early on partial KG and returns wrong verdict.
- Cost blow-up: deep beams and many hops increase SPARQL and LLM invocations.
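To get a feel for the cost blow-up, a back-of-the-envelope bound helps. The model below is an assumption made for illustration and is not taken from the paper: one SPARQL query per frontier entity, one LLM pruning call per hop, and a frontier that never exceeds the beam width after the first hop.

```python
def retrieval_call_budget(beam_width: int, max_hops: int, seed_entities: int = 1):
    """Upper bound on calls under the simplified cost model described above."""
    if max_hops <= 0:
        return {"sparql": 0, "llm_prune": 0}
    # Hop 1 expands the seeds; each later hop expands at most beam_width entities.
    sparql = seed_entities + beam_width * (max_hops - 1)
    llm_prune = max_hops
    return {"sparql": sparql, "llm_prune": llm_prune}

small = retrieval_call_budget(beam_width=2, max_hops=2, seed_entities=2)
large = retrieval_call_budget(beam_width=5, max_hops=4, seed_entities=2)
```

Under these assumptions the query count grows with the product of beam width and hop depth, so a step cap and an early-stopping policy are the main levers for keeping latency and API spend bounded.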
Core Entities
Models
- WKGFC (Ours)
- GPT-4
- GPT-4o
- Claude 3.5-Sonnet
- Gemini-2.5-flash
- DeepSeek-V3 67B
- Llama3 8B
- Llama3.3 70B
- Qwen2.5 7B
- Qwen2.5 72B
- HerO
- FIRE
- GraphRAG
- GraphCheck
Metrics
- Accuracy
- error rate
- neg rate
Datasets
- FEVER
- HOVER
- LIAR-New
- AveriTeC
- SummEval
- AggreFact-CNN
- PubHealth
Benchmarks
- Accuracy

