An LLM agent that first pulls subgraphs from Wikidata, then triggers focused web search and prompt-based self-improvement to improve fact‑f

Overview

Decision Snapshot: Needs Validation

The paper treats retrieval as an explicit sequential decision problem and shows that prompt-level policy tuning improves retrieval coordination and stopping without changing LLM weights.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: No

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 65%

Authors

Shuzhi Gong, Richard O. Sinnott, Jianzhong Qi, Cecile Paris, Preslav Nakov, Zhuohan Xie

Links

Abstract / PDF / Data

Why It Matters For Business

WKGFC lets products make more accurate, explainable fact checks by combining compact KG evidence with targeted web search and lightweight policy tuning, improving verification accuracy without retraining large LLMs.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist, Engineering Lead

Summary TLDR

This paper presents WKGFC, an agentic fact‑checking system that treats evidence gathering as a sequential decision problem. It first retrieves a compact subgraph from an open KG (Wikidata) using an LLM-guided expand‑and‑prune beam search. When that evidence is insufficient, it triggers coarse‑to‑fine web retrieval and converts the results into KG triplets. The agent learns better retrieval behavior by storing episode critiques and optimizing its prompt with TextGrad while the base LLM stays frozen. On mixed benchmarks (Wikipedia, web, and gold-evidence), WKGFC reports an overall balanced accuracy of 74.3%, about 5.4 points above the best baseline (FIRE).

Problem Statement

Text-only retrieval often misses multi-hop factual links and returns passages that are semantically similar but not factually relevant. Pure KG methods give precise relations but lack coverage for open-world claims. Existing systems also lack an adaptive procedure for deciding when to expand the KG and when to search the web.

Main Contribution

Formulate fact-checking as a POMDP agent that adaptively chooses KG expansion, web search, or verdict.

A KG-first expand-and-prune retrieval pipeline: seed entities from claims, SPARQL expansion, LLM-guided pruning (beam search).
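The expand-and-prune idea can be sketched on a toy graph. This is a minimal illustration under stated assumptions, not the paper's implementation: `TOY_KG` stands in for Wikidata, `overlap_score` is a hypothetical word-overlap stand-in for the LLM relevance judgment, and `beam_width` and `max_hops` are illustrative parameters.

```python
from typing import Callable, Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Toy in-memory graph (assumption); a real system would query Wikidata via SPARQL.
TOY_KG: Dict[str, List[Triple]] = {
    "Marie Curie": [
        ("Marie Curie", "spouse", "Pierre Curie"),
        ("Marie Curie", "award", "Nobel Prize in Physics"),
        ("Marie Curie", "birthplace", "Warsaw"),
    ],
    "Pierre Curie": [
        ("Pierre Curie", "award", "Nobel Prize in Physics"),
        ("Pierre Curie", "birthplace", "Paris"),
    ],
    "Warsaw": [("Warsaw", "country", "Poland")],
}


def overlap_score(claim: str, triple: Triple) -> float:
    """Hypothetical stand-in for LLM pruning: word overlap between claim and triple."""
    claim_words = set(claim.lower().split())
    triple_words = set(" ".join(triple).lower().split())
    return len(claim_words & triple_words) / max(len(triple_words), 1)


def expand_and_prune(
    claim: str,
    seeds: List[str],
    score: Callable[[str, Triple], float] = overlap_score,
    beam_width: int = 2,
    max_hops: int = 2,
) -> List[Triple]:
    """Expand the frontier one hop at a time, keeping only the top-scoring triples."""
    kept: List[Triple] = []
    visited: Set[str] = set(seeds)
    frontier = list(seeds)
    for _hop in range(max_hops):
        # Expand: collect unseen triples that start at any frontier entity.
        candidates: List[Triple] = []
        for entity in frontier:
            for triple in TOY_KG.get(entity, []):
                if triple not in kept and triple not in candidates:
                    candidates.append(triple)
        if not candidates:
            break
        # Prune: keep the beam_width best triples for this hop.
        candidates.sort(key=lambda t: score(claim, t), reverse=True)
        survivors = candidates[:beam_width]
        kept.extend(survivors)
        # The next frontier is the set of newly reached objects.
        frontier = []
        for _subj, _rel, obj in survivors:
            if obj not in visited:
                visited.add(obj)
                frontier.append(obj)
    return kept


subgraph = expand_and_prune("Pierre Curie won the Nobel Prize in Physics", ["Marie Curie"])
```

With these toy inputs the low-scoring birthplace triple for Marie Curie is pruned at the first hop, so the Warsaw branch is never expanded. The width of the beam is the knob that trades recall for graph size.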

Key Findings

WKGFC improves overall balanced accuracy compared to strong baselines.

Numbers: Overall avg: WKGFC 74.3% vs FIRE 68.9% (+5.4)

Practical Use: Combining KG-first retrieval, targeted web search, and prompt-level policy tuning can measurably raise veracity prediction on mixed fact-checking benchmarks.

Evidence Ref: Table 1; Section 5.4

Best reported accuracy on Wikipedia single-hop verification (FEVER), a modest gain of 1.3 points over FIRE.

Numbers: FEVER: WKGFC 91.9% (best reported)

Practical Use: For Wikipedia-like claims, prioritizing KG subgraph retrieval gives precise evidence and high accuracy; implement KG-first as a low-latency step.

Evidence Ref: Table 1; Section 5.4

Results

Metric	Value	Baseline	Delta (pts)	Split / Dataset	Evidence	Evidence Ref
Accuracy	WKGFC 74.3%	FIRE 68.9%	+5.4	Overall (Table 1)	Table 1; Section 5.4	Table 1
Accuracy	WKGFC 91.9%	FIRE 90.6%	+1.3	FEVER (Wikipedia-sourced)	Table 1; Section 5.4	Table 1
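The Delta column is a difference in percentage points, not a relative change. A quick check using only the values in the table (the dictionary layout is illustrative):

```python
# (system accuracy %, baseline accuracy %) copied from the table above.
results = {
    "Overall": (74.3, 68.9),
    "FEVER": (91.9, 90.6),
}


def delta_points(system: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(system - baseline, 1)


deltas = {split: delta_points(*pair) for split, pair in results.items()}
print(deltas)
```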

What To Try In 7 Days

Prototype a KG-first retrieval step using Wikidata SPARQL and spaCy entity linking on a small claim set.

Add a coarse BM25 web fetch plus an LLM filter to accept/reject passages.

Store decision traces and manually refine the prompt that decides when to expand or stop retrieval before automating prompt tuning.
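The coarse fetch plus filter step could be prototyped without external services. The sketch below is a plain-Python Okapi BM25 over an in-memory passage list; `keyword_filter` is a hypothetical placeholder for the LLM accept/reject call, and `k1`, `b`, and `top_k` are common defaults chosen here, not values from the paper.

```python
import math
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = (sum(len(t) for t in tokenized) / n_docs) if n_docs else 0.0
    avg_len = avg_len or 1.0  # guard against empty documents
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def coarse_then_filter(
    query: str,
    docs: List[str],
    accept: Callable[[str, str], bool],
    top_k: int = 2,
) -> List[str]:
    """Coarse BM25 ranking, then a fine accept/reject pass over the top_k passages."""
    ranked = sorted(zip(docs, bm25_scores(query, docs)), key=lambda p: p[1], reverse=True)
    return [doc for doc, score in ranked[:top_k] if score > 0 and accept(query, doc)]


def keyword_filter(query: str, passage: str) -> bool:
    """Hypothetical placeholder for the LLM filter; swap in a real model call."""
    return "nobel" in passage.lower()


passages = [
    "Marie Curie won the Nobel Prize in Physics in 1903",
    "Warsaw is the capital of Poland",
    "The Nobel Prize ceremony is held in Stockholm",
]
accepted = coarse_then_filter("Marie Curie Nobel Prize", passages, keyword_filter)
```

Logging the `(query, passage, accepted)` triples from the filter gives the decision traces that the third step asks for.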

Agent Features

Memory

experience buffer of trajectories and structured self-critiques

prompt as policy parameter (trainable)
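A possible shape for the buffer and the prompt-as-parameter idea is sketched below. All class and field names are hypothetical, and `revise_prompt` simply appends recent failure critiques to the prompt. It is a hand-rolled stand-in for illustration, not the TextGrad optimizer the paper uses.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class Episode:
    claim: str
    actions: List[str]   # trajectory of tool calls, e.g. ["initKGRetrieval", "verdict"]
    verdict: str
    correct: bool
    critique: str        # structured self-critique written after the episode


@dataclass
class ExperienceBuffer:
    capacity: int = 100
    episodes: Deque[Episode] = field(default_factory=deque)

    def add(self, episode: Episode) -> None:
        """Append an episode and evict the oldest ones beyond capacity."""
        self.episodes.append(episode)
        while len(self.episodes) > self.capacity:
            self.episodes.popleft()

    def failure_critiques(self, limit: int = 3) -> List[str]:
        """Return the most recent critiques from episodes with a wrong verdict."""
        failed = [e.critique for e in self.episodes if not e.correct]
        return failed[-limit:]


def revise_prompt(base_prompt: str, buffer: ExperienceBuffer) -> str:
    """Naive prompt update: the prompt is the only 'parameter' that changes."""
    lessons = buffer.failure_critiques()
    if not lessons:
        return base_prompt
    bullets = "\n".join(f"- {c}" for c in lessons)
    return f"{base_prompt}\n\nLessons from past episodes:\n{bullets}"


buffer = ExperienceBuffer(capacity=2)
buffer.add(Episode("claim A", ["initKGRetrieval", "verdict"], "SUPPORTED", True, "ok"))
buffer.add(Episode("claim B", ["initKGRetrieval", "verdict"], "REFUTED", False,
                   "Stopped too early; should have searched the web."))
buffer.add(Episode("claim C", ["initKGRetrieval", "webSearch", "verdict"], "SUPPORTED", True, "ok"))
prompt = revise_prompt("Decide the next retrieval action.", buffer)
```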

Planning

sequential retrieval actions (initKGRetrieval, expandKG, webSearch, verdict)

beam search expansion with LLM pruning

Tool Use

initKGRetrieval(claim)

expandKG(claim, currentKG, topicEntities)

webSearch(query, currentKG)

verdict(claim, G, K_web)
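The four tools suggest a simple decision loop with a step budget. The sketch below is an assumption-laden illustration: the tool bodies are stubs, `rule_policy` is a hard-coded stand-in for the prompt-driven LLM policy, the tools take a shared `State` rather than the argument lists shown above, and the verdict labels are placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class State:
    claim: str
    kg: List[Triple] = field(default_factory=list)    # KG evidence gathered so far
    web: List[Triple] = field(default_factory=list)   # web evidence converted to triplets
    trace: List[str] = field(default_factory=list)    # actions taken, for later critique


# Stub tools (assumptions): each one just records a placeholder triple.
def init_kg_retrieval(state: State) -> None:
    state.kg.append(("seed", "mentioned_in", state.claim))


def expand_kg(state: State) -> None:
    state.kg.append(("seed", "related_to", "neighbor"))


def web_search(state: State) -> None:
    state.web.append(("web_page", "states", state.claim))


TOOLS: Dict[str, Callable[[State], None]] = {
    "initKGRetrieval": init_kg_retrieval,
    "expandKG": expand_kg,
    "webSearch": web_search,
}


def rule_policy(state: State) -> str:
    """Hypothetical stand-in for the LLM policy that picks the next action."""
    if not state.kg:
        return "initKGRetrieval"
    if len(state.kg) < 2:
        return "expandKG"
    if not state.web:
        return "webSearch"
    return "verdict"


def decide(state: State) -> str:
    """Placeholder verdict: enough evidence items means SUPPORTED."""
    return "SUPPORTED" if len(state.kg) + len(state.web) >= 3 else "NOT ENOUGH INFO"


def run_agent(claim: str, policy: Callable[[State], str] = rule_policy,
              max_steps: int = 5) -> Tuple[str, List[str]]:
    state = State(claim)
    for _step in range(max_steps):
        action = policy(state)
        state.trace.append(action)
        if action == "verdict":
            return decide(state), state.trace
        TOOLS[action](state)
    # Budget exhausted: the agent must answer with whatever evidence it has.
    state.trace.append("forced_verdict")
    return decide(state), state.trace


label, trace = run_agent("Marie Curie was born in Warsaw")
```

Lowering `max_steps` shows the step-budget failure case: the loop ends with a forced verdict on partial evidence.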

Frameworks

LLM-enabled retrieval control

prompt-level optimization (TextGrad)

Is Agentic

Yes

Architectures

POMDP / LLM agent

expand-and-prune beam search over KG

Collaboration

single agent decision loop (paper notes future multi-agent extension)

Optimization Features

Token Efficiency

pruning and stopping policy reduce extra LLM calls compared to blind multi-round retrieval

System Optimization

expand-and-prune KG traversal to control graph growth
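One way to bound graph growth at the query level is a per-entity cap on one-hop expansion. The helper below only builds a query string for the Wikidata Query Service, where the `wd:` prefix is predefined; the function name, the `limit` default, and the choice to keep only direct-claim properties are assumptions for illustration.

```python
import re


def one_hop_query(qid: str, limit: int = 50) -> str:
    """Build a SPARQL query for the outgoing direct-claim triples of one Wikidata item."""
    # Validate the identifier so arbitrary text cannot be spliced into the query.
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"not a Wikidata item id: {qid!r}")
    if limit < 1:
        raise ValueError("limit must be positive")
    return (
        "SELECT ?p ?o WHERE {\n"
        f"  wd:{qid} ?p ?o .\n"
        '  FILTER(STRSTARTS(STR(?p), "http://www.wikidata.org/prop/direct/"))\n'
        "}\n"
        f"LIMIT {limit}"
    )


query = one_hop_query("Q42", limit=10)
print(query)
```

The `LIMIT` caps how many candidate triples each expansion hands to the pruning step, which keeps the cost of each LLM scoring call predictable.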

Training Optimization

prompt-level optimization using TextGrad over an experience buffer (no LLM weight updates)

Inference Optimization

adaptive stopping decision reduces unnecessary retrieval calls

beam-pruning limits KG size

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: No

License: Unknown

Data URLs

FEVER, HOVER, LIAR-New, AveriTeC, SummEval, AggreFact-CNN, PubHealth

Risks & Boundaries

Limitations

Relies on KG coverage: missing facts in Wikidata cause many errors on non‑wiki claims.

Web retrieval can be noisy; web evidence is treated as lower‑precision expansions.

When Not To Use

When latency or API cost forbids multiple LLM calls and SPARQL queries.

When the domain has no suitable KG coverage and web sources are also sparse.

Failure Modes

Insufficient KG coverage: the agent triggers web search, but the needed evidence is still missing.

Exceeding maximum steps: agent exhausts retrieval without confidence and is forced to guess.

Core Entities

Models

WKGFC (Ours), GPT-4, GPT-4o, Claude 3.5-Sonnet, Gemini-2.5-flash, DeepSeek-V3 67B, Llama3 8B, Llama3.3 70B, Qwen2.5 7B, Qwen2.5 72B, HerO, FIRE, GraphRAG, GraphCheck

Metrics

Accuracy, error rate, neg rate

Datasets

FEVER, HOVER, LIAR-New, AveriTeC, SummEval, AggreFact-CNN, PubHealth

Benchmarks

Accuracy
