An LLM agent that first pulls subgraphs from Wikidata, then triggers focused web search and prompt-based self-improvement to improve fact‑f​

February 27, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.5

Citation Count

0

Authors

Shuzhi Gong, Richard O. Sinnott, Jianzhong Qi, Cecile Paris, Preslav Nakov, Zhuohan Xie

Links

Abstract / PDF

Why It Matters For Business

WKGFC lets products make more accurate, explainable fact checks by combining compact KG evidence with targeted web search and lightweight policy tuning, improving verification accuracy without retraining large LLMs.

Summary TLDR

This paper presents WKGFC, an agentic fact-checking system that treats evidence gathering as a sequential decision problem. It first retrieves a compact subgraph from an open KG (Wikidata) using an LLM-guided expand-and-prune beam search and, when needed, triggers coarse-to-fine web retrieval whose results are converted into KG triplets. The agent improves its retrieval behavior by storing episode critiques and optimizing its prompt with TextGrad while the base LLM stays frozen. On mixed benchmarks (Wikipedia, web, and gold-evidence), WKGFC reports an overall balanced accuracy of 74.3%, about 5.4 points above the best baseline (FIRE).

Problem Statement

Text-only retrieval often misses multi-hop factual links and returns passages that are semantically similar but not factually relevant. Pure KG methods give precise relations but lack coverage for open-world claims. Existing systems also lack an adaptive procedure for deciding when to expand the KG and when to search the web.

Main Contribution

A formulation of fact-checking as a POMDP in which an agent adaptively chooses among KG expansion, web search, and issuing a verdict.

A KG-first expand-and-prune retrieval pipeline: seed entities from claims, SPARQL expansion, LLM-guided pruning (beam search).

A web module that does coarse BM25 retrieval, LLM filtering, and converts selected passages into KG triplets to fuse with the KG.

Self-reflection + prompt optimization (TextGrad) to improve retrieval/stop policies without updating LLM weights.

Extensive evaluation showing consistent gains across Wikipedia, web-sourced, and gold-evidence benchmarks.
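
The coarse-to-fine web step described above (BM25 retrieval followed by a filter) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the BM25 scorer is a standard Okapi formulation written from scratch, the parameter defaults `k1=1.5` and `b=0.75` are common choices rather than values from the paper, and `accept` is a hypothetical stand-in for the LLM filter. Converting the accepted passages into KG triplets is not shown.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; any tokenizer would do)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = (sum(len(t) for t in tokenized) / n_docs) or 1.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def coarse_to_fine(query, docs, top_n, accept):
    """Coarse stage: keep the top_n BM25 hits. Fine stage: apply the filter.

    `accept(query, passage) -> bool` is a placeholder for the LLM filter.
    """
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [docs[i] for i in ranked if scores[i] > 0 and accept(query, docs[i])]


docs = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "Bananas are yellow",
]
# Keyword check used here only so the example runs without an LLM.
kept = coarse_to_fine("capital of France", docs, top_n=2,
                      accept=lambda q, d: "France" in d)
print(kept)
```

Keeping the filter as a plain callable makes it easy to swap the keyword check for a real LLM call later without touching the retrieval code.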

Key Findings

WKGFC improves overall balanced accuracy compared to strong baselines.

Numbers: Overall avg: WKGFC 74.3% vs FIRE 68.9% (+5.4)

Strong gains on Wikipedia single-hop verification.

Numbers: FEVER: WKGFC 91.9% (best reported)

Agentic adaptation helps open-web claims where evidence is scattered.

Numbers: LIAR-New: WKGFC 81.3% vs HerO 70.2% (+11.1)

All modules contribute; removing web or agent lowers accuracy.

Numbers: FEVER: KG-only 72.3% -> full WKGFC 91.9% (+19.6)

Search breadth vs. depth trade-off: larger beam sizes and more hops help multi-hop tasks but raise cost.

Numbers: Parameter sweep finds FEVER saturates at beam k=4; HOVER benefits from N≥4 hops
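
To make the breadth vs. depth trade-off concrete, here is a rough back-of-envelope budget. The cost model is an assumption for illustration only, not taken from the paper: one SPARQL query per frontier entity per hop (the seed entities on the first hop, at most k entities afterwards) and a fixed number of LLM pruning calls per hop. The function name and parameters are hypothetical.

```python
def retrieval_budget(n_seeds, beam_k, hops_n, llm_calls_per_hop=1):
    """Upper bound on retrieval calls under the assumed cost model.

    Assumptions (not from the paper):
      - hop 1 expands each seed entity with one SPARQL query;
      - every later hop expands at most beam_k frontier entities;
      - each hop triggers llm_calls_per_hop LLM pruning calls.
    """
    if hops_n < 1:
        return {"sparql_queries": 0, "llm_calls": 0}
    sparql = n_seeds + beam_k * (hops_n - 1)
    llm = llm_calls_per_hop * hops_n
    return {"sparql_queries": sparql, "llm_calls": llm}


# Example: 2 seed entities, beam k=4, N=4 hops.
print(retrieval_budget(n_seeds=2, beam_k=4, hops_n=4))
```

Under this model the SPARQL count grows linearly in both k and N, which is why a task that saturates at a small beam gains little from a wider one.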

Results

Accuracy (overall average)

Value: WKGFC 74.3%

Baseline: FIRE 68.9%

Accuracy (FEVER)

Value: WKGFC 91.9%

Baseline: FIRE 90.6%

Accuracy (LIAR-New)

Value: WKGFC 81.3%

Baseline: HerO 70.2%

Ablation: KG-only vs full

Value: KG-only FEVER 72.3% -> full WKGFC 91.9%

Who Should Care

What To Try In 7 Days

Prototype a KG-first retrieval step using Wikidata SPARQL and spaCy entity linking on a small claim set.

Add a coarse BM25 web fetch plus an LLM filter to accept/reject passages.

Store decision traces and manually refine the prompt that decides when to expand or stop retrieval before automating prompt tuning.
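
For the first step, a one-hop Wikidata expansion query might look like the sketch below. It only builds the SPARQL string; nothing is sent over the network. The helper name `one_hop_query` and the `limit` default are illustrative assumptions, and the filter on direct-property URIs is one possible way to keep the result small, not necessarily what the paper does. Entity linking (for example with spaCy) is assumed to have already produced the QID.

```python
import re

# Public Wikidata Query Service endpoint; the query below could be sent there
# as an HTTP GET with `query` and `format=json` parameters.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def one_hop_query(qid, limit=50):
    """Build a SPARQL query returning direct (truthy) statements about `qid`."""
    # Validate the identifier so arbitrary text cannot be injected into the query.
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"not a Wikidata item id: {qid!r}")
    if limit < 1:
        raise ValueError("limit must be positive")
    return (
        "SELECT ?p ?o WHERE {\n"
        f"  wd:{qid} ?p ?o .\n"
        # Keep only direct-property edges (wdt: namespace), dropping labels,
        # sitelinks and statement nodes.
        '  FILTER(STRSTARTS(STR(?p), "http://www.wikidata.org/prop/direct/"))\n'
        "}\n"
        f"LIMIT {limit}"
    )


# Q42 is the Wikidata item for Douglas Adams.
print(one_hop_query("Q42", limit=10))
```

Logging each generated query together with the claim it served also gives the decision traces mentioned in the third step a concrete form.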

Agent Features

Memory

  • experience buffer of trajectories and structured self-critiques
  • prompt as policy parameter (trainable)
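
One possible shape for the experience buffer is sketched below. All class and field names are hypothetical; the source only states that trajectories and structured self-critiques are stored and that the prompt acts as the trainable policy parameter. The prompt-optimization step itself (TextGrad) is not shown.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One fact-checking trajectory plus the agent's self-critique."""
    claim: str
    actions: list = field(default_factory=list)  # e.g. ["initKGRetrieval", "verdict"]
    verdict: str = ""
    gold: str = ""
    critique: str = ""  # free-text or structured reflection on the episode

    @property
    def correct(self):
        return self.verdict == self.gold


class ExperienceBuffer:
    """Bounded FIFO store; the oldest episodes are evicted first."""

    def __init__(self, capacity=100):
        self._episodes = deque(maxlen=capacity)

    def add(self, episode):
        self._episodes.append(episode)

    def failures(self):
        """Episodes with a wrong verdict, a natural input for prompt revision."""
        return [e for e in self._episodes if not e.correct]

    def __len__(self):
        return len(self._episodes)


buf = ExperienceBuffer(capacity=2)
buf.add(Episode("claim A", ["initKGRetrieval", "verdict"], "SUPPORTED", "SUPPORTED"))
buf.add(Episode("claim B", ["initKGRetrieval", "verdict"], "SUPPORTED", "REFUTED",
                critique="stopped before checking the web"))
buf.add(Episode("claim C", ["initKGRetrieval", "webSearch", "verdict"], "REFUTED", "REFUTED"))
print(len(buf), [e.claim for e in buf.failures()])
```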

Planning

  • sequential retrieval actions (initKGRetrieval, expandKG, webSearch, verdict)
  • beam search expansion with LLM pruning

Tool Use

  • initKGRetrieval(claim)
  • expandKG(claim,currentKG,topicEntities)
  • webSearch(query,currentKG)
  • verdict(claim,G,K_web)
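
The four tools above suggest a simple decision loop, sketched here under stated assumptions. The tool names follow the list, but the stub bodies, the `policy` callable (a stand-in for the prompted LLM), the state layout, and the `max_steps` budget are all hypothetical. The forced verdict at the end mirrors the case where the agent runs out of steps.

```python
def run_agent(claim, policy, tools, max_steps=6):
    """Run retrieval actions until the policy issues a verdict or the budget ends."""
    state = {
        "claim": claim,
        "kg": set(tools["initKGRetrieval"](claim)),  # set of (head, relation, tail)
        "web": [],
        "trace": ["initKGRetrieval"],
    }
    for _ in range(max_steps):
        action = policy(state)
        state["trace"].append(action)
        if action == "expandKG":
            topics = {h for h, _, _ in state["kg"]}
            state["kg"] |= set(tools["expandKG"](claim, state["kg"], topics))
        elif action == "webSearch":
            state["web"] += tools["webSearch"](claim, state["kg"])
        elif action == "verdict":
            return tools["verdict"](claim, state["kg"], state["web"]), state["trace"]
        else:
            raise ValueError(f"unknown action: {action!r}")
    # Budget exhausted: the agent has to commit to a verdict anyway.
    state["trace"].append("verdict (forced)")
    return tools["verdict"](claim, state["kg"], state["web"]), state["trace"]


# Toy stubs so the loop runs without a KG, a search engine, or an LLM.
toy_tools = {
    "initKGRetrieval": lambda claim: {("Paris", "capital of", "France")},
    "expandKG": lambda claim, kg, topics: {("France", "continent", "Europe")},
    "webSearch": lambda query, kg: ["passage about Paris"],
    "verdict": lambda claim, kg, web: (
        "SUPPORTED" if ("Paris", "capital of", "France") in kg else "REFUTED"
    ),
}


def toy_policy(state):
    """Expand once, then decide (a real policy would be a prompted LLM)."""
    return "expandKG" if len(state["kg"]) < 2 else "verdict"


print(run_agent("Paris is the capital of France", toy_policy, toy_tools))
```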

Frameworks

  • LLM-enabled retrieval control
  • prompt-level optimization (TextGrad)

Is Agentic

true

Architectures

  • POMDP / LLM agent
  • expand-and-prune beam search over KG
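
A minimal sketch of expand-and-prune beam search over a KG, run on a toy in-memory graph. The graph, the `neighbors` callable (which would wrap SPARQL in practice), and the token-overlap `score` (a stand-in for LLM-guided pruning) are assumptions made so the example is self-contained; they are not the paper's components.

```python
def expand_and_prune(seeds, neighbors, score, beam_k=2, max_hops=2):
    """Grow a subgraph hop by hop, keeping the beam_k best triples per hop."""
    subgraph = []
    frontier = list(seeds)
    visited = set(seeds)
    for _ in range(max_hops):
        # Expand: collect every outgoing triple of the current frontier.
        candidates = [(h, r, t) for h in frontier for r, t in neighbors(h)]
        candidates = [c for c in candidates if c not in subgraph]
        if not candidates:
            break
        # Prune: keep only the highest-scoring triples.
        kept = sorted(candidates, key=score, reverse=True)[:beam_k]
        subgraph.extend(kept)
        # Next frontier: newly reached tail entities, deduplicated in order.
        frontier = list(dict.fromkeys(t for _, _, t in kept if t not in visited))
        visited.update(frontier)
    return subgraph


# Toy graph: entity -> list of (relation, tail).
TOY_KG = {
    "Marie Curie": [("born in", "Warsaw"),
                    ("award", "Nobel Prize in Physics"),
                    ("spouse", "Pierre Curie")],
    "Warsaw": [("capital of", "Poland"), ("located on", "Vistula")],
    "Nobel Prize in Physics": [("awarded by", "Royal Swedish Academy of Sciences")],
}
CLAIM = "Marie Curie was born in the capital of Poland"


def overlap_score(triple):
    """Token overlap between the claim and the triple text (LLM stand-in)."""
    claim_tokens = set(CLAIM.lower().split())
    return len(claim_tokens & set(" ".join(triple).lower().split()))


path = expand_and_prune(["Marie Curie"], lambda e: TOY_KG.get(e, []),
                        overlap_score, beam_k=1, max_hops=2)
print(path)
```

With `beam_k=1` the search follows a single chain; raising it keeps more alternatives per hop at the price of more expansions on the next hop.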

Collaboration

  • single agent decision loop (paper notes future multi-agent extension)

Optimization Features

Token Efficiency

  • pruning and stopping policy reduce extra LLM calls compared to blind multi-round retrieval

System Optimization

  • expand-and-prune KG traversal to control graph growth

Training Optimization

  • prompt-level optimization using TextGrad over an experience buffer (no LLM weight updates)

Inference Optimization

  • adaptive stopping decision reduces unnecessary retrieval calls
  • beam-pruning limits KG size

Reproducibility

Data Urls

  • FEVER
  • HOVER
  • LIAR-New
  • AveriTeC
  • SummEval
  • AggreFact-CNN
  • PubHealth

Data Available

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Relies on KG coverage: missing facts in Wikidata cause many errors on non‑wiki claims.
  • Web retrieval can be noisy; web evidence is treated as lower‑precision expansions.
  • Computational cost: SPARQL queries and many LLM calls scale with beam size and hop depth.
  • Evaluation uses a curated protocol (binary labels, removed NEI), which may hide edge cases.
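
To illustrate what a binary protocol with NEI removed implies for scoring, here is a small sketch that drops NEI items and computes balanced accuracy as the mean of per-class recall. The label strings are hypothetical placeholders, and whether the paper filters on the gold label in exactly this way is an assumption.

```python
BINARY_LABELS = ("SUPPORTED", "REFUTED")  # hypothetical label names


def to_binary(examples):
    """Keep only (gold, predicted) pairs whose gold label is binary; drop NEI."""
    return [(g, p) for g, p in examples if g in BINARY_LABELS]


def balanced_accuracy(pairs):
    """Mean of per-class recall over the classes present in the gold labels."""
    recalls = []
    for label in BINARY_LABELS:
        preds = [p for g, p in pairs if g == label]
        if preds:
            recalls.append(sum(p == label for p in preds) / len(preds))
    if not recalls:
        raise ValueError("no binary-labelled examples to score")
    return sum(recalls) / len(recalls)


examples = [
    ("SUPPORTED", "SUPPORTED"),
    ("SUPPORTED", "REFUTED"),
    ("REFUTED", "REFUTED"),
    ("NEI", "SUPPORTED"),  # removed by the binary protocol
]
pairs = to_binary(examples)
print(len(pairs), balanced_accuracy(pairs))  # recall 0.5 and 1.0 average to 0.75
```

Because NEI items never reach the scorer, a system that handles "not enough information" poorly is not penalized, which is the edge case the limitation above points to.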

When Not To Use

  • When latency or API cost forbids multiple LLM calls and SPARQL queries.
  • When the domain has no suitable KG coverage and web sources are also sparse.
  • When strict real-time constraints demand a single-shot judge without retrieval.

Failure Modes

  • Insufficient KG coverage: agent must trigger web search but evidence still missing.
  • Exceeding maximum steps: agent exhausts retrieval without confidence and is forced to guess.
  • Over-confidence: agent stops early on partial KG and returns wrong verdict.
  • Cost blow-up: deep beams and many hops increase SPARQL and LLM invocations.

Core Entities

Models

  • WKGFC
  • GPT-4
  • GPT-4o
  • Claude 3.5-Sonnet
  • Gemini-2.5-flash
  • DeepSeek-V3 67B
  • Llama3 8B
  • Llama3.3 70B
  • Qwen2.5 7B
  • Qwen2.5 72B
  • HerO
  • FIRE
  • GraphRAG
  • GraphCheck

Metrics

  • Accuracy
  • error rate
  • neg rate

Datasets

  • FEVER
  • HOVER
  • LIAR-New
  • AveriTeC
  • SummEval
  • AggreFact-CNN
  • PubHealth

Benchmarks

  • Accuracy