FiDeLiS: narrow retrieval + deductive beam search to ground LLM answers in KG paths

May 22, 20248 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

3

Authors

Yuan Sui, Yufei He, Nian Liu, Xiaoxin He, Kun Wang, Bryan Hooi

Links

Abstract / PDF

Why It Matters For Business

FiDeLiS improves factual QA without model retraining by combining KG retrieval and stepwise logic checks, raising answer accuracy and cutting runtime—useful where auditability and verifiable facts matter.

Summary TLDR

FiDeLiS is a training-free framework that grounds LLM answers in verifiable knowledge-graph (KG) reasoning paths. It has two parts: Path-RAG, which preselects a small set of KG entity/relation candidates using dense embeddings + graph connectivity; and DVBS, a beam search guided by LLM planning with stepwise deductive verification to stop when the question is provably answered. Across KGQA benchmarks (WebQSP, CWQ, CR-LT) FiDeLiS raises accuracy vs prior KG+LLM methods and cuts runtime vs agentic baselines, while producing shorter, more verifiable reasoning paths. Code is released.

Problem Statement

LLMs often hallucinate or produce invalid multi-step reasoning. Knowledge graphs can anchor reasoning but prior approaches either (a) retrieve imprecise facts or miss graph structure, or (b) treat LLMs as agents and incur high latency. The field needs a method that is both faithful (verifiable steps) and efficient (limited LLM calls and search). RoG's analysis shows only 67% of generated reasoning steps were valid, leaving room for improvement.

Main Contribution

FiDeLiS: a training-free pipeline that combines Path-RAG (retrieval) and Deductive-Verification Beam Search (DVBS) to ground LLM answers in KG paths.

Path-RAG narrows KG search by combining semantic similarity with one-hop structural connectivity to preselect high-quality candidates.

DVBS uses LLM planning plus local/global deductive checks to validate each step and stop when the question is deducible, improving factuality and interpretability.

Key Findings

FiDeLiS improves top-answer accuracy on WebQSP with strong LLMs.

NumbersWebQSP Hits@1 84.39% (FiDeLiS, GPT‑4‑turbo) vs 81.84% (ToG)

Path-RAG meaningfully increases retrieval quality vs vanilla retrievers.

NumbersWebQSP Hits@1 79.32% (Path‑RAG) vs 72.35% (vanilla retriever, same embeddings)

Deductive verification shortens reasoning to be closer to ground truth depth.

NumbersAverage path depth WebQSP: GT 2.3, ToG 3.1, FiDeLiS 2.4

Beam search is essential for quality; removing it collapses accuracy.

NumbersWebQSP Hits@1 drops 79.32% → 60.35% when beam-search removed

FiDeLiS reduces runtime vs an agentic baseline by roughly 1.7×.

NumbersAvg runtime per question 43.83s (FiDeLiS) vs 74.26s (ToG without Path‑RAG)

Results

WebQSP Hits@1

Value84.39% (FiDeLiS, GPT‑4‑turbo)

Baseline81.84% (ToG, GPT‑4‑turbo)

CWQ Hits@1

Value71.47% (FiDeLiS, GPT‑4‑turbo)

Baseline68.51% (ToG, GPT‑4‑turbo)

Accuracy

Value72.12% (FiDeLiS, GPT‑4‑turbo)

Baseline67.24% (ToG, GPT‑4‑turbo)

Runtime per question

Value43.83s (FiDeLiS, WebQSP default)

Baseline74.26s (ToG without Path-RAG)

Ablation: remove beam search

Value60.35% Hits@1 (no beam)

Baseline79.32% Hits@1 (full FiDeLiS)

Who Should Care

What To Try In 7 Days

Run FiDeLiS end-to-end on a small KGQA subset to compare Hits@1 and runtime versus your current pipeline.

Swap your retriever for Path‑RAG (embed entities+relations and add one‑hop scoring) to reduce candidate load.

Add a simple deductive-check prompt to stop reasoning when the answer can be logically deduced.

Agent Features

Memory

  • pre-embedded KG vectors in nearest-neighbor index

Planning

  • LLM-generated planning steps (plan prompts to guide search)

Tool Use

  • LLM as a decision agent for step selection
  • KG nearest-neighbor retrieval index

Frameworks

  • Path-RAG
  • DVBS
  • FiDeLiS

Is Agentic

true

Architectures

  • beam search
  • retrieval-augmented generation

Collaboration

  • retriever + LLM decision loop

Optimization Features

Token Efficiency

  • Reduced average token usage vs ToG (see Table 7)

System Optimization

  • precompute embeddings for entities and relations
  • use nearest-neighbor index for fast retrieval

Training Optimization

  • training-free design (no finetuning required)

Inference Optimization

  • Path‑RAG candidate pruning
  • deductive verification that enables early stopping
  • beam pruning to keep top-K paths

Reproducibility

Data Urls

  • Freebase subgraphs (used for WebQSP/CWQ, as in RoG)
  • Wikidata (used for CR-LT-KGQA)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Performance depends on KG quality and coverage; missing or outdated KG facts can still cause failures.
  • Beam search still requires multiple LLM calls; latency remains non-trivial for deep multi-hop cases.
  • Deductive verification relies on LLM judgment prompts and can inherit LLM bias or mistakes.
  • Tuned hyperparameters (α, beam width/depth) need dataset-specific calibration.

When Not To Use

  • When no structured KG is available or building one is infeasible.
  • For ultra-low-latency real-time services where tens of seconds per query are unacceptable.
  • For open-ended free-text answers that cannot be expressed as KG path derivations.

Failure Modes

  • Incorrect deductive-verifier judgments lead to premature stopping or false positives.
  • Path-RAG may still miss correct candidates if KG lacks relation labels or entity surface forms.
  • Beam search width/depth misconfiguration either wastes resources or misses paths.

Core Entities

Models

  • gpt-4-turbo
  • gpt-3.5-turbo
  • gpt-4o
  • gpt-4o-mini
  • llama-2-13b
  • mixtral-7b
  • mistral-7b

Metrics

  • Hits@1
  • F1
  • Accuracy
  • Runtime (s/question)
  • Token usage
  • Coverage Ratio (CR)
  • Validity Ratio (VR)

Datasets

  • WebQSP
  • Complex WebQuestions (CWQ)
  • CR-LT-KGQA
  • MedQA-USMIE (subset for robustness)

Benchmarks

  • KGQA (WebQSP, CWQ, CR-LT-KGQA)

Context Entities

Models

  • OpenAI text-embedding-3-small
  • SentenceBERT
  • E5
  • BM25

Datasets

  • Freebase (subgraphs)
  • Wikidata (subgraphs)