FiDeLiS: narrow retrieval + deductive beam search to ground LLM answers in KG paths

Overview

Decision SnapshotReady For Pilot

Practical for KG-backed QA: it increases answer fidelity and lowers runtime vs agentic baselines, but still relies on many LLM calls and KG coverage; expect moderate engineering effort to index KGs and tune beam parameters.

Citations3

Evidence Strength0.80

Confidence0.90

Risk Signals10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yuan Sui, Yufei He, Nian Liu, Xiaoxin He, Kun Wang, Bryan Hooi

Links

Abstract / PDF / Code / Data

Why It Matters For Business

FiDeLiS improves factual QA without model retraining by combining KG retrieval and stepwise logic checks, raising answer accuracy and cutting runtime—useful where auditability and verifiable facts matter.

Who Should Care

CTO Product Manager ML Engineer Data Scientist Engineering Lead

Summary TLDR

FiDeLiS is a training-free framework that grounds LLM answers in verifiable knowledge-graph (KG) reasoning paths. It has two parts: Path-RAG, which preselects a small set of KG entity/relation candidates using dense embeddings + graph connectivity; and DVBS, a beam search guided by LLM planning with stepwise deductive verification to stop when the question is provably answered. Across KGQA benchmarks (WebQSP, CWQ, CR-LT) FiDeLiS raises accuracy vs prior KG+LLM methods and cuts runtime vs agentic baselines, while producing shorter, more verifiable reasoning paths. Code is released.

Problem Statement

LLMs often hallucinate or produce invalid multi-step reasoning. Knowledge graphs can anchor reasoning but prior approaches either (a) retrieve imprecise facts or miss graph structure, or (b) treat LLMs as agents and incur high latency. The field needs a method that is both faithful (verifiable steps) and efficient (limited LLM calls and search). RoG's analysis shows only 67% of generated reasoning steps were valid, leaving room for improvement.

Main Contribution

FiDeLiS: a training-free pipeline that combines Path-RAG (retrieval) and Deductive-Verification Beam Search (DVBS) to ground LLM answers in KG paths.

Path-RAG narrows KG search by combining semantic similarity with one-hop structural connectivity to preselect high-quality candidates.

Key Findings

FiDeLiS improves top-answer accuracy on WebQSP with strong LLMs.

NumbersWebQSP Hits@1 84.39% (FiDeLiS, GPT‑4‑turbo) vs 81.84% (ToG)

Practical UseIf you run KG-backed QA with a powerful LLM, plug in FiDeLiS to gain a few points of Hits@1 without model finetuning.

Evidence RefTable 1

Path-RAG meaningfully increases retrieval quality vs vanilla retrievers.

NumbersWebQSP Hits@1 79.32% (Path‑RAG) vs 72.35% (vanilla retriever, same embeddings)

Practical UsePre-filter KG candidates by combining embeddings and graph neighbors to boost answer recall and reduce search work.

Evidence RefTable 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
WebQSP Hits@1	84.39% (FiDeLiS, GPT‑4‑turbo)	81.84% (ToG, GPT‑4‑turbo)	+2.55pp	WebQSP	Table 1: FiDeLiS vs ToG on WebQSP	Table 1
CWQ Hits@1	71.47% (FiDeLiS, GPT‑4‑turbo)	68.51% (ToG, GPT‑4‑turbo)	+2.96pp	CWQ	Table 1: FiDeLiS vs ToG on CWQ	Table 1

What To Try In 7 Days

Run FiDeLiS end-to-end on a small KGQA subset to compare Hits@1 and runtime versus your current pipeline.

Swap your retriever for Path‑RAG (embed entities+relations and add one‑hop scoring) to reduce candidate load.

Add a simple deductive-check prompt to stop reasoning when the answer can be logically deduced.

Agent Features

Memory

pre-embedded KG vectors in nearest-neighbor index

Planning

LLM-generated planning steps (plan prompts to guide search)

Tool Use

LLM as a decision agent for step selectionKG nearest-neighbor retrieval index

Frameworks

Path-RAGDVBSFiDeLiS

Is Agentic

Yes

Architectures

beam searchretrieval-augmented generation

Collaboration

retriever + LLM decision loop

Optimization Features

Token Efficiency

Reduced average token usage vs ToG (see Table 7)

System Optimization

precompute embeddings for entities and relationsuse nearest-neighbor index for fast retrieval

Training Optimization

training-free design (no finetuning required)

Inference Optimization

Path‑RAG candidate pruningdeductive verification that enables early stoppingbeam pruning to keep top-K paths

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Code URLs

https://github.com/Y-Sui/FiDeLiS

Data URLs

Freebase subgraphs (used for WebQSP/CWQ, as in RoG)Wikidata (used for CR-LT-KGQA)

Risks & Boundaries

Limitations

Performance depends on KG quality and coverage; missing or outdated KG facts can still cause failures.

Beam search still requires multiple LLM calls; latency remains non-trivial for deep multi-hop cases.

When Not To Use

When no structured KG is available or building one is infeasible.

For ultra-low-latency real-time services where tens of seconds per query are unacceptable.

Failure Modes

Incorrect deductive-verifier judgments lead to premature stopping or false positives.

Path-RAG may still miss correct candidates if KG lacks relation labels or entity surface forms.

Core Entities

Models

gpt-4-turbogpt-3.5-turbogpt-4ogpt-4o-minillama-2-13bmixtral-7bmistral-7b

Metrics

Hits@1F1AccuracyRuntime (s/question)Token usageCoverage Ratio (CR)Validity Ratio (VR)

Datasets

WebQSPComplex WebQuestions (CWQ)CR-LT-KGQAMedQA-USMIE (subset for robustness)

Benchmarks

KGQA (WebQSP, CWQ, CR-LT-KGQA)

Context Entities

Models

OpenAI text-embedding-3-smallSentenceBERTE5BM25

Datasets

Freebase (subgraphs)Wikidata (subgraphs)

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

FiDeLiS improves top-answer accuracy on WebQSP with strong LLMs.

Path-RAG meaningfully increases retrieval quality vs vanilla retrievers.

Results

What To Try In 7 Days

Agent Features

Optimization Features

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

Context Entities

Models

Datasets

You May Also Want to Read

Bi'an: a bilingual RAG hallucination benchmark plus small fine-tuned judge models

Key finding

MultiHal: a multilingual, Wikidata-grounded benchmark that uses KG paths to evaluate and reduce LLM hallucinations

Key finding

DiaHalu: 1,103 multi-turn dialogues to test hallucination in chat-style LLMs

Key finding

An open leaderboard that measures LLM hallucinations across 15 tasks and 20 models

Key finding

LLMs (GPT-3.5, GPT-4, PaLM-2) do not reliably judge factuality on the FRANK benchmark

Key finding