RoG: Ground LLM plans on knowledge‑graph relation paths for faithful, interpretable KGQA

Overview

Decision Snapshot: Needs Validation

RoG shows clear gains for KG-backed QA by forcing LLMs to plan with relation paths and reason over retrieved KG instances; it is practical when a high-quality KG and entity linking exist, but requires KG preprocessing and moderate GPU resources.

Citations: 38

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, Shirui Pan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RoG reduces hallucinations by grounding LLM reasoning in KG facts and provides traceable, human-readable paths—this improves accuracy and trust on KG-backed QA without retraining every LLM.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

RoG is a planning–retrieval–reasoning method that forces LLMs to produce relation-path plans grounded in a knowledge graph (KG), retrieves matching reasoning paths from the KG, and then uses those paths as context for final answers. The method is trained with two instruction-tuning tasks (planning and retrieval-reasoning), is plug-and-play at inference (the planning module can be used with other LLMs), and yields state-of-the-art results on the KGQA benchmarks WebQSP and CWQ while producing human-readable, KG-grounded explanations. Code and weights are released.

Problem Statement

LLMs hallucinate and lack up-to-date facts during multi-hop reasoning. Prior KG+LLM approaches either generate brittle logical queries or treat KGs as loose text stores and ignore KG structure. We need a way to make LLM reasoning faithful to KG facts and interpretable by exposing KG relation paths.

Main Contribution

Introduce RoG: a planning–retrieval–reasoning pipeline that uses relation paths as KG-grounded plans.

Two-task instruction tuning: (1) planning optimization to teach LLMs to output KG relation paths, (2) retrieval-reasoning optimization to make LLMs reason over retrieved KG paths.

Key Findings

RoG sets new best scores on standard KGQA benchmarks.

Numbers: WebQSP Hits@1 85.7; F1 70.8. CWQ Hits@1 62.6; F1 56.2.

Practical Use: If you have a KG-backed QA task, RoG gives higher answer accuracy and more interpretable outputs than prior KGQA/LLM baselines on evaluated datasets.

Evidence Ref: Table 1

Grounding LLM plans with KG relation paths markedly improves off-the-shelf LLMs.

Numbers: ChatGPT Hits@1 66.77 → +RoG 81.51; Flan‑T5 30.95 → +RoG 67.87.

Practical Use: You can boost existing LLMs quickly by supplying KG-derived reasoning paths as context, without full model retraining.

Evidence Ref: Table 3
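
Supplying KG-derived reasoning paths as context can be illustrated with a minimal sketch. The prompt wording, the `format_path` and `build_prompt` helpers, and the toy triples are assumptions made for illustration only; they are not the template or data used by the authors.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def format_path(path: List[Triple]) -> str:
    """Render one reasoning path as 'e0 -> r1 -> e1 -> r2 -> e2'."""
    if not path:
        return ""
    parts = [path[0][0]]
    for _, relation, tail in path:
        parts.extend([relation, tail])
    return " -> ".join(parts)


def build_prompt(question: str, paths: List[List[Triple]]) -> str:
    """Assemble a prompt asking the LLM to answer from the listed paths (hypothetical wording)."""
    lines = ["Reasoning paths:"]
    lines += [format_path(p) for p in paths]
    lines += ["", f"Question: {question}", "Answer using only the paths above."]
    return "\n".join(lines)


# Toy data, not taken from Freebase or the paper.
paths = [[("Alice", "works_for", "Acme"), ("Acme", "headquartered_in", "Berlin")]]
prompt = build_prompt("Where is Alice's employer headquartered?", paths)
print(prompt)
```

The resulting string can be sent to whichever LLM is already in production, which is what makes the approach attractive when retraining is not an option.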

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Hits@1	85.7	DECAF 82.1 (reported)	+4.4% rel	WebQSP	RoG Hits@1 85.7 vs DECAF 82.1 (Table 1)	Table 1
Hits@1	62.6	UniKGQA 51.2	+22.3% rel	CWQ	RoG 62.6 vs UniKGQA 51.2 (Table 1)	Table 1
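
The Delta column is a relative gain over the baseline rather than a difference in points. A short check, using only the Hits@1 values from the table above (the `relative_gain` helper name is ours):

```python
def relative_gain(value: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# (dataset, RoG Hits@1, baseline Hits@1) as listed in the Results table.
rows = [("WebQSP", 85.7, 82.1), ("CWQ", 62.6, 51.2)]
for dataset, rog, base in rows:
    print(f"{dataset}: +{rog - base:.1f} points, +{relative_gain(rog, base):.1f}% relative")
# WebQSP: +3.6 points, +4.4% relative
# CWQ: +11.4 points, +22.3% relative
```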

What To Try In 7 Days

Run the RoG planning module to generate relation-path plans from a small company KG, feed the retrieved paths to your production LLM, and compare answer accuracy with and without them.

Benchmark K=1..5 to find the retrieval size that balances latency and precision for your use case (paper uses K=3).

Fine-tune a LLaMA2-style model on your KG QA pairs with RoG’s planning + retrieval-reasoning tasks to create a faster transfer model for similar KGs.
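
The K=1..5 benchmark suggested above could be run with a small harness like the sketch below. `answer_fn` is a hypothetical hook standing in for your own plan-retrieve-answer pipeline, and `toy_answer_fn` and the one-question dataset exist only so the example runs; none of these names come from the RoG codebase.

```python
import time
from typing import Callable, Dict, List, Set, Tuple


def sweep_k(
    answer_fn: Callable[[str, int], List[str]],
    dataset: List[Tuple[str, Set[str]]],
    ks=range(1, 6),
) -> Dict[int, Dict[str, float]]:
    """For each K, record Hits@1 and mean wall-clock latency per question."""
    report = {}
    for k in ks:
        hits, elapsed = 0, 0.0
        for question, gold in dataset:
            start = time.perf_counter()
            predictions = answer_fn(question, k)  # ranked answers, best first
            elapsed += time.perf_counter() - start
            if predictions and predictions[0] in gold:
                hits += 1
        report[k] = {
            "hits@1": hits / len(dataset),
            "latency_s": elapsed / len(dataset),
        }
    return report


def toy_answer_fn(question: str, k: int) -> List[str]:
    # Hypothetical stand-in: pretends the right path is only found when k >= 2.
    return ["Berlin"] if k >= 2 else ["Paris"]


dataset = [("Where is Acme headquartered?", {"Berlin"})]
report = sweep_k(toy_answer_fn, dataset)
for k, stats in report.items():
    print(k, stats["hits@1"], f"{stats['latency_s']:.6f}")
```

On real data, plotting Hits@1 against latency per K makes the trade-off visible and shows whether the paper's K=3 is also a sensible default for your KG.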

Agent Features

Memory

uses knowledge graph as external factual memory

Planning

relation-path planning (KG-grounded plan generation)

Tool Use

constrained BFS KG retrieval

FiD fusion for multi-path reasoning
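
A constrained breadth-first retrieval can be sketched as follows: starting from the question entity, each hop only follows edges whose relation matches the next step of the plan. This is a simplified illustration over an in-memory triple list; `build_index`, `retrieve_paths`, and the toy triples are assumptions, not the authors' implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def build_index(triples: List[Triple]) -> Dict[Tuple[str, str], List[str]]:
    """Index tails by (head, relation) so each hop is a dictionary lookup."""
    index = defaultdict(list)
    for head, relation, tail in triples:
        index[(head, relation)].append(tail)
    return index


def retrieve_paths(index, start: str, relation_path: List[str]) -> List[List[Triple]]:
    """Breadth-first expansion that only follows the next relation in the plan."""
    if not relation_path:
        return []
    frontier: List[Tuple[str, List[Triple]]] = [(start, [])]
    for relation in relation_path:
        next_frontier = []
        for entity, path in frontier:
            for tail in index.get((entity, relation), []):
                next_frontier.append((tail, path + [(entity, relation, tail)]))
        frontier = next_frontier  # empty if the plan cannot be grounded
    return [path for _, path in frontier]


# Toy KG, not taken from Freebase.
triples = [
    ("Alice", "works_for", "Acme"),
    ("Acme", "headquartered_in", "Berlin"),
    ("Alice", "born_in", "Paris"),
]
found = retrieve_paths(build_index(triples), "Alice", ["works_for", "headquartered_in"])
print(found)
```

Because the frontier is pruned by relation at every hop, the `born_in` edge is never explored. Only complete groundings of the plan are returned.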

Frameworks

planning–retrieval–reasoning pipeline

Architectures

decoder-only LLM (LLaMA2-Chat-7B)

Optimization Features

Token Efficiency

not explicitly optimized

Infra Optimization

trained on 2x A100-80G GPUs for 38 hours (Freebase)

Model Optimization

instruction fine-tuning on planning and reasoning tasks

System Optimization

transfer fine-tuning to a new KG takes only 2 hours once base training is done

Training Optimization

joint training on two tasks (planning and retrieval-reasoning)

batch size 4, lr 2e-5, cosine scheduler, 3 epochs
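
For reference, the listed hyperparameters can be turned into a learning-rate curve as sketched below. The summary does not say whether warmup is used, what the minimum learning rate is, or whether the batch size is per device, so this sketch assumes no warmup, decay to zero, and a global batch size of 4. `num_examples` is a made-up figure for illustration.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 2e-5, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


num_examples = 1000            # hypothetical training-set size
batch_size, epochs = 4, 3      # values listed above
total_steps = math.ceil(num_examples / batch_size) * epochs

for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```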

Inference Optimization

beam search to generate top-K relation paths (K=3)

constrained BFS retrieval then FiD-style reasoning
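
Generating top-K relation paths with beam search can be illustrated with the toy sketch below. It scores whole relations rather than LLM tokens, which is a simplification, and `toy_model` is a hand-written probability table standing in for a fine-tuned planner. All names and numbers here are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

END = "<end>"  # marker that closes a relation path
Path = Tuple[str, ...]


def beam_search(
    next_log_probs: Callable[[Path], Dict[str, float]],
    k: int = 3,
    max_hops: int = 3,
) -> List[Tuple[Path, float]]:
    """Return up to k finished relation paths with the highest log-probability."""
    beams: List[Tuple[Path, float]] = [((), 0.0)]
    finished: List[Tuple[Path, float]] = []
    for _ in range(max_hops + 1):
        candidates = []
        for prefix, score in beams:
            for token, log_p in next_log_probs(prefix).items():
                if token == END:
                    finished.append((prefix, score + log_p))
                else:
                    candidates.append((prefix + (token,), score + log_p))
        # Keep only the k best unfinished prefixes for the next hop.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)[:k]


def toy_model(prefix: Path) -> Dict[str, float]:
    # Hand-written next-relation distribution; a real planner would be an LLM.
    table = {
        (): {"works_for": math.log(0.6), "born_in": math.log(0.4)},
        ("works_for",): {"headquartered_in": math.log(0.7), END: math.log(0.3)},
        ("works_for", "headquartered_in"): {END: 0.0},
        ("born_in",): {END: 0.0},
    }
    return table.get(prefix, {END: 0.0})


plans = beam_search(toy_model, k=3)
for path, score in plans:
    print(" -> ".join(path), f"{math.exp(score):.2f}")
```

Each returned plan would then be grounded in the KG by the retrieval step, so a larger K trades extra retrieval work for a better chance that at least one plan is valid.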

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/RManLuo/reasoning-on-graphs

Data URLs

https://github.com/RManLuo/reasoning-on-graphs (training data and subgraph construction described in Appendix)

Risks & Boundaries

Limitations

Requires a linked KG and correct entity linking; missing or wrong KG facts reduce effectiveness.

Retrieval cost grows with number of plans; latency and noise increase for large K.

When Not To Use

You have no suitable KG or entity linking pipeline.

You need very low-latency responses and cannot afford constrained-BFS retrieval.

Failure Modes

Noisy or irrelevant retrieved paths increase false positives and lower precision.

LLM may still generate incorrect plans if not sufficiently fine-tuned on relation names.

Core Entities

Models

RoG, LLaMA2-Chat-7B, ChatGPT, Flan-T5-xl, Alpaca-7B

Metrics

Hits@1, F1, Precision, Recall
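
These metrics can be computed per question as in the sketch below, using the standard set-based definitions. The paper's evaluation script may normalize or match answer strings differently, so treat this as an approximation. The function names and example answers are ours.

```python
from typing import List, Set, Tuple


def hits_at_1(ranked: List[str], gold: Set[str]) -> float:
    """1.0 if the top-ranked prediction is a gold answer, else 0.0."""
    return 1.0 if ranked and ranked[0] in gold else 0.0


def precision_recall_f1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 between predicted and gold answers."""
    overlap = len(predicted & gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


ranked = ["Berlin", "Munich", "Paris"]   # model output, best first
gold = {"Berlin", "Hamburg"}             # reference answers
print(hits_at_1(ranked, gold))
print(precision_recall_f1(set(ranked), gold))
```

In this example Hits@1 is 1.0 while F1 is only 0.4, which shows how noisy extra predictions lower precision even when the top answer is right.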

Datasets

WebQSP, Complex WebQuestions (CWQ), MetaQA-3hop, Freebase, Wiki-Movies (subset for MetaQA)

Benchmarks

KGQA (WebQSP, CWQ)

