Use RAG + PCST to let LLMs 'chat' with very large textual graphs

Overview

Decision Snapshot: Needs Validation

The approach combines well-known components (sentence embeddings, k-NN, PCST, frozen LLM prompting). Results are consistent across three datasets and include human-checked hallucination metrics, but production tuning (trainable retrieval) and large-scale deployment remain to be tested.

Citations: 22

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V. Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, Bryan Hooi


Why It Matters For Business

If you need natural-language queries over large text-rich graphs, G-Retriever handles graphs that exceed the LLM context window, cuts input tokens by up to 99% and per-epoch training time by up to 67% (WebQSP), and reduces wrong citations by returning the exact subgraph used to answer.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

G-Retriever is a retrieval-augmented generation (RAG) system that answers natural-language questions about graphs whose nodes and edges carry text. It encodes nodes/edges with SentenceBert, retrieves relevant pieces via k-NN, then extracts a connected subgraph using a Prize-Collecting Steiner Tree (PCST) algorithm. The subgraph is turned into text and fed to a frozen LLM (Llama2) with a learned graph token. Across three datasets (ExplaGraphs, SceneGraphs, WebQSP) it improves QA accuracy, drastically cuts token and compute costs, and reduces hallucinated citations of graph elements.
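The k-NN retrieval step described above can be illustrated with a minimal sketch. It assumes node texts have already been embedded; the 3-dimensional vectors, the node names, and the `top_k` helper are hypothetical stand-ins, not the paper's code or real SentenceBert output.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(query_vec, item_vecs, k):
    """Return the ids of the k items most similar to the query, best first."""
    ranked = sorted(item_vecs, key=lambda i: cosine(query_vec, item_vecs[i]), reverse=True)
    return ranked[:k]

# Toy 3-d vectors standing in for sentence embeddings (hypothetical values).
node_vecs = {
    "paris": [0.9, 0.1, 0.0],
    "france": [0.8, 0.3, 0.1],
    "banana": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded question

retrieved = top_k(query_vec, node_vecs, k=2)
print(retrieved)
```

The same ranking would be applied to edge embeddings; in practice an approximate nearest-neighbor index would likely replace the full sort on large graphs.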

Problem Statement

LLMs struggle to answer questions about large textual graphs because (1) flattening whole graphs overflows the LLM context and loses info, and (2) LLMs hallucinate nodes/edges when they cannot access exact graph facts. The paper builds a retrieval pipeline tailored to general textual graphs and a benchmark to measure both QA and hallucination.

Main Contribution

GraphQA benchmark: a standardized, multi-domain benchmark built from ExplaGraphs, SceneGraphs, and WebQSP, covering node/edge QA and multi-hop questions.

G-Retriever: first RAG design for general textual graphs that returns a connected subgraph via PCST and feeds it as a soft graph prompt to a frozen LLM.
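The connected-subgraph step can be written as an optimization problem. The form below is a sketch of a prize-collecting Steiner tree objective, assuming each node and edge receives a relevance prize and every edge carries the same cost $C_e$; the symbol names are illustrative rather than quoted from the paper.

```latex
S^{*} = \underset{S \subseteq G,\ S \text{ connected}}{\arg\max}
  \; \sum_{n \in V_S} \mathrm{prize}(n)
  + \sum_{e \in E_S} \mathrm{prize}(e)
  - |E_S| \cdot C_e
```

Under this reading, a larger $C_e$ favors smaller subgraphs, while a smaller one admits more marginally relevant nodes and edges.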

Key Findings

G-Retriever lifts WebQSP Hit@1 from 57.05% (GraphToken) to 70.49% with frozen LLM prompt tuning and to 73.79% with LoRA tuning.

Numbers: WebQSP: GraphToken 57.05% → G-Retriever 70.49% → G-Retriever+LoRA 73.79%

Practical Use: Use graph-aware RAG plus soft prompting or LoRA to get large accuracy gains on multi-hop KG-style QA.

Evidence Ref: Table 3; Table 8

Graph-aware retrieval cuts textual graph size massively and speeds training: SceneGraphs tokens ↓83%, nodes ↓74%, time ↓29%; WebQSP tokens ↓99%, nodes ↓99%, time ↓67%.

Numbers: SceneGraphs tokens -83%, nodes -74%, time -29%; WebQSP tokens -99%, nodes -99%, time -67%

Practical Use: If your graphs exceed the LLM context, retrieve a small connected subgraph to make LLM queries tractable and much faster.

Evidence Ref: Table 4; Section 6.3
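To measure the same kind of reduction on your own graph, the arithmetic is a plain relative difference. The sketch below uses hypothetical before/after counts chosen only to illustrate the calculation; they are not the paper's raw numbers.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before

# Hypothetical token counts: full textualized graph vs. retrieved subgraph.
tokens_before, tokens_after = 100_000, 1_000
reduction = percent_reduction(tokens_before, tokens_after)
print(f"tokens reduced by {reduction:.0f}%")
```

The same function applies to node counts and per-epoch training time.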

Results

Metric	Value (%)	Baseline (%)	Delta	Split / Dataset	Evidence	Evidence Ref
WebQSP Hit@1 (G-Retriever, frozen LLM w/ prompt tuning)	70.49 ± 1.21	GraphToken 57.05 ± 0.74	+13.44pp	WebQSP validation/test	Table 3; Section 6.2	Table 3
WebQSP Hit@1 (G-Retriever + LoRA)	73.79 ± 0.70	LoRA without G-Retriever 66.03 ± 0.47	+7.76pp	WebQSP validation/test	Table 3; Section 6.2	Table 3
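For readers reproducing these figures, Hit@1 can be computed as sketched below. This assumes the common definition (the top prediction counts as a hit if it appears in the gold answer set); the paper's exact matching rule may differ, and the predictions here are toy data.

```python
def hit_at_1(predictions, gold_answers):
    """Percentage of questions whose top prediction is in the gold answer set."""
    if len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold_answers must have equal length")
    hits = sum(1 for pred, gold in zip(predictions, gold_answers) if pred in gold)
    return 100.0 * hits / len(predictions)

# Toy data (hypothetical): one top prediction and one gold set per question.
preds = ["paris", "berlin", "rome", "oslo"]
gold = [{"paris"}, {"munich"}, {"rome", "milan"}, {"oslo"}]
score = hit_at_1(preds, gold)  # 3 of 4 questions are hits

# Delta in percentage points between two reported Hit@1 values from the table.
delta_pp = round(70.49 - 57.05, 2)
print(score, delta_pp)
```

Deltas are reported in percentage points (pp), that is, as a simple difference of the two percentages rather than a relative change.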

What To Try In 7 Days

Index a small textual graph with SentenceBert, run k-NN, and extract a PCST subgraph to see token savings.

Add a single learned graph token (soft prompt) to a frozen LLM and feed the textualized subgraph plus question.

Manually evaluate 50 answers for cited node/edge correctness to measure hallucination reduction.
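The manual check in the last step can be partly automated once the nodes and edges an answer cites have been extracted. The sketch below assumes that extraction has already been done by hand or by a parser; `citation_validity` and the scene-graph data are hypothetical, loosely mirroring the valid-nodes, valid-edges, and fully-valid-graph metrics.

```python
def citation_validity(cited_nodes, cited_edges, graph_nodes, graph_edges):
    """Return (valid node fraction, valid edge fraction, fully valid flag)."""
    valid_nodes = [n for n in cited_nodes if n in graph_nodes]
    valid_edges = [e for e in cited_edges if e in graph_edges]
    # An answer that cites nothing is treated as trivially valid.
    node_frac = len(valid_nodes) / len(cited_nodes) if cited_nodes else 1.0
    edge_frac = len(valid_edges) / len(cited_edges) if cited_edges else 1.0
    fully_valid = node_frac == 1.0 and edge_frac == 1.0
    return node_frac, edge_frac, fully_valid

# Ground-truth graph (toy example).
graph_nodes = {"dog", "ball", "grass"}
graph_edges = {("dog", "chases", "ball"), ("ball", "on", "grass")}

# Elements cited in one model answer; "frisbee" is hallucinated.
cited_nodes = ["dog", "ball", "frisbee"]
cited_edges = [("dog", "chases", "ball"), ("dog", "holds", "frisbee")]

node_frac, edge_frac, fully_valid = citation_validity(
    cited_nodes, cited_edges, graph_nodes, graph_edges
)
print(node_frac, edge_frac, fully_valid)
```

Averaging these values over the 50 sampled answers, with and without retrieval, gives a rough before/after comparison.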

Optimization Features

Token Efficiency

PCST-based retrieval reduces tokenized graph size by up to 99% on WebQSP

Model Optimization

LoRA

System Optimization

Use k-NN on SentenceBert embeddings and PCST to cut compute and epoch time

Training Optimization

Prompt tuning for frozen LLMs to avoid full fine-tuning

Inference Optimization

Retrieve small subgraphs to reduce LLM input tokens
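Once a small subgraph is retrieved, it still has to be turned into text for the LLM. A minimal sketch follows, assuming a CSV-like layout of nodes followed by edges; the column names, the prompt template, and the `textualize` and `build_prompt` helpers are illustrative assumptions, not the project's exact format.

```python
def textualize(nodes, edges):
    """Render a subgraph as two CSV-like blocks: nodes first, then edges."""
    lines = ["node_id,node_attr"]
    lines += [f"{node_id},{attr}" for node_id, attr in sorted(nodes.items())]
    lines.append("src,edge_attr,dst")
    lines += [f"{src},{attr},{dst}" for src, attr, dst in edges]
    return "\n".join(lines)

def build_prompt(nodes, edges, question):
    """Concatenate the textualized subgraph and the question."""
    return f"{textualize(nodes, edges)}\n\nQuestion: {question}\nAnswer:"

# Toy subgraph (hypothetical): node id -> text, and (src, relation, dst) triples.
nodes = {0: "alice", 1: "bob"}
edges = [(0, "sibling", 1)]

prompt = build_prompt(nodes, edges, "who is the sibling of alice?")
print(prompt)
```

Attributes that contain commas would need quoting or a different delimiter; this sketch ignores that case.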

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/XiaoxinHe/G-Retriever

Data URLs

https://github.com/XiaoxinHe/G-Retriever (benchmark and processing scripts)

Risks & Boundaries

Limitations

Retrieval is static; retrieval model is not jointly trained with the LLM.

PCST adds complexity and hyperparameters (k, edge cost) that need tuning per dataset.
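To see why the edge cost needs tuning, the sketch below solves a simplified prize-collecting problem by brute force on a four-node path graph. It assumes node prizes only and a uniform edge cost, and it enumerates every node subset, so it is only usable on toy graphs; it is not the solver used in the paper, and the prizes are hypothetical.

```python
from itertools import combinations

def is_connected(nodes, edges):
    """True if `nodes` form one connected component using only edges among them."""
    nodes = set(nodes)
    if not nodes:
        return False
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(adjacency[current] - seen)
    return seen == nodes

def best_subgraph(prizes, edges, edge_cost):
    """Exhaustively pick the connected node set maximizing prizes minus tree cost."""
    best_nodes, best_score = set(), 0.0
    all_nodes = sorted(prizes)
    for size in range(1, len(all_nodes) + 1):
        for subset in combinations(all_nodes, size):
            if not is_connected(subset, edges):
                continue
            # A spanning tree over `size` nodes uses size - 1 edges.
            score = sum(prizes[n] for n in subset) - (size - 1) * edge_cost
            if score > best_score:
                best_nodes, best_score = set(subset), score
    return best_nodes, best_score

# Path graph a - b - c - d; only "a" and "c" carry retrieval prizes.
prizes = {"a": 2.0, "b": 0.0, "c": 1.0, "d": 0.0}
edges = [("a", "b"), ("b", "c"), ("c", "d")]

cheap_nodes, _ = best_subgraph(prizes, edges, edge_cost=0.4)
costly_nodes, _ = best_subgraph(prizes, edges, edge_cost=0.6)
print(sorted(cheap_nodes), sorted(costly_nodes))
```

With the lower edge cost the zero-prize bridge node "b" is kept so that "a" and "c" stay connected; with the higher cost the second relevant node is dropped entirely, which is the kind of omission that a poorly tuned cost can cause.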

When Not To Use

When graphs are small enough to fit wholly in LLM context (no retrieval needed).

For non-textual graphs (no node/edge text) without a plan to add textual attributes.

Failure Modes

Important facts omitted if k is set too small, leading to wrong answers.

Noisy or semantically poor embeddings can retrieve irrelevant nodes, hurting answers.

Core Entities

Models

Llama2-7b, Llama2-13b, SentenceBert, Graph Transformer, GAT, GCN, LoRA

Metrics

Hit@1, Accuracy, Valid Nodes, Valid Edges, Fully Valid Graphs, Tokens reduction, Training time per epoch

Datasets

ExplaGraphs, SceneGraphs, WebQSP, GraphQA (this work)

Benchmarks

GraphQA

