Use RAG + PCST to let LLMs 'chat' with very large textual graphs

February 12, 20248 min

Overview

Decision SnapshotNeeds Validation

The approach combines well-known components (sentence embeddings, k-NN, PCST, frozen LLM prompting). Results are consistent across three datasets and include human-checked hallucination metrics, but production tuning (trainable retrieval) and large-scale deployment remain to be tested.

Citations22

Evidence Strength0.70

Confidence0.80

Risk Signals11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V. Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, Bryan Hooi

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you need natural-language queries over large text-rich graphs, G-Retriever scales to huge graphs, speeds training and inference dramatically, and reduces wrong citations by returning the exact subgraph used to answer.

Who Should Care

Summary TLDR

G-Retriever is a retrieval-augmented generation (RAG) system that answers natural-language questions about graphs whose nodes and edges carry text. It encodes nodes/edges with SentenceBert, retrieves relevant pieces via k-NN, then extracts a connected subgraph using a Prize-Collecting Steiner Tree (PCST) algorithm. The subgraph is turned into text and fed to a frozen LLM (Llama2) with a learned graph token. Across three datasets (ExplaGraphs, SceneGraphs, WebQSP) it improves QA accuracy, drastically cuts token and compute costs, and reduces hallucinated citations of graph elements.

Problem Statement

LLMs struggle to answer questions about large textual graphs because (1) flattening whole graphs overflows the LLM context and loses info, and (2) LLMs hallucinate nodes/edges when they cannot access exact graph facts. The paper builds a retrieval pipeline tailored to general textual graphs and a benchmark to measure both QA and hallucination.

Main Contribution

GraphQA benchmark: standardized, multi-domain GraphQA built from ExplaGraphs, SceneGraphs, WebQSP for node/edge QA and multi‑hop questions.

G-Retriever: first RAG design for general textual graphs that returns a connected subgraph via PCST and feeds it as a soft graph prompt to a frozen LLM.

Key Findings

G-Retriever lifts WebQSP Hit@1 from 57.05% (GraphToken) to 70.49% with frozen LLM prompt tuning and to 73.79% with LoRA tuning.

NumbersWebQSP: GraphToken 57.05% → G-Retriever 70.49% → G-Retriever+LoRA 73.79%

Practical UseUse graph-aware RAG plus soft prompting or LoRA to get large accuracy gains on multi-hop KG-style QA.

Evidence RefTable 3; Table 8

Graph-aware retrieval cuts textual graph size massively and speeds training: SceneGraphs tokens ↓83%, nodes ↓74%, time ↓29%; WebQSP tokens ↓99%, nodes ↓99%, time ↓67%.

NumbersSceneGraphs tokens -83%, nodes -74%, time -29%; WebQSP tokens -99%, nodes -99%, time -67%

Practical UseIf your graphs exceed the LLM context, retrieve a small connected subgraph to make LLM queries tractable and much faster.

Evidence RefTable 4; Section 6.3

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
WebQSP Hit@1 (Frozen LLM w/ prompt tuning)70.49 ± 1.21GraphToken 57.05 ± 0.74+13.44ppWebQSP validation/testTable 3; Section 6.2Table 3
LoRA73.79 ± 0.70LoRA without G-Retriever 66.03 ± 0.47+7.76ppWebQSP validation/testTable 3; Section 6.2Table 3

What To Try In 7 Days

Index a small textual graph with SentenceBert, run k-NN, and extract a PCST subgraph to see token savings.

Add a single learned graph token (soft prompt) to a frozen LLM and feed the textualized subgraph plus question.

Manually evaluate 50 answers for cited node/edge correctness to measure hallucination reduction.

Optimization Features

Token Efficiency
PCST-based retrieval reduces tokenized graph size by up to 99% on WebQSP
Model Optimization
LoRA
System Optimization
Use k-NN on SentenceBert embeddings and PCST to cut compute and epoch time
Training Optimization
Prompt tuning for frozen LLMs to avoid full fine-tuning
Inference Optimization
Retrieve small subgraphs to reduce LLM input tokens

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusYes
LicenseUnknown

Risks & Boundaries

Limitations

Retrieval is static; retrieval model is not jointly trained with the LLM.

PCST adds complexity and hyperparameters (k, edge cost) that need tuning per dataset.

When Not To Use

When graphs are small enough to fit wholly in LLM context (no retrieval needed).

For non-textual graphs (no node/edge text) without a plan to add textual attributes.

Failure Modes

Important facts omitted if k is set too small, leading to wrong answers.

Noisy or semantically poor embeddings can retrieve irrelevant nodes, hurting answers.

Core Entities

Models

Llama2-7bLlama2-13bSentenceBertGraph TransformerGATGCNLoRA

Metrics

Hit@1AccuracyValid NodesValid EdgesFully Valid GraphsTokens reductionTraining time per epoch

Datasets

ExplaGraphsSceneGraphsWebQSPGraphQA (this work)

Benchmarks

GraphQA