Overview
The approach combines well-known components (sentence embeddings, k-NN, PCST, frozen LLM prompting). Results are consistent across three datasets and include human-checked hallucination metrics, but production tuning (trainable retrieval) and large-scale deployment remain to be tested.
Citations22
Evidence Strength0.70
Confidence0.80
Risk Signals11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/6
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
If you need natural-language queries over large text-rich graphs, G-Retriever scales to huge graphs, speeds training and inference dramatically, and reduces wrong citations by returning the exact subgraph used to answer.
Who Should Care
Summary TLDR
G-Retriever is a retrieval-augmented generation (RAG) system that answers natural-language questions about graphs whose nodes and edges carry text. It encodes nodes/edges with SentenceBert, retrieves relevant pieces via k-NN, then extracts a connected subgraph using a Prize-Collecting Steiner Tree (PCST) algorithm. The subgraph is turned into text and fed to a frozen LLM (Llama2) with a learned graph token. Across three datasets (ExplaGraphs, SceneGraphs, WebQSP) it improves QA accuracy, drastically cuts token and compute costs, and reduces hallucinated citations of graph elements.
Problem Statement
LLMs struggle to answer questions about large textual graphs because (1) flattening whole graphs overflows the LLM context and loses info, and (2) LLMs hallucinate nodes/edges when they cannot access exact graph facts. The paper builds a retrieval pipeline tailored to general textual graphs and a benchmark to measure both QA and hallucination.
Main Contribution
GraphQA benchmark: standardized, multi-domain GraphQA built from ExplaGraphs, SceneGraphs, WebQSP for node/edge QA and multi‑hop questions.
G-Retriever: first RAG design for general textual graphs that returns a connected subgraph via PCST and feeds it as a soft graph prompt to a frozen LLM.
Key Findings
G-Retriever lifts WebQSP Hit@1 from 57.05% (GraphToken) to 70.49% with frozen LLM prompt tuning and to 73.79% with LoRA tuning.
Graph-aware retrieval cuts textual graph size massively and speeds training: SceneGraphs tokens ↓83%, nodes ↓74%, time ↓29%; WebQSP tokens ↓99%, nodes ↓99%, time ↓67%.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| WebQSP Hit@1 (Frozen LLM w/ prompt tuning) | 70.49 ± 1.21 | GraphToken 57.05 ± 0.74 | +13.44pp | WebQSP validation/test | Table 3; Section 6.2 | Table 3 |
| LoRA | 73.79 ± 0.70 | LoRA without G-Retriever 66.03 ± 0.47 | +7.76pp | WebQSP validation/test | Table 3; Section 6.2 | Table 3 |
What To Try In 7 Days
Index a small textual graph with SentenceBert, run k-NN, and extract a PCST subgraph to see token savings.
Add a single learned graph token (soft prompt) to a frozen LLM and feed the textualized subgraph plus question.
Manually evaluate 50 answers for cited node/edge correctness to measure hallucination reduction.
Optimization Features
Token Efficiency
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Retrieval is static; retrieval model is not jointly trained with the LLM.
PCST adds complexity and hyperparameters (k, edge cost) that need tuning per dataset.
When Not To Use
When graphs are small enough to fit wholly in LLM context (no retrieval needed).
For non-textual graphs (no node/edge text) without a plan to add textual attributes.
Failure Modes
Important facts omitted if k is set too small, leading to wrong answers.
Noisy or semantically poor embeddings can retrieve irrelevant nodes, hurting answers.

