A practical survey showing how knowledge graphs can make LLMs better at complex question answering

May 26, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

Scores reflect a literature survey: the paper synthesizes many published systems and datasets, so conclusions are broad but not backed by a unified experimental protocol.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 0/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Chuangtao Ma, Yongrui Chen, Tianxing Wu, Arijit Khan, Haofen Wang

Why It Matters For Business

Combining KGs with LLMs reduces hallucinations and adds verifiable evidence for high-stakes QA, but it raises compute and maintenance costs, so weigh accuracy and traceability against latency and budget.

Summary TLDR

This is a focused survey that organizes and compares methods that combine large language models (LLMs) with knowledge graphs (KGs) to improve question answering (QA). It proposes a three-role taxonomy (KG as background knowledge, as reasoning guideline, and as refiner/validator), reviews representative systems (GraphRAG, KG-RAG, KG-Adapter, KG-Agent, etc.), summarizes benchmarks and metrics, and highlights practical bottlenecks: costly graph retrieval, knowledge misalignment, and KG incompleteness. The paper ends with concrete optimization ideas (indexing, prompt tuning, cost-aware policies) and research directions for scaling, dynamic updates, and fairness-aware retrieval.

Problem Statement

LLM-based QA is strong on language but struggles with complex, multi-step, time-sensitive, or domain-specific questions due to limited reasoning, outdated parametric knowledge, and hallucinations. How can structured, factual KGs be combined with LLMs to reduce hallucination, improve multi-hop reasoning, and provide explainable evidence while remaining efficient and up-to-date?

Main Contribution

A structured taxonomy that classifies LLM+KG QA methods by QA type and the KG's role: background knowledge, reasoning guideline, refiner/validator, and hybrid.

A systematic survey and comparison of recent representative methods, grouped by the KG role and aligned to complex QA tasks (multi-doc, multi-modal, multi-hop, conversational, explainable, temporal).

Key Findings

Using KGs in three roles (background, guideline, refiner) is the dominant design pattern for combining KGs with LLMs in QA.

Practical Use: When building QA systems, pick a clear KG role early: feed factual subgraphs as background context, use subgraph paths to guide LLM reasoning, or apply KG checks to filter/refine LLM outputs.

Evidence Ref: Abstract; §3 (Taxonomy, §3.1-3.3)

Graph-based RAG (GraphRAG / KG-RAG) retrieves structured subgraphs rather than raw text and improves reasoning and evidence grounding compared to text-only RAG.

Practical Use: If your QA questions need multi-hop or factual chain evidence, replace or augment text retrieval with subgraph retrieval to get better reasoning anchors.

Evidence Ref: §3.1.2 (Graph RAG, KG-RAG) and Table 3 summaries
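
As a rough illustration of subgraph retrieval used as background context, the sketch below expands a few seed entities breadth-first over a toy triple store and serializes the result for a prompt. It is a minimal sketch, not the method of GraphRAG, KG-RAG, or any system in the survey: the triples, the function names (`build_index`, `retrieve_subgraph`, `to_prompt_context`), and the `hops` parameter are all assumptions made for illustration, and entity linking from the question to seed entities is assumed to happen elsewhere.

```python
from collections import defaultdict, deque

# Toy KG as (head, relation, tail) triples; names and facts are illustrative only.
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Poland", "member_of", "European Union"),
    ("Paris", "capital_of", "France"),
]

def build_index(triples):
    """Map each entity to every triple it appears in, as head or tail."""
    index = defaultdict(list)
    for h, r, t in triples:
        index[h].append((h, r, t))
        index[t].append((h, r, t))
    return index

def retrieve_subgraph(index, seeds, hops=2):
    """Breadth-first expansion: collect triples within `hops` edges of the seeds."""
    seen_entities = set(seeds)
    collected = []
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        entity, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop budget
        for triple in index.get(entity, []):
            if triple not in collected:
                collected.append(triple)
            h, _, t = triple
            for neighbor in (h, t):
                if neighbor not in seen_entities:
                    seen_entities.add(neighbor)
                    frontier.append((neighbor, depth + 1))
    return collected

def to_prompt_context(triples):
    """Serialize triples one per line so they can be pasted into an LLM prompt."""
    return "\n".join(f"({h}, {r}, {t})" for h, r, t in triples)
```

Calling `retrieve_subgraph(build_index(TRIPLES), ["Marie Curie"], hops=2)` keeps the two-hop chain through Warsaw to Poland and leaves out unrelated facts such as the Paris triple. The hop budget is the main knob here: larger values give the model more evidence but also more tokens and more retrieval cost.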

What To Try In 7 Days

Prototype KG-augmented retrieval: add a subgraph retrieval step to your RAG pipeline and compare answer correctness on 50 domain questions.

Run a simple KG-based validator: re-check LLM answers against a KG and measure how many answers change or get flagged.

Measure retrieval quality: compute retrieval relevance (MRR/NDCG) and downstream answer quality (accuracy/EM) with and without KG input.
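
The measurements in the last item can be computed with a few small functions. The sketch below is one plain-Python way to do it under stated assumptions: NDCG uses the linear-gain form of DCG, exact match normalizes only case and whitespace, and the function names and input shapes are hypothetical rather than taken from the survey or a specific evaluation library.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mrr(runs):
    """Mean reciprocal rank; `runs` is a list of (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)

def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k with linear gains; `gains` maps id -> graded relevance (missing = 0)."""
    dcg = sum(
        gains.get(item_id, 0) / math.log2(pos + 2)
        for pos, item_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def exact_match(prediction, gold):
    """1.0 if the answers match after lowercasing and collapsing whitespace."""
    def normalize(text):
        return " ".join(text.lower().split())
    return float(normalize(prediction) == normalize(gold))
```

Running the same question set through the pipeline with and without KG input, and comparing these numbers side by side, gives the with/without comparison the item describes.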

Agent Features

Memory: Retrieval memory / vector index; KG as external symbolic memory
Planning: LLM-driven beam/CoT path search; KG-guided question decomposition
Tool Use: KG query executors; Graph traversal agents (KG-Agent, KGP); Indexing and vector DBs
Frameworks: KG-Agent; ODA; PoG
Architectures: LLM + GNN cross-encoder; RAG with subgraph retriever; Agent loop (LLM orchestrator + KG executor)
Collaboration: LLM + KG joint reasoning; LLM agents selecting KG tools

Optimization Features

Token Efficiency: Token-based KG-RAG optimizations (SPOKE-like approaches) to reduce LLM calls
Infra Optimization: Hierarchical graph partitioning and neighborhood expansion; Dynamic path-prior proposal networks for retrieval pruning
Model Optimization: LoRA
System Optimization: Caching subgraphs and intermediate embeddings; Amortized reasoning to avoid repeated KG queries
Training Optimization: Joint LM+GNN pretraining and knowledge-aware fine-tuning; Instruction fine-tuning with KG-derived prompts
Inference Optimization: Index-based retrieval (dynamic/adaptable indices); Prompt-based filtering and CoT-guided filters; Token-call minimization and cost-based policies
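
Subgraph caching, listed above under system optimization, can be approximated with a memoized lookup. The sketch below uses Python's standard `functools.lru_cache`; `query_kg` is a hypothetical stand-in for an expensive KG call, and the `CALLS` counter exists only to make the saved queries visible. It is an illustration of the general idea, not the caching scheme of any surveyed system.

```python
from functools import lru_cache

# Counter used only to show how many lookups actually reach the KG.
CALLS = {"kg": 0}

def query_kg(entity):
    """Hypothetical stand-in for an expensive KG lookup (e.g. a remote graph query)."""
    CALLS["kg"] += 1
    toy_store = {"aspirin": (("aspirin", "treats", "headache"),)}
    return toy_store.get(entity, ())

@lru_cache(maxsize=1024)
def cached_neighbors(entity):
    """Memoized neighborhood lookup; returns a tuple so cached values stay immutable."""
    return query_kg(entity)
```

Repeated calls such as `cached_neighbors("aspirin")` hit the KG once and are served from memory afterwards. When the KG is updated, stale entries need to be dropped, for example with `cached_neighbors.cache_clear()`, which is where this simple approach meets the dynamic-update concerns raised in the survey.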

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

May miss very recent papers due to rapid publication pace (authors note this).

Survey emphasizes taxonomy and qualitative alignment; it underemphasizes head-to-head quantitative comparisons.

When Not To Use

When no reliable KG exists for your domain or KG coverage is very sparse.

When ultra-low latency and very high throughput matter and you cannot afford KG traversal costs.

Failure Modes

Knowledge conflicts between KG facts and LLM parametric facts can cause inconsistent answers.

Outdated or incomplete KGs lead to false negatives in validation and wrongful filtering of correct model outputs.
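
One way to soften the second failure mode is to have the validator distinguish "contradicted by the KG" from "not found in the KG", and to filter only on the former. The sketch below is a minimal illustration under strong assumptions: claims have already been extracted from the LLM answer as (head, relation, tail) triples, the KG is a small in-memory set, and `FUNCTIONAL_RELATIONS` is a hand-written, hypothetical list of relations that allow at most one tail per head. None of these names come from the survey.

```python
from enum import Enum

class Verdict(Enum):
    SUPPORTED = "supported"        # the exact triple is in the KG
    CONTRADICTED = "contradicted"  # the KG holds a different value for a single-valued relation
    UNKNOWN = "unknown"            # the KG is silent; it may simply be incomplete

# Toy KG and relation schema; both are illustrative assumptions.
KG = {
    ("France", "capital", "Paris"),
    ("Marie Curie", "born_in", "Warsaw"),
}
FUNCTIONAL_RELATIONS = {"capital", "born_in"}

def check_claim(claim, kg=KG, functional=FUNCTIONAL_RELATIONS):
    """Classify one (head, relation, tail) claim against the KG."""
    head, relation, tail = claim
    if (head, relation, tail) in kg:
        return Verdict.SUPPORTED
    # Only a single-valued relation with a different stored tail counts as a conflict.
    if relation in functional and any(
        h == head and r == relation for h, r, _ in kg
    ):
        return Verdict.CONTRADICTED
    return Verdict.UNKNOWN

def should_filter(claims):
    """Flag an answer only when at least one claim is contradicted, never on UNKNOWN."""
    return any(check_claim(c) is Verdict.CONTRADICTED for c in claims)
```

With this split, a correct answer about an entity missing from the KG comes back as `UNKNOWN` and passes through, while `("France", "capital", "Lyon")` is flagged. An outdated KG can still produce wrong `CONTRADICTED` verdicts, so this reduces wrongful filtering from incompleteness but does not address staleness.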

Core Entities

Models

GPT-4, GPT-3.5-Turbo, Llama-2, Llama-3, Qwen, Gemma, Vicuna, Zephyr, Mistral

Metrics

Answer Quality, Retrieval Quality, Reasoning Quality, BERTScore, MRR, NDCG, Hop-Acc, Truthfulness Score, Faithfulness Score

Datasets

WebQSP, WQSP, CWQ, HotpotQA, 2WikiMQA, MetaQA, PubMedQA, M3SciQA, FanOutQA, MINTQA, EXAQT

Benchmarks

STaRK, LLM-KG-Bench, OKGQA, XplainLLM, MINTQA, mmRAG

Context Entities

Models

RoBERTa, T5, FLAN-T5, SentenceTransformer

Metrics

Accuracy, Exact Match (EM), F1, ROUGE, BLEU

Datasets

TriviaQA, OBQA, CSQA, BioASQ, MedQA, LiveQA

Benchmarks

FanOutQA, PatQA, TempTabQA