A practical survey showing how knowledge graphs can make LLMs better at complex question answering

Overview

Decision Snapshot: Needs Validation

Scores reflect a literature survey: the paper synthesizes many published systems and datasets, so conclusions are broad but not backed by a unified experimental protocol.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 0/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Chuangtao Ma, Yongrui Chen, Tianxing Wu, Arijit Khan, Haofen Wang

Links

Abstract / PDF

Why It Matters For Business

Combining KGs with LLMs reduces hallucinations and adds verifiable evidence for high-stakes QA, but it raises compute and maintenance costs. Teams must trade accuracy and traceability against latency and budget.

Who Should Care

ML Engineer, Product Manager, CTO

Summary TLDR

This is a focused survey that organizes and compares methods that combine large language models (LLMs) with knowledge graphs (KGs) to improve question answering (QA). It proposes a three-role taxonomy (KG as background knowledge, as reasoning guideline, and as refiner/validator), reviews representative systems (GraphRAG, KG-RAG, KG-Adapter, KG-Agent, etc.), summarizes benchmarks and metrics, and highlights practical bottlenecks: costly graph retrieval, knowledge misalignment, and KG incompleteness. The paper ends with concrete optimization ideas (indexing, prompt tuning, cost-aware policies) and research directions for scaling, dynamic updates, and fairness-aware retrieval.

Problem Statement

LLM-based QA is strong on language but struggles with complex, multi-step, time-sensitive, or domain-specific questions due to limited reasoning, outdated parametric knowledge, and hallucinations. How can structured, factual KGs be combined with LLMs to reduce hallucination, improve multi-hop reasoning, and provide explainable evidence while remaining efficient and up-to-date?

Main Contribution

A structured taxonomy that classifies LLM+KG QA methods by QA type and the KG's role: background knowledge, reasoning guideline, refiner/validator, and hybrid.

A systematic survey and comparison of recent representative methods, grouped by the KG role and aligned to complex QA tasks (multi-doc, multi-modal, multi-hop, conversational, explainable, temporal).

Key Findings

Using KGs in three roles (background, guideline, refiner) is the dominant design pattern for combining KGs with LLMs in QA.

Practical Use: When building QA systems, pick a clear KG role early: feed factual subgraphs as background context, use subgraph paths to guide LLM reasoning, or apply KG checks to filter/refine LLM outputs.

Evidence Ref: Abstract; §3 (Taxonomy) and §3.1-3.3

Graph-based RAG (GraphRAG / KG-RAG) retrieves structured subgraphs rather than raw text and improves reasoning and evidence grounding compared to text-only RAG.

Practical Use: If your QA questions need multi-hop or factual chain evidence, replace or augment text retrieval with subgraph retrieval to get better reasoning anchors.

Evidence Ref: §3.1.2 (Graph RAG, KG-RAG) and Table 3 summaries
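To make the subgraph-retrieval idea concrete, here is a minimal sketch of k-hop expansion over a triple store, followed by linearizing the result into prompt text. It is not the method of any system named in the survey: the toy triples and the helper names `build_index`, `k_hop_subgraph`, and `linearize` are assumptions made for illustration, and a real pipeline would add entity linking and relevance pruning.

```python
from collections import defaultdict, deque

# Hypothetical toy triples (subject, relation, object) for illustration only.
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Marie Curie", "field", "Physics"),
    ("Poland", "continent", "Europe"),
]

def build_index(triples):
    """Map each entity to every triple it appears in, as subject or object."""
    index = defaultdict(list)
    for s, r, o in triples:
        index[s].append((s, r, o))
        index[o].append((s, r, o))
    return index

def k_hop_subgraph(index, seeds, k):
    """Breadth-first expansion: collect triples within k hops of the seeds."""
    seen_entities = set(seeds)
    seen_triples = set()
    subgraph = []
    frontier = deque((entity, 0) for entity in seeds)
    while frontier:
        entity, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond the hop budget
        for triple in index.get(entity, []):
            if triple not in seen_triples:
                seen_triples.add(triple)
                subgraph.append(triple)
            s, _, o = triple
            for neighbor in (s, o):
                if neighbor not in seen_entities:
                    seen_entities.add(neighbor)
                    frontier.append((neighbor, depth + 1))
    return subgraph

def linearize(subgraph):
    """Render triples as text lines that can be placed in an LLM prompt."""
    return "\n".join(f"({s}) -[{r}]-> ({o})" for s, r, o in subgraph)

index = build_index(TRIPLES)
print(linearize(k_hop_subgraph(index, ["Marie Curie"], k=2)))
```

The hop budget `k` is the main cost lever here: larger values give more reasoning context but grow the prompt and the traversal work.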

What To Try In 7 Days

Prototype KG-augmented retrieval: add a subgraph retrieval step to your RAG pipeline and compare answer correctness on 50 domain questions.

Run a simple KG-based validator: re-check LLM answers against a KG and measure how many answers change or get flagged.

Measure retrieval quality: compute retrieval relevance (MRR/NDCG) and downstream answer quality (accuracy/EM) with and without KG input.
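The measurements in this checklist can be sketched in a few lines of plain Python. The block below is an illustrative implementation of MRR, NDCG@k, and exact match under stated assumptions: the function names are hypothetical, NDCG uses linear gains with a log2 discount, and exact match applies a common normalization convention (lowercase, strip punctuation and articles) that may differ from the one a given benchmark prescribes.

```python
import math
import re
import string

def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per question."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(ranked_ids, gains, k):
    """gains: dict mapping item id to graded relevance (missing means 0)."""
    dcg = sum(gains.get(item, 0) / math.log2(i + 2)
              for i, item in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))
```

Running the same question set through the pipeline with and without KG input, then comparing these scores side by side, gives the with/without comparison described above.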

Agent Features

Memory

Retrieval memory / vector index; KG as external symbolic memory

Planning

LLM-driven beam/CoT path search; KG-guided question decomposition

Tool Use

KG query executors; Graph traversal agents (KG-Agent, KGP); Indexing and vector DBs

Frameworks

KG-Agent, ODA, PoG

Architectures

LLM + GNN cross-encoder; RAG with subgraph retriever; Agent loop (LLM orchestrator + KG executor)

Collaboration

LLM + KG joint reasoning; LLM agents selecting KG tools

Optimization Features

Token Efficiency

Token-based KG-RAG optimizations (SPOKE-like approaches) to reduce LLM calls

Infra Optimization

Hierarchical graph partitioning and neighborhood expansion; Dynamic path-prior proposal networks for retrieval pruning

Model Optimization

LoRA

System Optimization

Caching subgraphs and intermediate embeddings; Amortized reasoning to avoid repeated KG queries
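One simple way to picture subgraph caching is an in-process memoization layer in front of the KG lookup. The sketch below uses Python's standard `functools.lru_cache`; the `_KG` dictionary, the `CALLS` counter, and the function names are hypothetical stand-ins for a remote graph database and its client, not an API from any surveyed system.

```python
from functools import lru_cache

# Counter used only to show how many lookups actually reach the KG.
CALLS = {"kg_queries": 0}

# Hypothetical adjacency store standing in for a remote graph database.
_KG = {
    "aspirin": (
        ("aspirin", "treats", "headache"),
        ("aspirin", "interacts_with", "warfarin"),
    ),
    "warfarin": (("warfarin", "treats", "thrombosis"),),
}

def _query_kg(entity):
    """Simulated expensive lookup; a real system would issue a graph query."""
    CALLS["kg_queries"] += 1
    return _KG.get(entity, ())

@lru_cache(maxsize=1024)
def cached_neighbors(entity):
    """Return an immutable tuple of triples so cached values stay unchanged."""
    return _query_kg(entity)

def gather_context(entities):
    """Collect triples for all entities, reusing cached neighborhoods."""
    triples = []
    for entity in entities:
        triples.extend(cached_neighbors(entity))
    return triples

context = gather_context(["aspirin", "warfarin", "aspirin"])
print(len(context), "triples from", CALLS["kg_queries"], "KG queries")
```

A cache like this goes stale when the KG is updated, so a deployment would need an invalidation step such as calling `cached_neighbors.cache_clear()` after each graph refresh.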

Training Optimization

Joint LM+GNN pretraining and knowledge-aware fine-tuning; Instruction fine-tuning with KG-derived prompts

Inference Optimization

Index-based retrieval (dynamic/adaptable indices); Prompt-based filtering and CoT-guided filters; Token-call minimization and cost-based policies

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

May miss very recent papers due to rapid publication pace (authors note this).

Survey emphasizes taxonomy and qualitative alignment; it underemphasizes head-to-head quantitative comparisons.

When Not To Use

When no reliable KG exists for your domain or KG coverage is very sparse.

When ultra-low latency and very high throughput matter and you cannot afford KG traversal costs.

Failure Modes

Knowledge conflicts between KG facts and LLM parametric facts can cause inconsistent answers.

Outdated or incomplete KGs lead to false negatives in validation and wrongful filtering of correct model outputs.
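The second failure mode suggests that a KG validator should distinguish "the KG disagrees" from "the KG has nothing to say". The sketch below illustrates that three-way verdict on a toy graph. The `KG` contents, the `FUNCTIONAL` relation set, and `check_claim` are assumptions for illustration; in particular, treating a mismatch as a contradiction only for single-valued relations is a design choice of this sketch, not a rule taken from the survey.

```python
from enum import Enum

class Verdict(Enum):
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    UNKNOWN = "unknown"

# Hypothetical toy KG: (subject, relation) -> set of known objects.
KG = {
    ("Paris", "capital_of"): {"France"},
    ("Berlin", "capital_of"): {"Germany"},
    ("Marie Curie", "award"): {"Nobel Prize in Physics"},
}

# Relations assumed to be single-valued, so a mismatch is a real conflict.
FUNCTIONAL = {"capital_of"}

def check_claim(kg, subject, relation, obj):
    """Classify an extracted (subject, relation, object) claim against the KG."""
    known = kg.get((subject, relation))
    if known is None:
        return Verdict.UNKNOWN       # KG is silent: do not filter the answer
    if obj in known:
        return Verdict.SUPPORTED
    if relation in FUNCTIONAL:
        return Verdict.CONTRADICTED  # single-valued relation with another value
    return Verdict.UNKNOWN           # multi-valued: KG may simply be incomplete

print(check_claim(KG, "Paris", "capital_of", "Italy").value)
```

Only `CONTRADICTED` claims would be flagged or filtered under this scheme, while `UNKNOWN` claims pass through with a lower-confidence label, which limits wrongful filtering when the KG is incomplete.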

Core Entities

Models

GPT-4, GPT-3.5-Turbo, Llama-2, Llama-3, Qwen, Gemma, Vicuna, Zephyr, Mistral

Metrics

Answer Quality, Retrieval Quality, Reasoning Quality, BERTScore, MRR, NDCG, Hop-Acc, Truthfulness Score, Faithfulness Score

Datasets

WebQSP, WQSP, CWQ, HotpotQA, 2WikiMQA, MetaQA, PubMedQA, M3SciQA, FanOutQA, MINTQA, EXAQT

Benchmarks

STaRK, LLM-KG-Bench, OKGQA, XplainLLM, MINTQA, mmRAG

Context Entities

Models

RoBERTa, T5, FLAN-T5, SentenceTransformer

Metrics

Accuracy, Exact Match (EM), F1, ROUGE, BLEU

Datasets

TriviaQA, OBQA, CSQA, BioASQ, MedQA, LiveQA

Benchmarks

FanOutQA, PatQA, TempTabQA
