Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Combining KGs with LLMs reduces hallucinations and adds verifiable evidence for high-stakes QA, but it raises compute and maintenance costs, so teams must weigh accuracy and traceability against latency and budget.
Summary TLDR
This focused survey organizes and compares methods that combine large language models (LLMs) with knowledge graphs (KGs) to improve question answering (QA). It proposes a three-role taxonomy (KG as background knowledge, as reasoning guideline, and as refiner/validator), reviews representative systems (GraphRAG, KG-RAG, KG-Adapter, KG-Agent, etc.), summarizes benchmarks and metrics, and highlights practical bottlenecks: costly graph retrieval, knowledge misalignment, and KG incompleteness. The paper ends with concrete optimization ideas (indexing, prompt tuning, cost-aware policies) and research directions for scaling, dynamic updates, and fairness-aware retrieval.
Problem Statement
LLM-based QA is strong on language but struggles with complex, multi-step, time-sensitive, or domain-specific questions due to limited reasoning, outdated parametric knowledge, and hallucinations. How can structured, factual KGs be combined with LLMs to reduce hallucination, improve multi-hop reasoning, and provide explainable evidence while remaining efficient and up-to-date?
Main Contribution
A structured taxonomy that classifies LLM+KG QA methods by QA type and the KG's role: background knowledge, reasoning guideline, refiner/validator, and hybrid.
A systematic survey and comparison of recent representative methods, grouped by the KG role and aligned to complex QA tasks (multi-doc, multi-modal, multi-hop, conversational, explainable, temporal).
A summary of evaluation metrics, benchmark datasets, optimizations, and concrete open challenges: scaling, dynamic KG integration, explainability, and fairness-aware retrieval.
Key Findings
Using KGs in three roles (background, guideline, refiner) is the dominant design pattern for combining KGs with LLMs in QA.
Graph-based RAG (GraphRAG / KG-RAG) retrieves structured subgraphs rather than raw text and improves reasoning and evidence grounding compared to text-only RAG.
KG-guided reasoning (offline templates, online iterative guidance, or agent-based loops) yields more explainable multi-hop answers but is computationally heavier.
A central systems bottleneck is scalability: subgraph extraction, graph traversal, and vector indexing over large KGs are computationally costly.
KGs reduce hallucination and improve factual validation but introduce risks when KGs are incomplete, inconsistent, or outdated.
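To make the subgraph-retrieval finding concrete, here is a minimal sketch of k-hop neighborhood expansion over a toy triple store. It is not the method of any surveyed system; the triples, the function names (`build_index`, `k_hop_subgraph`, `linearize`), and the breadth-first strategy are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque

# Toy KG as (head, relation, tail) triples; illustrative data only.
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Marie Curie", "field", "Physics"),
    ("Poland", "continent", "Europe"),
]

def build_index(triples):
    """Map each entity to the triples it appears in, as head or tail."""
    index = defaultdict(list)
    for h, r, t in triples:
        index[h].append((h, r, t))
        if t != h:
            index[t].append((h, r, t))
    return index

def k_hop_subgraph(index, seeds, k):
    """Breadth-first expansion from the seed entities, up to k hops."""
    seen_entities = set(seeds)
    seen_triples = set()
    collected = []  # keeps discovery order
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        entity, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond the hop budget
        for triple in index.get(entity, []):
            if triple not in seen_triples:
                seen_triples.add(triple)
                collected.append(triple)
            h, _, t = triple
            neighbor = t if h == entity else h
            if neighbor not in seen_entities:
                seen_entities.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return collected

def linearize(triples):
    """Render triples as text lines that can be placed in an LLM prompt."""
    return "\n".join(f"{h} --{r}--> {t}" for h, r, t in triples)

index = build_index(TRIPLES)
print(linearize(k_hop_subgraph(index, ["Marie Curie"], 2)))
```

The hop budget `k` is the main cost lever: each extra hop can multiply the number of triples retrieved, which is one reason the scalability bottleneck above appears in practice.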
Who Should Care
What To Try In 7 Days
Prototype KG-augmented retrieval: add a subgraph retrieval step to your RAG pipeline and compare answer correctness on 50 domain questions.
Run a simple KG-based validator: re-check LLM answers against a KG and measure how many answers change or get flagged.
Measure retrieval quality: compute retrieval relevance (MRR/NDCG) and downstream answer quality (accuracy/EM) with and without KG input.
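For the measurement step, the metrics named above can be computed with a few lines of plain Python. The sketch below uses the standard definitions of MRR, NDCG@k (with log2 discounting), and exact match; the function names and the input shapes are assumptions, and the exact-match normalization (lowercasing and trimming) is a simplification of what benchmark scripts usually do.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per question."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(ranked_ids, gains, k):
    """gains: dict of item id -> graded relevance; missing ids count as 0."""
    dcg = sum(
        gains.get(item, 0) / math.log2(i + 2)
        for i, item in enumerate(ranked_ids[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def exact_match(prediction, gold):
    """Case-insensitive, whitespace-trimmed string equality."""
    return prediction.strip().lower() == gold.strip().lower()
```

Running the same question set through the pipeline with and without KG input, and comparing these numbers side by side, gives the with/without comparison the item describes.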
Agent Features
Memory
- Retrieval memory / vector index
- KG as external symbolic memory
Planning
- LLM-driven beam/CoT path search
- KG-guided question decomposition
Tool Use
- KG query executors
- Graph traversal agents (KG-Agent, KGP)
- Indexing and vector DBs
Frameworks
- KG-Agent
- ODA
- PoG
Architectures
- LLM + GNN cross-encoder
- RAG with subgraph retriever
- Agent loop (LLM orchestrator + KG executor)
Collaboration
- LLM + KG joint reasoning
- LLM agents selecting KG tools
Optimization Features
Token Efficiency
- Token-based KG-RAG optimizations (SPOKE-like approaches) to reduce LLM calls
Infra Optimization
- Hierarchical graph partitioning and neighborhood expansion
- Dynamic path-prior proposal networks for retrieval pruning
Model Optimization
- LoRA
System Optimization
- Caching subgraphs and intermediate embeddings
- Amortized reasoning to avoid repeated KG queries
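As a rough illustration of subgraph caching, the sketch below wraps a KG lookup in Python's standard `functools.lru_cache` so that repeated questions about the same entity do not hit the graph store again. The `query_kg` function and its toy data are hypothetical stand-ins for a real KG client; the surveyed systems may cache at a different granularity.

```python
from functools import lru_cache

# Counter used only to show how many real lookups happen.
CALLS = {"kg_queries": 0}

def query_kg(entity):
    """Hypothetical stand-in for an expensive KG lookup."""
    CALLS["kg_queries"] += 1
    toy = {"aspirin": (("aspirin", "treats", "headache"),)}
    return toy.get(entity, ())  # tuples are hashable and safe to cache

@lru_cache(maxsize=1024)
def cached_neighbors(entity):
    """Return the entity's triples, reusing earlier results when possible."""
    return query_kg(entity)

cached_neighbors("aspirin")
cached_neighbors("aspirin")  # served from the cache
print(CALLS["kg_queries"], cached_neighbors.cache_info().hits)
```

Caching trades freshness for latency: if the KG is updated, stale entries need to be dropped, for example with `cached_neighbors.cache_clear()`.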
Training Optimization
- Joint LM+GNN pretraining and knowledge-aware fine-tuning
- Instruction fine-tuning with KG-derived prompts
Inference Optimization
- Index-based retrieval (dynamic/adaptable indices)
- Prompt-based filtering and CoT-guided filters
- Token-call minimization and cost-based policies
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- May miss very recent papers due to rapid publication pace (authors note this).
- Survey emphasizes taxonomy and qualitative alignment; it underemphasizes head-to-head quantitative comparisons.
- Reported utility of KGs depends on KG coverage, freshness, and implementation details not standardized across studies.
When Not To Use
- When no reliable KG exists for your domain or KG coverage is very sparse.
- When ultra-low latency and very high throughput matter and you cannot afford KG traversal costs.
- When the cost of maintaining and updating a KG outweighs benefits for simple factual queries.
Failure Modes
- Knowledge conflicts between KG facts and LLM parametric facts can cause inconsistent answers.
- Outdated or incomplete KGs lead to false negatives in validation and wrongful filtering of correct model outputs.
- Large-scale graph traversal causes high latency and memory spikes if not optimized.
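One way to soften the incomplete-KG failure mode is to have the validator return three outcomes instead of two, so that a fact missing from the KG is flagged as unknown rather than rejected. The sketch below assumes claims have already been extracted from the LLM answer as (head, relation, tail) triples, and that some relations are declared functional (at most one tail per head); both assumptions, and the function name `validate_claim`, are illustrative rather than taken from the survey.

```python
def validate_claim(kg, head, relation, tail, functional_relations=frozenset()):
    """Check one claimed triple against a KG given as a set of triples.

    Returns "supported", "contradicted", or "unknown".
    """
    if (head, relation, tail) in kg:
        return "supported"
    known_tails = {t for h, r, t in kg if h == head and r == relation}
    # Only call it a contradiction when the relation allows a single value
    # and the KG already holds a different one.
    if known_tails and relation in functional_relations:
        return "contradicted"
    # Absence from the KG is not evidence that the claim is false.
    return "unknown"

kg = {("Paris", "capital_of", "France")}
functional = frozenset({"capital_of"})
print(validate_claim(kg, "Paris", "capital_of", "France", functional))
print(validate_claim(kg, "Paris", "capital_of", "Germany", functional))
print(validate_claim(kg, "Berlin", "capital_of", "Germany", functional))
```

Routing "unknown" results to a softer path (for example, keeping the answer but marking it unverified) avoids filtering out correct outputs just because the KG lacks coverage.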
Core Entities
Models
- GPT-4
- GPT-3.5-Turbo
- Llama-2
- Llama-3
- Qwen
- Gemma
- Vicuna
- Zephyr
- Mistral
Metrics
- Answer Quality
- Retrieval Quality
- Reasoning Quality
- BERTScore
- MRR
- NDCG
- Hop-Acc
- Truthfulness Score
- Faithfulness Score
Datasets
- WebQSP (also abbreviated WQSP)
- CWQ
- HotpotQA
- 2WikiMQA
- MetaQA
- PubMedQA
- M3SciQA
- FanOutQA
- MINTQA
- EXAQT
Benchmarks
- STaRK
- LLM-KG-Bench
- OKGQA
- XplainLLM
- MINTQA
- mmRAG
Context Entities
Models
- RoBERTa
- T5
- FLAN-T5
- SentenceTransformer
Metrics
- Accuracy
- Exact Match (EM)
- F1
- ROUGE
- BLEU
Datasets
- TriviaQA
- OBQA
- CSQA
- BioASQ
- MedQA
- LiveQA
Benchmarks
- FanOutQA
- PatQA
- TempTabQA

