A practical survey showing how knowledge graphs can make LLMs better at complex question answering

May 26, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

0

Authors

Chuangtao Ma, Yongrui Chen, Tianxing Wu, Arijit Khan, Haofen Wang

Why It Matters For Business

Combining KGs with LLMs reduces hallucinations and adds verifiable evidence for high-stakes QA, but it raises compute and maintenance costs. Weigh accuracy and traceability against latency and budget.

Summary TLDR

This focused survey organizes and compares methods that combine large language models (LLMs) with knowledge graphs (KGs) to improve question answering (QA). It proposes a three-role taxonomy (KG as background knowledge, as reasoning guideline, and as refiner/validator), reviews representative systems (GraphRAG, KG-RAG, KG-Adapter, KG-Agent, etc.), summarizes benchmarks and metrics, and highlights practical bottlenecks: costly graph retrieval, knowledge misalignment, and KG incompleteness. The paper ends with concrete optimization ideas (indexing, prompt tuning, cost-aware policies) and research directions for scaling, dynamic updates, and fairness-aware retrieval.

Problem Statement

LLM-based QA is strong on language but struggles with complex, multi-step, time-sensitive, or domain-specific questions due to limited reasoning, outdated parametric knowledge, and hallucinations. How can structured, factual KGs be combined with LLMs to reduce hallucination, improve multi-hop reasoning, and provide explainable evidence while remaining efficient and up-to-date?

Main Contribution

A structured taxonomy that classifies LLM+KG QA methods by QA type and the KG's role: background knowledge, reasoning guideline, refiner/validator, and hybrid.

A systematic survey and comparison of recent representative methods, grouped by the KG role and aligned to complex QA tasks (multi-doc, multi-modal, multi-hop, conversational, explainable, temporal).

A summary of evaluation metrics, benchmark datasets, optimizations, and concrete open challenges: scaling, dynamic KG integration, explainability, and fairness-aware retrieval.

Key Findings

Using KGs in three roles (background, guideline, refiner) is the dominant design pattern for combining KGs with LLMs in QA.

Graph-based RAG (GraphRAG / KG-RAG) retrieves structured subgraphs rather than raw text and improves reasoning and evidence grounding compared to text-only RAG.

KG-guided reasoning (offline templates, online iterative guidance, or agent-based loops) yields more explainable multi-hop answers but is computationally heavier.

A central systems bottleneck is scalability: subgraph extraction, graph traversal, and vector indexing over large KGs are computationally costly.

KGs reduce hallucination and improve factual validation but introduce risks when KGs are incomplete, inconsistent, or outdated.

Who Should Care

What To Try In 7 Days

Prototype KG-augmented retrieval: add a subgraph retrieval step to your RAG pipeline and compare answer correctness on 50 domain questions.
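
As a starting point for the subgraph retrieval step, here is a minimal sketch, assuming the KG is available as (subject, relation, object) triples and that entity linking (mapping the question to seed entities) happens upstream. The toy triples and the names `build_adjacency`, `retrieve_subgraph`, and `to_prompt_context` are illustrative assumptions, not taken from the survey or from any particular GraphRAG implementation.

```python
from collections import defaultdict, deque

# Toy KG as (subject, relation, object) triples; illustrative data only.
TRIPLES = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("warfarin", "treats", "thrombosis"),
    ("ibuprofen", "treats", "headache"),
]

def build_adjacency(triples):
    """Index each triple under both endpoints so traversal ignores edge direction."""
    adj = defaultdict(list)
    for s, r, o in triples:
        adj[s].append((s, r, o))
        adj[o].append((s, r, o))
    return adj

def retrieve_subgraph(adj, seeds, max_hops=1):
    """Breadth-first expansion from the seed entities, up to max_hops edges away."""
    seen_entities = set(seeds)
    collected = []
    frontier = deque((entity, 0) for entity in seeds)
    while frontier:
        entity, depth = frontier.popleft()
        if depth >= max_hops:
            continue
        for triple in adj.get(entity, []):
            if triple not in collected:
                collected.append(triple)
            s, _, o = triple
            for neighbor in (s, o):
                if neighbor not in seen_entities:
                    seen_entities.add(neighbor)
                    frontier.append((neighbor, depth + 1))
    return collected

def to_prompt_context(triples):
    """Serialize triples as plain lines that can be prepended to the LLM prompt."""
    return "\n".join(f"{s} --{r}--> {o}" for s, r, o in triples)

adj = build_adjacency(TRIPLES)
subgraph = retrieve_subgraph(adj, ["warfarin"], max_hops=1)
context = to_prompt_context(subgraph)
```

The `max_hops` parameter is the main cost lever: each extra hop can multiply the number of triples returned, which is where the retrieval cost discussed in this summary comes from.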

Run a simple KG-based validator: re-check LLM answers against a KG and measure how many answers change or get flagged.
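
One possible shape for such a validator is sketched below. It assumes the claims have already been extracted from the LLM answer as triples (that extraction step is out of scope here), and it uses a hypothetical `FUNCTIONAL` set to mark relations that admit only one object per subject. The toy KG and all names are assumptions for illustration.

```python
from collections import Counter

# Toy KG and relation metadata; illustrative assumptions only.
KG = {
    ("France", "capital", "Paris"),
    ("Paris", "located_in", "France"),
    ("France", "borders", "Spain"),
}
FUNCTIONAL = {"capital"}  # relations assumed to have exactly one object per subject

def validate_claim(claim, kg, functional=FUNCTIONAL):
    """Classify a (subject, relation, object) claim against the KG."""
    subject, relation, _ = claim
    if claim in kg:
        return "supported"
    # The KG holds a different object for a single-valued relation: flag a conflict.
    if relation in functional and any(
        s == subject and r == relation for s, r, _ in kg
    ):
        return "contradicted"
    # A missing fact is not evidence of an error, since the KG may be incomplete.
    return "unknown"

claims = [
    ("France", "capital", "Paris"),
    ("France", "capital", "Lyon"),
    ("France", "borders", "Italy"),
]
verdicts = [validate_claim(c, KG) for c in claims]
summary = Counter(verdicts)
```

Keeping "unknown" separate from "contradicted" is a deliberate choice in this sketch: it avoids filtering out correct answers merely because the KG lacks the fact.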

Measure retrieval quality: compute retrieval relevance (MRR/NDCG) and downstream answer quality (accuracy/EM) with and without KG input.
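
The metrics named above can be computed with a few lines of code. The sketch below uses the standard definitions of MRR and NDCG@k (with linear gain and a log2 discount) and a simple normalized exact match. The function names and the sample inputs are illustrative assumptions.

```python
import math

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant item per query (0 if none is found)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked, gains, k):
    """NDCG@k for one query; gains maps item -> graded relevance (missing items count as 0)."""
    dcg = sum(
        gains.get(item, 0) / math.log2(i + 2) for i, item in enumerate(ranked[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def exact_match(prediction, gold):
    """Case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == gold.strip().lower()

# Example: two queries, each with a ranked list of retrieved item IDs.
retrieval_mrr = mrr([["a", "b", "c"], ["x", "y", "z"]], [{"b"}, {"z"}])
retrieval_ndcg = ndcg_at_k(["b", "a"], {"a": 1}, k=2)
answer_em = exact_match(" Paris ", "paris")
```

Running the same functions on the text-only and KG-augmented pipelines over an identical question set gives the with/without comparison described above.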

Agent Features

Memory

  • Retrieval memory / vector index
  • KG as external symbolic memory

Planning

  • LLM-driven beam/CoT path search
  • KG-guided question decomposition
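
To make the beam-style path search concrete, here is a minimal sketch over a toy adjacency map. In the systems the survey covers, an LLM would score candidate relations against the question; here a hand-written `RELEVANCE` table stands in for that scorer. The graph, the scores, and the function names are all assumptions for illustration.

```python
# Toy KG as entity -> list of (relation, neighbor); illustrative data only.
ADJ = {
    "Inception": [("directed_by", "Christopher Nolan"), ("released_in", "2010")],
    "Christopher Nolan": [("born_in", "London"), ("directed", "Interstellar")],
}

# Stand-in for an LLM relevance score for the question
# "Where was the director of Inception born?" (hypothetical values).
RELEVANCE = {"directed_by": 1.0, "born_in": 1.0}

def score_relation(relation):
    return RELEVANCE.get(relation, 0.0)

def beam_search_paths(adj, start, score_fn, beam_width=2, max_hops=2):
    """Keep the beam_width highest-scoring paths at each hop."""
    # Each beam entry: (path of triples, current entity, cumulative score).
    beams = [([], start, 0.0)]
    for _ in range(max_hops):
        candidates = []
        for path, entity, score in beams:
            for relation, neighbor in adj.get(entity, []):
                candidates.append(
                    (path + [(entity, relation, neighbor)], neighbor, score + score_fn(relation))
                )
        if not candidates:
            break  # no path can be extended further
        candidates.sort(key=lambda c: c[2], reverse=True)
        beams = candidates[:beam_width]
    return beams

best_path, answer_entity, best_score = beam_search_paths(ADJ, "Inception", score_relation)[0]
```

The returned path doubles as the explanation: each triple on it is a piece of evidence that can be shown alongside the answer.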

Tool Use

  • KG query executors
  • Graph traversal agents (KG-Agent, KGP)
  • Indexing and vector DBs

Frameworks

  • KG-Agent
  • ODA
  • PoG

Architectures

  • LLM + GNN cross-encoder
  • RAG with subgraph retriever
  • Agent loop (LLM orchestrator + KG executor)

Collaboration

  • LLM + KG joint reasoning
  • LLM agents selecting KG tools

Optimization Features

Token Efficiency

  • Token-based KG-RAG optimizations (SPOKE-like approaches) to reduce LLM calls

Infra Optimization

  • Hierarchical graph partitioning and neighborhood expansion
  • Dynamic path-prior proposal networks for retrieval pruning

Model Optimization

  • LoRA

System Optimization

  • Caching subgraphs and intermediate embeddings
  • Amortized reasoning to avoid repeated KG queries
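
Subgraph caching can be prototyped with the standard library alone. The sketch below wraps a stand-in KG query in `functools.lru_cache`. The `query_kg` function and its call counter are hypothetical placeholders for a real graph backend.

```python
from functools import lru_cache

QUERY_COUNT = {"n": 0}  # counts backend hits, for demonstration only

def query_kg(entity, hops):
    """Stand-in for an expensive KG backend call (hypothetical)."""
    QUERY_COUNT["n"] += 1
    return ((entity, "related_to", f"{entity}_neighbor_{hops}"),)

@lru_cache(maxsize=1024)
def cached_subgraph(entity, hops):
    """Memoize subgraphs by (entity, hops); results must be hashable, hence tuples."""
    return query_kg(entity, hops)

first = cached_subgraph("warfarin", 1)
second = cached_subgraph("warfarin", 1)  # served from the cache, no backend call
stats = cached_subgraph.cache_info()
```

When the KG is updated, calling `cached_subgraph.cache_clear()` drops stale entries. A production setup would likely need finer-grained invalidation than this.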

Training Optimization

  • Joint LM+GNN pretraining and knowledge-aware fine-tuning
  • Instruction fine-tuning with KG-derived prompts

Inference Optimization

  • Index-based retrieval (dynamic/adaptable indices)
  • Prompt-based filtering and CoT-guided filters
  • Token-call minimization and cost-based policies

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • May miss very recent papers due to rapid publication pace (authors note this).
  • Survey emphasizes taxonomy and qualitative alignment; it underemphasizes head-to-head quantitative comparisons.
  • Reported utility of KGs depends on KG coverage, freshness, and implementation details not standardized across studies.

When Not To Use

  • When no reliable KG exists for your domain or KG coverage is very sparse.
  • When ultra-low latency and very high throughput matter and you cannot afford KG traversal costs.
  • When the cost of maintaining and updating a KG outweighs benefits for simple factual queries.

Failure Modes

  • Knowledge conflicts between KG facts and LLM parametric facts can cause inconsistent answers.
  • Outdated or incomplete KGs lead to false negatives in validation and wrongful filtering of correct model outputs.
  • Large-scale graph traversal causes high latency and memory spikes if not optimized.

Core Entities

Models

  • GPT-4
  • GPT-3.5-Turbo
  • Llama-2
  • Llama-3
  • Qwen
  • Gemma
  • Vicuna
  • Zephyr
  • Mistral

Metrics

  • Answer Quality
  • Retrieval Quality
  • Reasoning Quality
  • BERTScore
  • MRR
  • NDCG
  • Hop-Acc
  • Truthfulness Score
  • Faithfulness Score

Datasets

  • WebQSP (also written WQSP)
  • CWQ
  • HotpotQA
  • 2WikiMQA
  • MetaQA
  • PubMedQA
  • M3SciQA
  • FanOutQA
  • MINTQA
  • EXAQT

Benchmarks

  • STaRK
  • LLM-KG-Bench
  • OKGQA
  • XplainLLM
  • MINTQA
  • mmRAG

Context Entities

Models

  • RoBERTa
  • T5
  • FLAN-T5
  • SentenceTransformer

Metrics

  • Accuracy
  • Exact Match (EM)
  • F1
  • ROUGE
  • BLEU

Datasets

  • TriviaQA
  • OBQA
  • CSQA
  • BioASQ
  • MedQA
  • LiveQA

Benchmarks

  • FanOutQA
  • PatQA
  • TempTabQA