Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
16
Why It Matters For Business
Grounding LLMs in knowledge graphs can cut factual errors, especially for small models and domain-specific tasks, and improves trustworthiness without retraining the full model.
Summary TLDR
This survey reviews methods that add structured knowledge from knowledge graphs (KGs) to large language models to reduce hallucinations. It groups approaches into three practical stages: KG-augmented inference (retrieval, reasoning, controlled generation), KG-aware training (pre-training and fine-tuning), and KG-based validation (fact-checking). The authors report that KG retrieval often boosts small-model QA accuracy substantially (the surveyed papers report improvements of more than 80% on the evaluated QA tasks) and that KG-guided stepwise reasoning can raise reasoning accuracy (e.g., RoG raised ChatGPT from 66.8% to 85.7% in its own evaluation). The survey highlights trade-offs: retrieval and validation are low-cost but rely on
Problem Statement
LLMs often produce plausible-sounding but incorrect statements ('hallucinations') because their internal knowledge is incomplete or outdated. The paper asks: can structured external knowledge (knowledge graphs) be added at inference, training, or validation stages to reduce hallucinations and improve reasoning?
Main Contribution
A concise taxonomy that groups KG-augmentation methods into Knowledge-Aware Inference, Knowledge-Aware Training, and Knowledge-Aware Validation.
A comparison table of representative methods, datasets, LLMs, and training costs to help pick an approach.
A practical synthesis of empirical findings, limits, trends, and open directions for reducing hallucinations with KGs.
Key Findings
KG-augmented retrieval can dramatically improve QA correctness for small models.
KG-guided stepwise reasoning substantially raises reasoning accuracy on evaluated tasks.
KG-based methods can achieve high domain accuracy when paired with domain graphs.
Pre-training or heavy fine-tuning with KGs improves domain performance but is costly.
KG validation (fact-checking) reduces hallucinations but adds runtime cost and can miss gaps.
Results
QA answer correctness (small LMs)
Accuracy
Accuracy
Who Should Care
What To Try In 7 Days
Build a simple KG-backed retriever and prepend retrieved triples to model prompts for QA experiments.
Run KG-based post-generation fact checks on a sample of high-stakes outputs to measure hallucination rates.
Pilot chain-of-thought + KG retrieval on a handful of multi-step queries to compare accuracy vs. baseline.
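The first experiment above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method from the survey: the toy triples, the substring-based entity matching, and the names `build_index`, `retrieve`, and `build_prompt` are all hypothetical. A real pilot would likely swap in an entity linker and a proper graph store.

```python
from collections import defaultdict

# Toy triples (hypothetical illustration data, not from the survey).
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "physics"),
    ("Warsaw", "capital_of", "Poland"),
]

def build_index(triples):
    """Map each lowercased subject/object to the triples that mention it."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[s.lower()].append((s, p, o))
        index[o.lower()].append((s, p, o))
    return index

def retrieve(question, index, max_triples=5):
    """Return triples whose subject or object appears verbatim in the question.

    Assumption: naive substring matching stands in for real entity linking.
    """
    q = question.lower()
    hits = []
    for entity, triples in index.items():
        if entity in q:
            for t in triples:
                if t not in hits:
                    hits.append(t)
    return hits[:max_triples]

def build_prompt(question, triples):
    """Textualize triples as short fact lines and prepend them to the question."""
    facts = "\n".join(f"- {s} {p.replace('_', ' ')} {o}" for s, p, o in triples)
    return (
        f"Known facts:\n{facts}\n\n"
        f"Question: {question}\n"
        "Answer using only the facts above."
    )

index = build_index(TRIPLES)
question = "Where was Marie Curie born?"
prompt = build_prompt(question, retrieve(question, index))
```

The resulting `prompt` string can be sent to any model under test. Comparing answers with and without the "Known facts" prefix gives a quick baseline for the QA experiment.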
Agent Features
Memory
- external KG as non-parametric memory
Tool Use
- KG retrievers
- SPARQL or structured query interfaces
- prompting frameworks (LangChain, LlamaIndex)
Frameworks
- RAG
- StructGPT
- KICGPT
Architectures
- retrieval-augmented pipelines
- graph-text joint encoders
Optimization Features
Token Efficiency
- Textualized KG triples used as compact context
Infra Optimization
- Prefer retrieval or validation when the compute budget rules out heavy fine-tuning
Model Optimization
- Accuracy
System Optimization
- Use fast KG retrievers to limit latency
- Cache common subgraphs for frequent queries
Training Optimization
- KG-guided masking and fusion for knowledge-enhanced pre-training
- KG-based synthetic corpora for targeted fine-tuning
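A KG-based synthetic corpus can be as simple as filling question templates from triples. The sketch below assumes hand-written templates keyed by predicate. The templates, the record format, and the name `triples_to_qa` are hypothetical and not taken from the survey.

```python
# Hypothetical question templates keyed by predicate.
TEMPLATES = {
    "capital_of": "Which country is {s} the capital of?",
    "born_in": "Where was {s} born?",
}

def triples_to_qa(triples, templates=TEMPLATES):
    """Turn each triple with a known predicate into a question/answer record."""
    examples = []
    for s, p, o in triples:
        template = templates.get(p)
        if template is None:
            # Skip predicates without a template rather than guess a phrasing.
            continue
        examples.append({"question": template.format(s=s), "answer": o})
    return examples
```

Skipping unknown predicates keeps the generated data clean at the cost of coverage, so the size of the resulting corpus is bounded by how many predicates have templates.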
Inference Optimization
- Retrieve relevant KG triples at runtime instead of retraining
- Controlled generation to limit model outputs to KG facts
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Survey may miss recent or niche works due to page and timeframe limits.
- Comparisons mix heterogeneous benchmarks and setups, so results are not directly comparable across methods.
- KG methods depend on KG coverage and quality; limited or biased KGs reduce benefits.
- Pre-training and fine-tuning approaches are resource-intensive and less portable.
When Not To Use
- For casual conversational agents where perfect factuality is not required.
- When no reliable KG covers your domain or building one is infeasible.
- When low-latency, zero-additional-infrastructure inference is mandatory.
Failure Modes
- Retriever returns irrelevant or outdated triples, causing confident but wrong answers.
- KGs introduce bias or propagate incorrect facts from their sources.
- Validation misses gaps in the KG, giving a false sense of safety.
- KG integration increases latency and system complexity.
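The "false sense of safety" failure mode suggests that a validator should report missing KG coverage separately from a passed check. The sketch below shows one possible shape for that. It assumes claims have already been extracted as (subject, predicate, object) triples by some upstream step, and that the KG lists every valid object for each (subject, predicate) pair it contains, a closed-world assumption that many real graphs do not satisfy. All names are hypothetical.

```python
from collections import Counter

def check_claim(claim, kg):
    """Classify a (subject, predicate, object) claim against the KG.

    kg maps (subject, predicate) -> set of known objects.
    Returns "supported", "contradicted", or "unknown".
    Assumption: if the KG knows a (subject, predicate) pair, it lists all
    valid objects, so an unlisted object counts as a contradiction.
    """
    s, p, o = claim
    known = kg.get((s, p))
    if known is None:
        return "unknown"  # KG gap: neither verified nor refuted
    return "supported" if o in known else "contradicted"

def verdict_counts(claims, kg):
    """Tally verdicts so KG gaps stay visible next to contradictions."""
    return Counter(check_claim(c, kg) for c in claims)
```

Tracking the "unknown" count alongside "contradicted" gives a rough measure of how much of a sample the KG cannot actually check.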
Core Entities
Models
- GPT-3
- GPT-3.5
- GPT-4
- PaLM
- T5
- Flan-T5
- Llama
- BART
- BERT
- RoBERTa
Metrics
- Accuracy
- Top-K
- MRR
- Hits@1
- Exact Match
- Human evaluation
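For reference, the ranking metrics above have standard definitions that are short enough to sketch directly. The function names are illustrative, and the exact-match normalization here (lowercasing and whitespace collapsing) is a simplifying assumption; individual benchmarks often define their own normalization rules.

```python
def reciprocal_rank(ranked, gold):
    """1/rank of the first correct candidate, or 0.0 if none is correct."""
    for i, cand in enumerate(ranked, start=1):
        if cand in gold:
            return 1.0 / i
    return 0.0

def mrr(all_ranked, all_gold):
    """Mean reciprocal rank over a set of queries."""
    scores = [reciprocal_rank(r, g) for r, g in zip(all_ranked, all_gold)]
    return sum(scores) / len(scores)

def hits_at_k(all_ranked, all_gold, k=1):
    """Fraction of queries with a correct candidate in the top k."""
    hits = sum(
        1 for r, g in zip(all_ranked, all_gold) if any(c in g for c in r[:k])
    )
    return hits / len(all_ranked)

def exact_match(pred, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    def norm(s):
        return " ".join(s.lower().split())
    return float(norm(pred) in {norm(g) for g in gold_answers})
```

Hits@1 and MRR coincide only when every correct answer is ranked first, so reporting both shows how far down the list correct answers tend to fall.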
Datasets
- WebQSP
- WebQuestions
- ComplexWebQuestions
- Mintaka
- MetaQA
- HotpotQA
- 2WikiMultiHopQA
- WebQ
- ZJQA
- SST
- WikiTableQuestions
- TabFact
- FEVEROUS
- SciFact-Open
Benchmarks
- KGQA benchmarks (WebQSP, LC-QuAD, MetaQA)
- Commonsense QA (CommonsenseQA, OpenBookQA)
- Multi-hop QA (HotpotQA, 2WikiMultiHopQA)
- Fact verification (FEVEROUS, SciFact-Open)

