Survey: Can knowledge graphs reduce hallucinations in large language models?

Overview

Decision Snapshot: Needs Validation

KG augmentation gives practical gains (especially for small models and domain tasks) but depends on retriever quality, KG coverage, and added latency; choose retrieval/validation for immediate benefit and heavy training only for critical domains.

Citations: 16

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Garima Agrawal, Tharindu Kumarage, Zeyad Alghamdi, Huan Liu

Why It Matters For Business

Adding knowledge graphs to LLMs can cut factual errors quickly, especially for small models and domain tasks, improving trustworthiness without full model retraining.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

This survey reviews methods that add structured knowledge (knowledge graphs, KGs) to large language models to reduce hallucinations. It groups approaches into three practical stages: KG-augmented inference (retrieval, reasoning, controlled generation), KG-aware training (pre-training and fine-tuning), and KG-based validation (fact-checking). The authors report that KG retrieval often boosts small-model QA accuracy substantially (papers report >80% improvements on evaluated QA tasks) and that KG-guided step-wise reasoning can raise reasoning accuracy (e.g., RoG raised ChatGPT from 66.8% to 85.7% on its tests). The survey highlights trade-offs: retrieval and validation are low-cost but rely on

Problem Statement

LLMs often produce plausible-sounding but incorrect statements ('hallucinations') because their internal knowledge is incomplete or outdated. The paper asks: can structured external knowledge (knowledge graphs) be added at inference, training, or validation stages to reduce hallucinations and improve reasoning?

Main Contribution

A concise taxonomy that groups KG-augmentation methods into Knowledge-Aware Inference, Knowledge-Aware Training, and Knowledge-Aware Validation.

A comparison table of representative methods, datasets, LLMs, and training costs to help pick an approach.

Key Findings

KG-augmented retrieval can dramatically improve QA correctness for small models.

Numbers: reported >80% answer correctness gain on QA (Baek et al.; Sen et al.; Wu et al.)

Practical Use: If you run QA with a small model, add KG-based retrieval first; it is a high-impact, low-cost fix compared with scaling model size.

Evidence Ref: Section 4.3; references Baek et al. 2023; Sen et al. 2023; Wu et al. 2023

KG-guided stepwise reasoning substantially raises reasoning accuracy on evaluated tasks.

Numbers: ChatGPT accuracy rose from 66.8% to 85.7% with RoG (Luo et al., 2023)

Practical Use: For multi-step or complex QA, combine chain-of-thought prompting with KG retrieval to improve correctness and provide interpretable reasoning paths.

Evidence Ref: Section 3.1.2 and 4.3; Luo et al. 2023

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
QA answer correctness (small LMs)	>80% improvement reported on evaluated QA tasks when augmented with KG retrieval	no KG retrieval	>80% gain (as reported)	evaluated QA datasets (Baek et al.; Sen et al.; Wu et al.)	Section 4.3; Baek et al. 2023; Sen et al. 2023; Wu et al. 2023	Section 4.3
Accuracy	85.7% with KG-guided reasoning	66.8% without KG-guided reasoning	+18.9 percentage points	reasoning benchmark used in Luo et al. 2023	Section 3.1.2 and 4.3; Luo et al. 2023	Luo et al. 2023
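
The delta in the accuracy row is an absolute difference in percentage points, which is not the same as a relative gain. A minimal sketch of both calculations, using only the two figures reported above (the function name `accuracy_delta` is an illustrative assumption, not something from the survey):

```python
def accuracy_delta(baseline_pct: float, new_pct: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new_pct - baseline_pct
    relative = 100.0 * absolute / baseline_pct
    return round(absolute, 1), round(relative, 1)


# Figures reported for ChatGPT without and with RoG (Luo et al., 2023).
abs_gain, rel_gain = accuracy_delta(66.8, 85.7)
print(abs_gain, rel_gain)  # 18.9 28.3
```

Read this way, the +18.9 point gain corresponds to roughly a 28% relative improvement over the baseline, which is worth keeping distinct from the ">80%" figure in the first row, whose exact definition the digest does not state.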

What To Try In 7 Days

Build a simple KG-backed retriever and prepend retrieved triples to model prompts for QA experiments.

Run KG-based post-generation fact checks on a sample of high-stakes outputs to measure hallucination rates.

Pilot chain-of-thought + KG retrieval on a handful of multi-step queries to compare accuracy vs. baseline.
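
The first experiment above can be sketched in a few lines. This is a toy illustration under stated assumptions: the in-memory `TOY_KG`, the naive substring matcher in `retrieve_triples`, and the prompt wording in `build_prompt` are all hypothetical stand-ins, not the retrievers evaluated in the surveyed papers. A real pilot would swap in entity linking and a proper graph store.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy KG; a real system would query a graph store instead.
TOY_KG: List[Triple] = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "physics"),
    ("Warsaw", "capital_of", "Poland"),
    ("Albert Einstein", "born_in", "Ulm"),
]


def retrieve_triples(question: str, kg: List[Triple], k: int = 3) -> List[Triple]:
    """Return up to k triples whose subject or object appears in the question.

    Naive substring matching; short entity names can match by accident.
    """
    q = question.lower()
    hits = [t for t in kg if t[0].lower() in q or t[2].lower() in q]
    return hits[:k]


def build_prompt(question: str, triples: List[Triple]) -> str:
    """Prepend textualized triples to the question."""
    facts = "\n".join(f"({s}, {p}, {o})" for s, p, o in triples)
    return (
        f"Facts:\n{facts}\n\n"
        f"Question: {question}\n"
        "Answer using only the facts above."
    )


question = "Where was Marie Curie born?"
prompt = build_prompt(question, retrieve_triples(question, TOY_KG))
print(prompt)  # pass this string to whichever model you are evaluating
```

Comparing model answers with and without the `Facts:` prefix on the same question set gives the baseline-versus-augmented comparison the list describes.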

Agent Features

Memory

external KG as non-parametric memory

Tool Use

KG retrievers

SPARQL or structured query interfaces

prompting frameworks (LangChain, LlamaIndex)

Frameworks

RAG, StructGPT, KICGPT

Architectures

retrieval-augmented pipelines

graph-text joint encoders

Optimization Features

Token Efficiency

Textualized KG triples used as compact context

Infra Optimization

Prefer retrieval/validation when compute budget prevents heavy fine-tuning

Model Optimization

Accuracy

System Optimization

Use fast KG retrievers to limit latency

Cache common subgraphs for frequent queries
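
One way to cache subgraphs for frequent queries is to memoize the lookup per entity. The sketch below uses Python's standard `functools.lru_cache`; the toy KG, the `slow_kg_lookup` helper, and the call counter are hypothetical and exist only to make the cache effect visible.

```python
from functools import lru_cache
from typing import Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy KG standing in for a remote graph store.
TOY_KG: Tuple[Triple, ...] = (
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Albert Einstein", "born_in", "Ulm"),
)

CALLS = {"count": 0}  # counts how often the backing store is actually hit


def slow_kg_lookup(entity: str) -> Tuple[Triple, ...]:
    """Stand-in for an expensive query: all triples touching the entity."""
    CALLS["count"] += 1
    return tuple(t for t in TOY_KG if entity in (t[0], t[2]))


@lru_cache(maxsize=1024)
def cached_subgraph(entity: str) -> Tuple[Triple, ...]:
    """Memoized one-hop subgraph; returns a tuple so cached values stay immutable."""
    return slow_kg_lookup(entity)


cached_subgraph("Warsaw")
cached_subgraph("Warsaw")  # served from the cache, no second lookup
print(CALLS["count"], cached_subgraph.cache_info().hits)  # 1 1
```

An in-process cache like this goes stale when the KG is updated, so a production setup would also need an invalidation or expiry policy, which this sketch omits.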

Training Optimization

KG-guided masking and fusion for pre-training (knowledge-enhanced pretraining)

KG-based synthetic corpora for targeted fine-tuning

Inference Optimization

Retrieve relevant KG triples at runtime instead of retraining

Controlled generation to limit model outputs to KG facts
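
Limiting outputs to KG facts can be approximated after generation by checking each claimed triple against the graph. The sketch below is a simplified post-hoc filter, not decoding-time control as used in the surveyed controlled-generation methods; it assumes the model's claims have already been extracted as (subject, relation, object) triples, and the names `normalize` and `split_by_support` are hypothetical.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


def normalize(triple: Triple) -> Triple:
    """Lowercase and trim each part so trivial formatting differences do not matter."""
    s, p, o = triple
    return (s.strip().lower(), p.strip().lower(), o.strip().lower())


def split_by_support(
    claims: Iterable[Triple], kg: Iterable[Triple]
) -> Tuple[List[Triple], List[Triple]]:
    """Split claimed triples into those found in the KG and those not found."""
    kg_index = {normalize(t) for t in kg}
    supported: List[Triple] = []
    unsupported: List[Triple] = []
    for claim in claims:
        if normalize(claim) in kg_index:
            supported.append(claim)
        else:
            unsupported.append(claim)
    return supported, unsupported


kg = [("Marie Curie", "born_in", "Warsaw")]
claims = [
    ("marie curie", "born_in", "Warsaw "),  # matches after normalization
    ("Marie Curie", "born_in", "Paris"),    # not in the KG
]
supported, unsupported = split_by_support(claims, kg)
print(len(supported), len(unsupported))  # 1 1
```

An unsupported triple is not necessarily false; it may simply be missing from the KG, which is the coverage limitation the survey flags.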

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Survey may miss recent or niche works due to page and timeframe limits.

Comparisons mix heterogeneous benchmarks and setups, limiting direct apples-to-apples claims.

When Not To Use

For casual conversational agents where perfect factuality is not required.

When no reliable KG covers your domain or building one is infeasible.

Failure Modes

Retriever returns irrelevant or outdated triples, causing confident but wrong answers.

KGs introduce bias or propagate incorrect facts from their sources.

Core Entities

Models

GPT-3, GPT-3.5, GPT-4, PaLM, T5, Flan-T5, Llama, BART, BERT, RoBERTa

Metrics

Accuracy, Top-K, MRR, Hits@1, Exact Match, Human evaluation
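
Three of these metrics have standard definitions that are easy to compute from ranked predictions. The sketch below uses the usual formulas for Hits@k, mean reciprocal rank, and exact match. The example data is made up, and the exact-match normalization (lowercasing and trimming only) is a simplifying assumption; individual benchmarks often apply stricter answer normalization.

```python
from typing import Sequence


def hits_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int = 1) -> float:
    """Fraction of questions whose gold answer is among the top-k candidates."""
    return sum(g in r[:k] for r, g in zip(ranked, gold)) / len(gold)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Average of 1/rank of the gold answer; contributes 0 when it is absent."""
    total = 0.0
    for r, g in zip(ranked, gold):
        if g in r:
            total += 1.0 / (list(r).index(g) + 1)
    return total / len(gold)


def exact_match(preds: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions equal to the gold answer after simple normalization."""
    return sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, gold)) / len(gold)


# Made-up ranked candidates for three questions.
ranked = [["Warsaw", "Paris"], ["Berlin", "Ulm"], ["Rome", "Oslo"]]
gold = ["Warsaw", "Ulm", "Madrid"]

print(round(hits_at_k(ranked, gold, k=1), 3))   # 0.333
print(mean_reciprocal_rank(ranked, gold))       # 0.5
print(exact_match(["warsaw ", "Berlin"], ["Warsaw", "Ulm"]))  # 0.5
```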

Datasets

WebQSP, WebQuestions, ComplexWebQuestions, Mintaka, MetaQA, HotpotQA, 2WikiMultiHopQA, WebQ, ZJQA, SST, WikiTableQuestions, TabFact, FEVEROUS, SciFact-Open

Benchmarks

KGQA benchmarks (WebQSP, LC-QuAD, MetaQA)

Commonsense QA (CommonSenseQA, OpenBookQA)

Multi-hop QA (HotpotQA, 2WikiMultiHopQA)

Fact verification (FEVEROUS, SciFact-Open)
