Survey: Can knowledge graphs reduce hallucinations in large language models?

November 14, 2023 | 8 min

Overview

Decision Snapshot: Needs Validation

KG augmentation gives practical gains (especially for small models and domain tasks) but depends on retriever quality, KG coverage, and added latency; choose retrieval/validation for immediate benefit and heavy training only for critical domains.

Citations: 16

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Garima Agrawal, Tharindu Kumarage, Zeyad Alghamdi, Huan Liu

Why It Matters For Business

Adding knowledge graphs to LLMs can cut factual errors quickly, especially for small models and domain tasks, improving trustworthiness without full model retraining.

Summary TLDR

This survey reviews methods that add structured knowledge (knowledge graphs, KGs) to large language models to reduce hallucinations. It groups approaches into three practical stages: KG-augmented inference (retrieval, reasoning, controlled generation), KG-aware training (pre-training and fine-tuning), and KG-based validation (fact-checking). The authors report that KG retrieval often boosts small-model QA accuracy substantially (papers report >80% improvements on evaluated QA tasks) and that KG-guided step-wise reasoning can raise reasoning accuracy (e.g., RoG raised ChatGPT from 66.8% to 85.7% on its tests). The survey highlights trade-offs: retrieval and validation are low-cost but rely on

Problem Statement

LLMs often produce plausible-sounding but incorrect statements ('hallucinations') because their internal knowledge is incomplete or outdated. The paper asks: can structured external knowledge (knowledge graphs) be added at inference, training, or validation stages to reduce hallucinations and improve reasoning?

Main Contribution

A concise taxonomy that groups KG-augmentation methods into Knowledge-Aware Inference, Knowledge-Aware Training, and Knowledge-Aware Validation.

A comparison table of representative methods, datasets, LLMs, and training costs to help pick an approach.

Key Findings

KG-augmented retrieval can dramatically improve QA correctness for small models.

Numbers: reported >80% answer correctness gain on QA (Baek et al.; Sen et al.; Wu et al.)

Practical Use: If you run QA with a small model, add KG-based retrieval first; it is a high-impact, low-cost fix compared with scaling model size.

Evidence Ref: Section 4.3; references Baek et al. 2023; Sen et al. 2023; Wu et al. 2023
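
The retrieval step described in this finding can be sketched in a few lines. The example below is a minimal illustration, not the method of any cited paper: the toy graph, the string-matching retriever, and the names `TOY_KG`, `retrieve_triples`, and `build_prompt` are all assumptions made for the sketch. A real system would use entity linking or embedding similarity instead of substring matching.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Toy knowledge graph; hypothetical data for illustration only.
TOY_KG: List[Triple] = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "physics"),
    ("Warsaw", "capital_of", "Poland"),
    ("Albert Einstein", "born_in", "Ulm"),
]


def retrieve_triples(question: str, kg: List[Triple], k: int = 3) -> List[Triple]:
    """Return up to k triples whose subject or object appears in the question."""
    q = question.lower()
    hits = [t for t in kg if t[0].lower() in q or t[2].lower() in q]
    return hits[:k]


def build_prompt(question: str, triples: List[Triple]) -> str:
    """Write the triples out as text and prepend them to the question."""
    facts = "\n".join(f"({s}, {p}, {o})" for s, p, o in triples)
    return f"Facts:\n{facts}\n\nQuestion: {question}\nAnswer:"
```

The resulting prompt string would then be sent to whichever model is under test; comparing answers with and without the "Facts" block gives a rough measure of the retrieval benefit.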

KG-guided stepwise reasoning substantially raises reasoning accuracy on evaluated tasks.

Numbers: ChatGPT accuracy rose from 66.8% to 85.7% with RoG (Luo et al., 2023)

Practical Use: For multi-step or complex QA, combine chain-of-thought prompting with KG retrieval to improve correctness and provide interpretable reasoning paths.

Evidence Ref: Section 3.1.2 and 4.3; Luo et al. 2023
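
To make "interpretable reasoning paths" concrete, the sketch below finds a chain of KG triples linking two entities and writes it out as numbered steps that could be placed in a chain-of-thought prompt. This is a plain breadth-first search over a toy graph, offered as an assumption-laden illustration; it is not RoG's actual planning procedure, and `GRAPH`, `find_path`, and `verbalize` are hypothetical names.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]          # (relation, target entity)
Triple = Tuple[str, str, str]   # (subject, relation, object)

# Toy adjacency list; hypothetical data for illustration only.
GRAPH: Dict[str, List[Edge]] = {
    "Marie Curie": [("born_in", "Warsaw")],
    "Warsaw": [("capital_of", "Poland")],
    "Poland": [("continent", "Europe")],
}


def find_path(graph: Dict[str, List[Edge]], start: str, goal: str,
              max_hops: int = 3) -> Optional[List[Triple]]:
    """Breadth-first search for the shortest chain of triples from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop budget
        for relation, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, path + [(node, relation, target)]))
    return None


def verbalize(path: List[Triple]) -> str:
    """Turn a triple chain into numbered reasoning steps."""
    return "\n".join(f"Step {i}: {s} {r} {o}" for i, (s, r, o) in enumerate(path, 1))
```

Because each step is a KG triple, a reviewer can check the chain against the graph directly, which is the interpretability benefit the finding refers to.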

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| QA answer correctness (small LMs) | >80% improvement reported on evaluated QA tasks when augmented with KG retrieval | no KG retrieval | >80% gain (as reported) | evaluated QA datasets (Baek et al.; Sen et al.; Wu et al.) | Section 4.3; Baek et al. 2023; Sen et al. 2023; Wu et al. 2023 | Section 4.3 |
| Accuracy | 85.7% with KG-guided reasoning | 66.8% without KG-guided reasoning | +18.9 percentage points | reasoning benchmark used in Luo et al. 2023 | Section 3.1.2 and 4.3; Luo et al. 2023 | Luo et al. 2023 |
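
The two results use different kinds of delta: one is a relative improvement, the other an absolute gain in percentage points. The short calculation below works through the reported RoG figures (66.8% to 85.7%) to show both views side by side; the function name `gains` is an illustrative choice.

```python
def gains(baseline: float, value: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = value - baseline
    relative = absolute / baseline * 100
    return absolute, relative


absolute, relative = gains(66.8, 85.7)
print(f"{absolute:.1f} pp absolute, {relative:.1f}% relative")
# prints: 18.9 pp absolute, 28.3% relative
```

Stating which kind of delta is meant avoids comparing a relative figure such as ">80%" directly with an absolute one such as "+18.9 percentage points".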

What To Try In 7 Days

Build a simple KG-backed retriever and prepend retrieved triples to model prompts for QA experiments.

Run KG-based post-generation fact checks on a sample of high-stakes outputs to measure hallucination rates.

Pilot chain-of-thought + KG retrieval on a handful of multi-step queries to compare accuracy vs. baseline.
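
The post-generation fact check in the second item can be prototyped as a set lookup. The sketch below assumes claims have already been extracted from model output as (subject, predicate, object) triples; that extraction step, the toy `KG_FACTS` set, and the function `check_claims` are all assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# Toy reference facts; hypothetical data for illustration only.
KG_FACTS: Set[Triple] = {
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
}


def check_claims(claims: List[Triple], kg: Set[Triple]) -> Dict[str, object]:
    """Split claims into KG-supported and unsupported, and report the unsupported rate."""
    supported = [c for c in claims if c in kg]
    unsupported = [c for c in claims if c not in kg]
    rate = len(unsupported) / len(claims) if claims else 0.0
    return {"supported": supported, "unsupported": unsupported, "unsupported_rate": rate}
```

An unsupported claim is not necessarily false: it may simply fall outside the KG's coverage, so the unsupported rate is best read as an upper bound on hallucination for the sampled outputs.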

Agent Features

Memory: external KG as non-parametric memory
Tool Use: KG retrievers; SPARQL or structured query interfaces; prompting frameworks (LangChain, LlamaIndex)
Frameworks: RAG, StructGPT, KICGPT
Architectures: retrieval-augmented pipelines; graph-text joint encoders
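
As an illustration of the SPARQL interface listed under Tool Use, the sketch below builds a query that fetches the outgoing edges of one entity. It only constructs the query string; the endpoint, the example IRI, and the helper name `one_hop_query` are assumptions, and sending the query is left to whatever SPARQL client is in use.

```python
def one_hop_query(entity_iri: str, limit: int = 20) -> str:
    """Build a SPARQL SELECT query for the outgoing edges of one entity."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Reject characters that are not allowed inside a SPARQL IRI reference.
    if any(ch in entity_iri for ch in '<>"{}|^`\\ '):
        raise ValueError("entity_iri contains characters not allowed in an IRI")
    return (
        "SELECT ?p ?o WHERE {\n"
        f"  <{entity_iri}> ?p ?o .\n"
        "}\n"
        f"LIMIT {limit}"
    )


# Hypothetical IRI, for illustration only.
print(one_hop_query("http://example.org/entity/MarieCurie", limit=5))
```

Each returned (?p, ?o) pair, together with the entity, forms one triple that can be written out as text and added to the prompt.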

Optimization Features

Token Efficiency: Textualized KG triples used as compact context
Infra Optimization: Prefer retrieval/validation when compute budget prevents heavy fine-tuning
Model Optimization: Accuracy
System Optimization: Use fast KG retrievers to limit latency; cache common subgraphs for frequent queries
Training Optimization: KG-guided masking and fusion for pre-training (knowledge-enhanced pretraining); KG-based synthetic corpora for targeted fine-tuning
Inference Optimization: Retrieve relevant KG triples at runtime instead of retraining; controlled generation to limit model outputs to KG facts
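
Caching common subgraphs, as suggested under System Optimization, can be tried with Python's built-in `functools.lru_cache`. The sketch below is a minimal illustration: the in-memory `KG` dictionary and `_query_backend` stand in for a real, slower KG store, and all names are assumptions.

```python
from functools import lru_cache
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Toy store; hypothetical data standing in for a remote KG.
KG: Dict[str, List[Triple]] = {
    "Marie Curie": [("Marie Curie", "born_in", "Warsaw")],
    "Warsaw": [("Warsaw", "capital_of", "Poland")],
}

BACKEND_CALLS = {"count": 0}  # counts how often the (slow) backend is hit


def _query_backend(entity: str) -> Tuple[Triple, ...]:
    """Stand-in for an expensive KG lookup."""
    BACKEND_CALLS["count"] += 1
    return tuple(KG.get(entity, []))


@lru_cache(maxsize=1024)
def get_subgraph(entity: str) -> Tuple[Triple, ...]:
    """Return the entity's triples, serving repeated requests from the cache."""
    return _query_backend(entity)
```

Calling `get_subgraph("Warsaw")` twice reaches the backend only once. Cached entries can go stale when the KG is updated, so `get_subgraph.cache_clear()` should be called after updates to avoid serving outdated triples.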

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Survey may miss recent or niche works due to page and timeframe limits.

Comparisons mix heterogeneous benchmarks and setups, limiting direct apples-to-apples claims.

When Not To Use

For casual conversational agents where perfect factuality is not required.

When no reliable KG covers your domain or building one is infeasible.

Failure Modes

Retriever returns irrelevant or outdated triples, causing confident but wrong answers.

KGs introduce bias or propagate incorrect facts from their sources.

Core Entities

Models

GPT-3, GPT-3.5, GPT-4, PaLM, T5, Flan-T5, Llama, BART, BERT, RoBERTa

Metrics

Accuracy, Top-K, MRR, Hits@1, Exact Match, Human evaluation
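
For readers unfamiliar with the ranking metrics listed here, the sketch below shows how Hits@k, MRR (mean reciprocal rank), and Exact Match are commonly computed. The exact-match normalization (lowercasing and trimming whitespace) is a simplification assumed for this sketch; individual benchmarks define their own normalization rules.

```python
from typing import List, Sequence


def hits_at_k(ranked: Sequence[str], gold: str, k: int = 1) -> float:
    """1.0 if the gold answer is among the top-k ranked candidates, else 0.0."""
    return 1.0 if gold in ranked[:k] else 0.0


def reciprocal_rank(ranked: Sequence[str], gold: str) -> float:
    """1/rank of the gold answer (ranks start at 1), or 0.0 if it is absent."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate == gold:
            return 1.0 / rank
    return 0.0


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if prediction equals gold after lowercasing and trimming whitespace."""
    return float(prediction.strip().lower() == gold.strip().lower())


def mean(values: List[float]) -> float:
    """Average of a list of per-question scores; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0
```

MRR is the mean of `reciprocal_rank` over all questions, and Hits@1 is the mean of `hits_at_k` with `k=1`.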

Datasets

WebQSP, WebQuestions, ComplexWebQuestions, Mintaka, MetaQA, HotpotQA, 2WikiMultiHopQA, WebQ, ZJQA, SST, WikiTableQuestions, TabFact, FEVEROUS, SciFact-Open

Benchmarks

KGQA benchmarks (WebQSP, LC-QuAD, MetaQA)

Commonsense QA (CommonSenseQA, OpenBookQA)

Multi-hop QA (HotpotQA, 2WikiMultiHopQA)

Fact verification (FEVEROUS, SciFact-Open)