Survey: Can knowledge graphs reduce hallucinations in large language models?

November 14, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

16

Authors

Garima Agrawal, Tharindu Kumarage, Zeyad Alghamdi, Huan Liu

Why It Matters For Business

Adding knowledge graphs to LLMs can cut factual errors quickly, especially for small models and domain tasks, improving trustworthiness without full model retraining.

Summary TLDR

This survey reviews methods that add structured knowledge (knowledge graphs, KGs) to large language models to reduce hallucinations. It groups approaches into three practical stages: KG-augmented inference (retrieval, reasoning, controlled generation), KG-aware training (pre-training and fine-tuning), and KG-based validation (fact-checking). The authors report that KG retrieval often boosts small-model QA accuracy substantially (papers report >80% improvements on evaluated QA tasks) and that KG-guided step-wise reasoning can raise reasoning accuracy (e.g., RoG raised ChatGPT from 66.8% to 85.7% on its tests). The survey highlights trade-offs: retrieval and validation are low-cost but rely on

Problem Statement

LLMs often produce plausible-sounding but incorrect statements ('hallucinations') because their internal knowledge is incomplete or outdated. The paper asks: can structured external knowledge (knowledge graphs) be added at inference, training, or validation stages to reduce hallucinations and improve reasoning?

Main Contribution

A concise taxonomy that groups KG-augmentation methods into Knowledge-Aware Inference, Knowledge-Aware Training, and Knowledge-Aware Validation.

A comparison table of representative methods, datasets, LLMs, and training costs to help pick an approach.

A practical synthesis of empirical findings, limits, trends, and open directions for reducing hallucinations with KGs.

Key Findings

KG-augmented retrieval can dramatically improve QA correctness for small models.

Numbers: reported >80% answer correctness gain on QA (Baek et al.; Sen et al.; Wu et al.)

KG-guided stepwise reasoning substantially raises reasoning accuracy on evaluated tasks.

Numbers: ChatGPT accuracy rose from 66.8% to 85.7% with RoG (Luo et al., 2023)

KG-based methods can achieve high domain accuracy when paired with domain graphs.

Numbers: disease diagnosis/drug recommendation accuracy reported at 88.2% (MindMap, Wen et al. 2023)

Pre-training or heavy fine-tuning with KGs improves domain performance but is costly.

Numbers: multiple works require large compute and long training (Table 1 summaries)

KG validation (fact-checking) reduces hallucinations but adds runtime cost and can miss gaps.

Numbers: trade-off described in Section 4.3; increased compute and partial coverage noted (Kang et al. 2022b)
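The validation finding above can be illustrated with a minimal sketch. It assumes the model's claims have already been extracted as (subject, predicate, object) triples, which is itself a hard step not shown here, and it treats every predicate as single-valued, which is not true in general. The function name `check_claim` and the toy graph are hypothetical, not taken from the survey.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

def check_claim(claim: Triple, kg: Set[Triple]) -> str:
    """Classify a claimed triple as 'supported', 'contradicted', or 'unknown'."""
    subj, pred, _ = claim
    if claim in kg:
        return "supported"
    # Assumption: a predicate has one valid object per subject, so a stored
    # triple with the same subject and predicate contradicts the claim.
    if any(s == subj and p == pred for s, p, _o in kg):
        return "contradicted"
    # The KG says nothing about this subject/predicate pair: a coverage gap.
    return "unknown"

# Toy graph for illustration only.
kg = {
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
}

verdicts = [
    check_claim(("Marie Curie", "born_in", "Warsaw"), kg),
    check_claim(("Marie Curie", "born_in", "Paris"), kg),
    check_claim(("Marie Curie", "died_in", "Passy"), kg),
]
```

The "unknown" verdict is the case the survey warns about: a claim the KG cannot confirm or refute should be surfaced for review rather than silently passed.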

Results

QA answer correctness (small LMs)

Value: >80% improvement reported on evaluated QA tasks when augmented with KG retrieval

Baseline: no KG retrieval

Accuracy

Value: 85.7% with KG-guided reasoning

Baseline: 66.8% without KG-guided reasoning

Accuracy

Value: 88.2% disease diagnosis/drug recommendation accuracy reported

Baseline: unspecified baseline in MindMap (Wen et al. 2023)
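A gain can be read either as percentage points or as relative improvement, and the two differ considerably. The short sketch below works this out for the one result that lists both a value and a baseline (66.8% to 85.7%); the helper name `gains` is hypothetical. It does not settle which reading applies to the ">80% improvement" figure, since no baseline is given for it here.

```python
from typing import Tuple

def gains(baseline: float, value: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = value - baseline
    relative = 100.0 * points / baseline
    return points, relative

# Accuracy with and without KG-guided reasoning, as reported above.
points, relative = gains(66.8, 85.7)
print(f"{points:.1f} points, {relative:.1f}% relative")  # 18.9 points, 28.3% relative
```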

Who Should Care

What To Try In 7 Days

Build a simple KG-backed retriever and prepend retrieved triples to model prompts for QA experiments.

Run KG-based post-generation fact checks on a sample of high-stakes outputs to measure hallucination rates.

Pilot chain-of-thought + KG retrieval on a handful of multi-step queries to compare accuracy vs. baseline.
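As a starting point for the first experiment, here is a minimal sketch of a KG-backed retriever that prepends textualized triples to a prompt. Everything in it is an assumption for illustration: the toy graph, the substring-based entity matching (a real system would use entity linking and a graph store), and the names `retrieve_triples` and `build_prompt`. The resulting prompt string would then be sent to whichever model is under test.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Toy knowledge graph; a real system would query a graph store instead.
TOY_KG: List[Triple] = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "physics"),
    ("Warsaw", "capital_of", "Poland"),
    ("Albert Einstein", "born_in", "Ulm"),
]

def retrieve_triples(question: str, kg: List[Triple], k: int = 3) -> List[Triple]:
    """Return up to k triples whose subject or object is mentioned in the question."""
    q = question.lower()
    hits = [t for t in kg if t[0].lower() in q or t[2].lower() in q]
    return hits[:k]

def build_prompt(question: str, triples: List[Triple]) -> str:
    """Prepend retrieved triples, written as plain text, to the question."""
    facts = "\n".join(f"({s}, {p}, {o})" for s, p, o in triples)
    return (
        f"Facts:\n{facts}\n\n"
        f"Question: {question}\n"
        "Answer using only the facts above."
    )

question = "Where was Marie Curie born?"
retrieved = retrieve_triples(question, TOY_KG)
prompt = build_prompt(question, retrieved)
```

Running the same questions with and without the "Facts" block gives the baseline comparison the experiments above call for.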

Agent Features

Memory

  • external KG as non-parametric memory

Tool Use

  • KG retrievers
  • SPARQL or structured query interfaces
  • prompting frameworks (LangChain, LlamaIndex)

Frameworks

  • RAG
  • StructGPT
  • KICGPT

Architectures

  • retrieval-augmented pipelines
  • graph-text joint encoders

Optimization Features

Token Efficiency

  • Textualized KG triples used as compact context

Infra Optimization

  • Prefer retrieval/validation when compute budget prevents heavy fine-tuning

Model Optimization

  • Accuracy

System Optimization

  • Use fast KG retrievers to limit latency
  • Cache common subgraphs for frequent queries
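One simple way to cache common subgraphs is to memoize the neighbourhood lookup. The sketch below uses Python's standard `functools.lru_cache`; the in-memory `EDGES` dictionary and the function `one_hop_subgraph` are hypothetical stand-ins for a slower KG backend.

```python
from functools import lru_cache
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical adjacency list standing in for a slow KG backend.
EDGES: Dict[str, List[Tuple[str, str]]] = {
    "Aspirin": [("treats", "headache"), ("interacts_with", "warfarin")],
    "Warfarin": [("treats", "thrombosis")],
}

@lru_cache(maxsize=1024)
def one_hop_subgraph(entity: str) -> Tuple[Triple, ...]:
    """Fetch the 1-hop neighbourhood of an entity; repeated calls hit the cache."""
    # Return a tuple so the cached value cannot be mutated by callers.
    return tuple((entity, pred, obj) for pred, obj in EDGES.get(entity, []))

one_hop_subgraph("Aspirin")  # miss: computed and stored
one_hop_subgraph("Aspirin")  # hit: served from the cache
stats = one_hop_subgraph.cache_info()
```

A cache like this trades freshness for latency: when the underlying graph changes, `one_hop_subgraph.cache_clear()` (or a time-based expiry, not shown) is needed to avoid serving outdated triples.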

Training Optimization

  • KG-guided masking and fusion for pre-training (knowledge-enhanced pretraining)
  • KG-based synthetic corpora for targeted fine-tuning

Inference Optimization

  • Retrieve relevant KG triples at runtime instead of retraining
  • Controlled generation to limit model outputs to KG facts

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Survey may miss recent or niche works due to page and timeframe limits.
  • Comparisons mix heterogeneous benchmarks and setups, limiting direct apples-to-apples claims.
  • KG methods depend on KG coverage and quality; limited or biased KGs reduce benefits.
  • Pretraining/fine-tuning approaches are resource-intensive and less portable.

When Not To Use

  • For casual conversational agents where perfect factuality is not required.
  • When no reliable KG covers your domain or building one is infeasible.
  • When low-latency, zero-additional-infrastructure inference is mandatory.

Failure Modes

  • Retriever returns irrelevant or outdated triples, causing confident but wrong answers.
  • KGs introduce bias or propagate incorrect facts from their sources.
  • Validation misses gaps in the KG, giving a false sense of safety.
  • KG integration increases latency and system complexity.

Core Entities

Models

  • GPT-3
  • GPT-3.5
  • GPT-4
  • PaLM
  • T5
  • Flan-T5
  • Llama
  • BART
  • BERT
  • RoBERTa

Metrics

  • Accuracy
  • Top-K
  • MRR
  • Hits@1
  • Exact Match
  • Human evaluation
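For readers less familiar with the ranking metrics listed above, the following sketch shows one common way to compute Hits@k, MRR, and Exact Match. The answer normalization used for Exact Match (lowercasing and collapsing whitespace) is an assumption; individual benchmarks define their own rules.

```python
from typing import Sequence

def hits_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int = 1) -> float:
    """Fraction of questions whose gold answer is among the top-k ranked candidates."""
    return sum(g in r[:k] for r, g in zip(ranked, gold)) / len(gold)

def mrr(ranked: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Mean reciprocal rank; a question whose gold answer is absent contributes 0."""
    total = 0.0
    for r, g in zip(ranked, gold):
        if g in r:
            total += 1.0 / (list(r).index(g) + 1)
    return total / len(gold)

def exact_match(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions equal to the gold answer after simple normalization."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return sum(norm(p) == norm(g) for p, g in zip(predictions, gold)) / len(gold)

# Toy example: three questions with ranked candidate answers.
ranked = [["a", "b"], ["c", "d"], ["x", "y"]]
gold = ["a", "d", "z"]
h1 = hits_at_k(ranked, gold, k=1)
score_mrr = mrr(ranked, gold)
em = exact_match(["Paris ", "rome"], ["paris", "Milan"])
```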

Datasets

  • WebQSP
  • WebQuestions
  • ComplexWebQuestions
  • Mintaka
  • MetaQA
  • HotpotQA
  • 2WikiMultiHopQA
  • WebQ
  • ZJQA
  • SST
  • WikiTableQuestions
  • TabFact
  • FEVEROUS
  • SciFact-Open

Benchmarks

  • KGQA benchmarks (WebQSP, LC-QuAD, MetaQA)
  • Commonsense QA (CommonSenseQA, OpenBookQA)
  • Multi-hop QA (HotpotQA, 2WikiMultiHopQA)
  • Fact verification (FEVEROUS, SciFact-Open)