Knowledge Graph RAG Papers — Parsed & Scored for Practitioners

RoG: Ground LLM plans on knowledge‑graph relation paths for faithful, interpretable KGQA

0.60

0.50

38

RoG reduces hallucinations by grounding LLM reasoning in KG facts and provides traceable, human-readable paths—this improves accuracy and trust on KG-backed QA without retraining every LLM.

Key finding

RoG reports new best scores on two standard KGQA benchmarks, WebQSP and CWQ.

Numbers: WebQSP Hits@1 85.7; F1 70.8. CWQ Hits@1 62.6; F1 56.2.
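
For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how per-question Hits@1 and set-based answer F1 are commonly computed in KGQA evaluation. The function names and the sample answers are illustrative assumptions; this is not RoG's evaluation script, and details such as answer normalization vary between papers.

```python
def hits_at_1(ranked_predictions, gold_answers):
    """Return 1.0 if the top-ranked prediction is a gold answer, else 0.0."""
    if not ranked_predictions:
        return 0.0
    return 1.0 if ranked_predictions[0] in set(gold_answers) else 0.0


def answer_f1(predictions, gold_answers):
    """Set-based F1 between predicted and gold answers for one question."""
    pred, gold = set(predictions), set(gold_answers)
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical question with two gold answers and two predictions.
print(hits_at_1(["Paris", "Lyon"], ["Paris"]))
print(answer_f1(["a", "b"], ["a", "c"]))
```

Benchmark-level scores are then typically the mean of these per-question values, which is why Hits@1 can sit well above F1 when a model returns one correct answer but misses others.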

A practical map of how knowledge graphs and multimodal AI fit together today and where to push next

0.60

0.50

0.60

28

Adding structured knowledge to multimodal systems improves accuracy, interpretability, and long-tail reasoning. That helps applications like search, recommendation, product QA, and compliance where factual grounding and rare facts matter.

Key finding

The survey covers more than 300 related papers.

Numbers: ‘over 300 articles’ (abstract)

Survey: using graph structure to make RAG more precise, concise, and context-aware

0.40

0.60

22

GraphRAG injects relational facts into LLM outputs, reducing hallucination and shortening input prompts; this improves accuracy for QA, search, and domain workflows while leveraging existing graph databases.

Key finding

GraphRAG workflow decomposes into three repeatable stages: Graph-Based Indexing, Graph-Guided Retrieval, and Graph-Enhanced Generation.

Numbers: 3 stages
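
The three stages named above can be sketched as a toy pipeline. This is an illustration under simplifying assumptions, not an implementation from the survey: the index is a plain dictionary, retrieval is case-insensitive substring matching on entity names, and the generation stage only builds the prompt that would be sent to an LLM. All function names and the sample triples are hypothetical.

```python
from collections import defaultdict


def build_index(triples):
    """Graph-based indexing: map each head entity (lowercased) to its facts."""
    index = defaultdict(list)
    for head, relation, tail in triples:
        index[head.lower()].append((head, relation, tail))
    return index


def retrieve(index, question):
    """Graph-guided retrieval: return facts whose head entity is named in the question."""
    lowered = question.lower()
    facts = []
    for entity, entity_facts in index.items():
        if entity in lowered:
            facts.extend(entity_facts)
    return facts


def build_prompt(question, facts):
    """Graph-enhanced generation: serialize the facts into the LLM prompt."""
    lines = [f"{h} --{r}--> {t}" for h, r, t in facts]
    return "Facts:\n" + "\n".join(lines) + f"\n\nQuestion: {question}\nAnswer:"


triples = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
]
index = build_index(triples)
question = "Where was Marie Curie born?"
print(build_prompt(question, retrieve(index, question)))
```

A production system would replace the substring match with entity linking or embedding search and would usually expand retrieval to multi-hop neighbors, but the division of labor between the stages stays the same.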

Use RAG + PCST to let LLMs 'chat' with very large textual graphs

0.60

0.70

22

If you need natural-language queries over large text-rich graphs, G-Retriever scales to huge graphs, speeds training and inference dramatically, and reduces wrong citations by returning the exact subgraph used to answer.

Key finding

G-Retriever lifts WebQSP Hit@1 from 57.05% (GraphToken) to 70.49% with frozen LLM prompt tuning and to 73.79% with LoRA tuning.

Numbers: WebQSP: GraphToken 57.05% → G-Retriever 70.49% → G-Retriever+LoRA 73.79%

Survey: Can knowledge graphs reduce hallucinations in large language models?

0.60

0.50

0.70

16

Adding knowledge graphs to LLMs can cut factual errors quickly, especially for small models and domain tasks, improving trustworthiness without full model retraining.

Key finding

KG-augmented retrieval can dramatically improve QA correctness for small models.

Numbers: reported >80% answer correctness gain on QA (Baek et al.; Sen et al.; Wu et al.)

KG-Agent: a tool-augmented autonomous 7B LLM that reasons step-by-step over knowledge graphs

0.60

0.65

0.70

12

You can get KG-backed, multi-hop reasoning without expensive closed LLM APIs by fine-tuning a 7B open model on ~10K program-like instructions, cutting cost and improving cross-domain use of external KGs.

Key finding

Instruction-tuned KG-Agent (LLaMA2-7B) improves KGQA F1 over prior baselines on in-domain tests.

Numbers: F1 gains: WebQSP +1.7%, CWQ +7.5%, GrailQA +2.7% (Sec 5.2, Table 2)

MindMap: prompt LLMs with knowledge-graph evidence to produce explicit graph-style reasoning and reduce hallucination

0.60

0.70

0.50

11

MindMap makes LLM outputs more factual and inspectable by forcing the model to reason over KG-derived evidence graphs; this reduces hallucination risk in knowledge-heavy apps like medical assistants and increases trust.

Key finding

MindMap improves semantic match (BERTScore F1) on a clinical QA set (GenMedGPT-5k) compared to GPT-3.5 and other retrievers.

Numbers: BERTScore F1: MindMap 0.7954 vs GPT-3.5 0.7800 (Table 2)

Use an LLM to spot its own factual claims and auto-check them against Wikidata to cut hallucinations

0.60

0.50

10

KGR can reduce factual errors in model outputs, especially for multi-step reasoning tasks, lowering risk in customer-facing answers and automated reporting without retraining large models.

Key finding

KGR raises ChatGPT F1 on Mintaka (complex reasoning) by about 6.2 points over question-relevant KG retrieval (QKR).

Numbers: ChatGPT Mintaka F1: QKR 54.6 -> KGR 60.8 (+6.2)

Use UMLS definitions and relations to make LLM answers more factual and complete for medical questions

0.50

0.40

0.70

8

Injecting curated UMLS content into prompts can raise factuality and completeness without costly model fine-tuning; it is a lower-cost way to make LLM answers safer for medical use, though user readability may require UX work.

Key finding

UMLS augmentation raised LLaMA2-13b-chat ROUGE-1 from 19.07 to 19.97 on LiveQA.

Numbers: R-1 +0.90 (19.07 → 19.97)

Knowledge graph triples GPT-4 accuracy for enterprise QA (16.7% → 54.2%)

0.30

0.45

0.50

7

Adding a knowledge graph layer (ontology + mappings) substantially improves LLM answer accuracy on enterprise SQL: expect major gains for normalized schemas and KPI-style questions.

Key finding

Knowledge-graph context raised GPT-4 execution accuracy from 16.7% to 54.2%.

Numbers: SQL 16.7% → SPARQL 54.2% (Table 1)

Turn an LLM output into a mini knowledge graph, check each fact with an NLI model, and get explainable hallucination flags

0.70

0.50

0.60

6

GraphEval pinpoints which facts in an LLM output are ungrounded and raises automatic detector accuracy, enabling targeted fixes and cheaper, explainable QA for production systems.

Key finding

Adding GraphEval to NLI-based detectors raises balanced accuracy on three summarization benchmarks.

Numbers: avg +6.2 balanced-accuracy (SE=1.3) across SummEval, QAGS‑C, QAGS‑X
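
The check-each-fact idea described above can be sketched as follows. In the described method the triples come from an LLM and the entailment check is an NLI model; here both are replaced by stand-ins (hand-written triples and a word-overlap function named `toy_entails`), so treat this as an illustration of the control flow only. A small `balanced_accuracy` helper is included because that is the metric quoted in the numbers line.

```python
def flag_ungrounded(triples, context, entails):
    """Return the triples that the entailment checker does not support from the context."""
    return [t for t in triples if not entails(context, " ".join(t))]


def toy_entails(premise, hypothesis):
    """Stand-in for an NLI model: true if every hypothesis word occurs in the premise."""
    premise_words = set(premise.lower().split())
    return all(word in premise_words for word in hypothesis.lower().split())


def balanced_accuracy(y_true, y_pred):
    """Mean of true-positive rate and true-negative rate for binary labels (0 or 1)."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    tpr = sum(1 for p in pos if p == 1) / len(pos)
    tnr = sum(1 for p in neg if p == 0) / len(neg)
    return (tpr + tnr) / 2


context = "aspirin treats headache and reduces fever"
triples = [("aspirin", "treats", "headache"), ("aspirin", "cures", "cancer")]
print(flag_ungrounded(triples, context, toy_entails))
```

Because the output is a list of specific unsupported triples rather than a single score, a reviewer can see which fact triggered the hallucination flag.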

KG-Rank: combine a medical knowledge graph with triplet ranking to make long-form medical answers more factual

0.60

0.50

5

KG-Rank reduces factual errors in long-form domain answers by selecting relevant KG facts before generation, making prototypes for clinical documentation, help centers, or domain QA more reliable. It still requires clinician review and careful deployment.

Key finding

KG-Rank raised ROUGE-L on ExpertQA-Bio from 23.00 to 27.20.

Numbers: ROUGE-L 23.00 → 27.20 (+18.3%)
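
Selecting relevant KG facts before generation can be sketched as a ranking step. The example below scores each triple against the question with Jaccard word overlap, which is a deliberately simple stand-in chosen for brevity; it is not KG-Rank's ranking method, and the function names and sample triples are hypothetical.

```python
def jaccard(text_a, text_b):
    """Word-level Jaccard similarity between two strings."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def rank_triples(question, triples, top_k=2):
    """Keep the top_k triples most similar to the question (stand-in scorer)."""
    ranked = sorted(
        triples,
        key=lambda triple: jaccard(question, " ".join(triple)),
        reverse=True,
    )
    return ranked[:top_k]


triples = [
    ("ibuprofen", "treats", "migraine"),
    ("insulin", "regulates", "glucose"),
    ("migraine", "causes", "pain"),
]
for triple in rank_triples("what treats migraine pain", triples):
    print(triple)
```

Only the retained triples would be placed in the prompt, which keeps unrelated facts (here, the insulin triple) out of the generation context.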

AriGraph: combine a semantic knowledge graph and episodic memory so an LLM agent remembers and plans across long, partially observed text‑en

0.60

0.75

0.70

4

Structured, updateable graph memory lets LLM agents remember facts and episodes efficiently, improving long-horizon planning while reducing costly prompt tokens compared to heavy RAG systems.

Key finding

On Treasure Hunt (TextWorld) AriGraph achieved full normalized score while Full History scored 0.47.

Numbers: AriGraph 1.0 vs Full History 0.47 (Table 4)

Zep: temporal knowledge-graph memory for agents — faster retrieval and better long-term accuracy

0.80

0.70

0.80

4

Zep returns smaller, temporally-correct context to LLMs, so agents answer complex multi-session and time-sensitive questions more accurately while cutting latency and token costs.

Key finding

Zep edges out MemGPT on DMR with gpt-4-turbo.

Numbers: 94.8% vs 93.4% (DMR, gpt-4-turbo)
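
To illustrate what "temporally-correct context" can mean in practice, here is a minimal sketch of facts that carry a validity interval and a query that returns only the facts valid on a given date. The `TemporalFact` class, its field names, and the sample data are assumptions made for this example; they are not Zep's data model or API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TemporalFact:
    subject: str
    predicate: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact is still valid


def facts_valid_at(facts, when):
    """Return facts whose interval [valid_from, valid_to) contains `when`."""
    return [
        f for f in facts
        if f.valid_from <= when and (f.valid_to is None or when < f.valid_to)
    ]


facts = [
    TemporalFact("Alice", "works_at", "Acme", date(2020, 1, 1), date(2022, 6, 1)),
    TemporalFact("Alice", "works_at", "Globex", date(2022, 6, 1)),
]
for fact in facts_valid_at(facts, date(2023, 1, 1)):
    print(fact.subject, fact.predicate, fact.obj)
```

Closing an old fact's interval instead of deleting it lets the same store answer both "where does Alice work now?" and "where did Alice work in 2021?" with a small, date-filtered context.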

E-KELL: a KG-backed LLM system that guides decisions with standards-based evidence to cut hallucinations

0.50

0.60

4

For safety-critical operations, E-KELL-style KG+LLM reduces hallucination and ensures answers trace back to standards, lowering legal and operational risk while making guidance faster and more auditable.

Key finding

E-KELL produced factually correct and standards-compliant answers on the 10 evaluated queries.

Numbers: Factually correct 10/10; In compliance with standards 10/10 (Table 1)

KnowGPT: use RL to pick concise KG facts and a bandit to pick prompt formats for closed‑box LLMs

0.60

0.70

4

KnowGPT upgrades closed‑box LLM accuracy using existing KGs while trimming prompt size and API costs. It lets teams improve domain QA without fine‑tuning large models or owning model weights.

Key finding

KnowGPT raises QA accuracy substantially over baseline LLMs on three datasets.

Numbers: Avg +23.7% vs GPT‑3.5; Avg +2.9% vs GPT‑4
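
The idea of using a bandit to pick a prompt format can be sketched with an epsilon-greedy policy. Epsilon-greedy is chosen here only because it is short; the paper's actual bandit formulation may differ. The arm names and the reward convention (1.0 when the LLM answer was correct, 0.0 otherwise) are assumptions made for illustration.

```python
import random


class EpsilonGreedyBandit:
    """Pick among prompt formats, mostly exploiting the best average reward so far."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in self.arms}
        self.values = {arm: 0.0 for arm in self.arms}  # running mean reward per arm
        self.rng = random.Random(seed)

    def select(self):
        """Explore a random arm with probability epsilon, else exploit the best arm."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.values[arm])

    def update(self, arm, reward):
        """Fold one observed reward into the arm's running mean."""
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Hypothetical prompt formats: raw triples versus natural-language sentences.
bandit = EpsilonGreedyBandit(["triples", "sentences"], epsilon=0.0)
bandit.update("triples", 1.0)    # the triple-format prompt led to a correct answer
bandit.update("sentences", 0.0)  # the sentence-format prompt did not
print(bandit.select())
```

Since the policy only needs a reward signal from the final answer, it works with closed-box LLMs where gradients and model weights are unavailable.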

STARK: a large benchmark testing LLM-based retrieval on semi-structured knowledge (text + graph)

0.60

0.70

0.55

4

Search and recommendation systems often need to reason over both product text and structured relationships; STARK shows many current retrievers miss important multi-hop or relational signals, so products relying on naive retrieval risk poor search quality or unsafe omissions.

Key finding

The classic sparse baseline (BM25) remains competitive and often outperforms small dense retrievers on STARK.

Numbers: STARK-AMAZON (synth): BM25 Hit@1 44.94 vs DPR Hit@1 15.29 (Table 6)

Automated agent-driven medical knowledge graphs improve medical QA and rival much larger models

0.60

0.70

3

An automated, confidence-scored medical knowledge graph lets smaller LLMs deliver near state-of-the-art medical QA, reducing compute cost and enabling more interpretable, up-to-date answers.

Key finding

AMG-RAG (8B) reaches F1 74.1% on MEDQA.

Numbers: F1 = 74.1% (MEDQA)

Use knowledge-graph paths + chain-of-thought to guide LLMs for domain QA with only three LLM calls

0.60

0.45

0.60

3

RoK gives domain QA systems more accurate, interpretable answers while cutting LLM calls and API cost by structuring knowledge as ranked KG paths.

Key finding

RoK improves key-entity match accuracy on GenMedGPT-5k to 81.3% from GPT-3.5's 52.6%.

Numbers: GenMedGPT-5k key-entity match: RoK 81.3% vs GPT-3.5 52.6%

FiDeLiS: narrow retrieval + deductive beam search to ground LLM answers in KG paths

0.60

0.70

3

FiDeLiS improves factual QA without model retraining by combining KG retrieval and stepwise logic checks, raising answer accuracy and cutting runtime—useful where auditability and verifiable facts matter.

Key finding

FiDeLiS improves top-answer accuracy on WebQSP with strong LLMs.

Numbers: WebQSP Hits@1 84.39% (FiDeLiS, GPT‑4‑turbo) vs 81.84% (ToG)
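
Beam search over KG paths can be sketched as below. This is a generic beam search under stated assumptions, not FiDeLiS itself: the `score_step` callback stands in for the LLM-based scoring and stepwise checks the method relies on, the graph is a small adjacency dictionary, and all names and sample data are hypothetical.

```python
def beam_search_paths(graph, start, score_step, beam_width=2, max_hops=2):
    """Expand relation paths from `start`, keeping the best `beam_width` per hop.

    graph maps an entity to a list of (relation, neighbor) pairs.
    score_step maps a relation to a numeric score (higher is better).
    Returns a list of (path, score); a path alternates entity, relation, entity.
    """
    beams = [([start], 0.0)]
    for _ in range(max_hops):
        candidates = []
        for path, score in beams:
            for relation, neighbor in graph.get(path[-1], []):
                candidates.append(
                    (path + [relation, neighbor], score + score_step(relation))
                )
        if not candidates:
            break  # no beam can be extended further
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


graph = {
    "Obama": [("born_in", "Honolulu"), ("spouse", "Michelle")],
    "Honolulu": [("located_in", "Hawaii")],
    "Michelle": [("born_in", "Chicago")],
}
relation_scores = {"born_in": 1.0, "located_in": 0.8}
best = beam_search_paths(
    graph, "Obama", lambda r: relation_scores.get(r, 0.1), beam_width=1
)
print(best[0][0])
```

Every returned answer comes with the explicit path that produced it, which is what makes this style of retrieval auditable.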

Graphusion: zero-shot LLM pipeline that builds and fuses scientific concept graphs for NLP tutoring

0.60

3

Graphusion cuts expert labeling by using LLMs plus a fusion step to build domain concept graphs, which can immediately improve tutoring and QA services without large supervised datasets.

Key finding

LLM zero-shot link prediction with retrieval outperforms supervised baselines on LectureBankCD (NLP).

Numbers: GPT-4o (RAG) Accuracy 0.8117 vs BERT 0.7088 (+0.1029)

Inject knowledge-graph vectors and correlation matrices into transformer layers to improve GLUE tasks

0.40

0.50

0.30

3

Injecting knowledge-graph embeddings into transformer internals can raise NLU accuracy and cut labeling needs; this helps deliver stronger models faster where domain or commonsense context matters.

Key finding

Deep infusion (vectors + attention across all blocks) raises GLUE task scores over baseline XLNet.

Numbers: XLNet MNLI: baseline 72.3% -> deep 88.53% (+16.23pp) (Table 1)

Use an LLM to break sentences, pull subgraphs, and reason over knowledge graphs

0.40

0.50

0.30

3

KG-GPT lets you add structured KG reasoning to LLM pipelines with little labeled data. Use it to prototype fact verification or KGQA systems quickly before investing in custom supervised retrievers.

Key finding

KG-GPT reaches 72.68% accuracy on FACTKG using evidence retrieval and few-shot prompts.

Numbers: Accuracy 72.68% (KG-GPT) vs 77.65% (GEAR)

VideoRAG: index and search unlimited‑length videos with graph grounding plus multi‑modal retrieval

0.60

0.70

0.40

2

VideoRAG enables searchable QA and summarization across many long videos, unlocking education, media-archive search, and customer-support video analytics without retraining large models.

Key finding

VideoRAG wins more LLM head-to-head judgments than standard RAG baselines.

Numbers: VideoRAG chosen 53.26% vs baselines' 46.74% (Overall Winner, Table 2)