MindMap: prompt LLMs with knowledge-graph evidence to produce explicit graph-style reasoning and reduce hallucination

August 17, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is practical for retrieval-heavy applications and improves factuality in experiments, but relies on curated KG coverage and careful prompt engineering.

Citations: 11

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Yilin Wen, Zifeng Wang, Jimeng Sun


Why It Matters For Business

MindMap makes LLM outputs more factual and inspectable by forcing the model to reason over KG-derived evidence graphs; this reduces hallucination risk in knowledge-heavy apps like medical assistants and increases trust.

Who Should Care

Summary TLDR

MindMap is a prompting pipeline that turns retrieved knowledge-graph subgraphs into concise evidence sentences, asks an LLM to merge them into a reasoning graph (a "mind map"), and then prompts the LLM to reason over that graph. On three medical QA datasets the method improves factuality and reduces hallucination versus plain retrieval-augmented prompts and chain/tree-of-thought baselines. The system is robust when the KG contains mismatched facts because the LLM is prompted to combine its own knowledge with retrieved KG evidence.

Problem Statement

Pretrained LLMs can generate fluent but sometimes wrong answers, are hard to update with new facts, and reason opaquely. The paper asks: can we prompt fixed LLMs with structured KG evidence so the model builds an explicit, inspectable graph-of-thought and uses both KG facts and its implicit knowledge to improve factuality and explainability?

Main Contribution

Introduce MindMap, a three-step prompting pipeline: (1) mine evidence sub-graphs from a KG, (2) convert and aggregate sub-graphs into reasoning graphs, (3) prompt LLMs to build a mind map and answer with graph-backed rationale.

Show that prompting LLMs to reason over aggregated KG sub-graphs yields better factuality and fewer hallucinations than standard document- or KG-based retrieval prompts and tree/chain-of-thought baselines on medical QA.
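The three steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy triples, the function names (`mine_evidence`, `to_evidence_sentences`, `build_prompt`), and the prompt wording are all hypothetical. Step 2 here uses a fixed string template, whereas the summary describes the conversion as prompt-driven, and the actual LLM call is left out.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Toy KG; these triples are invented for illustration only.
KG: List[Triple] = [
    ("fatigue", "symptom_of", "anemia"),
    ("anemia", "diagnosed_by", "blood test"),
    ("anemia", "treated_with", "iron supplement"),
    ("cough", "symptom_of", "bronchitis"),
]


def mine_evidence(kg: List[Triple], query_entities: Set[str]) -> List[Triple]:
    """Step 1 (simplified): keep triples that touch any query entity."""
    return [t for t in kg if t[0] in query_entities or t[2] in query_entities]


def to_evidence_sentences(triples: List[Triple]) -> List[str]:
    """Step 2 (simplified): render each triple as a short evidence sentence."""
    return [f"{h} {r.replace('_', ' ')} {t}." for h, r, t in triples]


def build_prompt(question: str, sentences: List[str]) -> str:
    """Step 3: ask for a mind map first, then a graph-backed answer."""
    evidence = "\n".join(f"- {s}" for s in sentences)
    return (
        f"Question: {question}\n"
        f"Evidence from the knowledge graph:\n{evidence}\n"
        "First merge the evidence into a mind map (a reasoning graph), "
        "combining it with your own knowledge. Then answer the question "
        "and cite the evidence you used."
    )


evidence = mine_evidence(KG, {"anemia"})
prompt = build_prompt("What could explain persistent fatigue?",
                      to_evidence_sentences(evidence))
print(prompt)  # this string would be sent to the LLM of your choice
```

Keeping the three steps as separate functions makes it easy to swap the template in step 2 for an LLM call later without touching retrieval or the final prompt.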

Key Findings

MindMap improves semantic match (BERTScore F1) on a clinical QA set (GenMedGPT-5k) compared to GPT-3.5 and retrieval-based baselines.

Numbers: BERTScore F1: MindMap 0.7954 vs GPT-3.5 0.7800 (Table 2)

Practical Use: If you need slightly better semantic alignment to reference medical answers, prompt LLMs with KG-derived mind maps rather than raw retrieved docs.

Evidence Ref: Table 2

MindMap is ranked better on factuality when GPT-4 is used as the rater, and shows fewer hallucinations on GenMedGPT-5k.

Numbers: GPT-4 rank avg 1.8725 (lower is better); Hallucination Quantify score 0.6070 vs 0.5563 for GPT-3.5 (Table 2)

Practical Use: Use MindMap to reduce hallucination risk in knowledge-heavy prompts; the mind map helps the model cite and cross-check evidence.

Evidence Ref: Table 2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| BERTScore F1 (GenMedGPT-5k) | 0.7954 (MindMap) | 0.7800 (GPT-3.5) | +0.0154 | GenMedGPT-5k | Table 2 shows BERTScore F1 per method | Table 2 |
| GPT-4 ranking (average; lower is better) | 1.8725 (MindMap) | 4.8571 (GPT-3.5) | -2.9846 | GenMedGPT-5k | Table 2 GPT-4 ranking column | Table 2 |
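The deltas are simply value minus baseline, but the sign reads differently per metric: a positive delta is an improvement for BERTScore F1, while a negative delta is an improvement for the GPT-4 ranking. A small sketch of that check, where the helper name `delta` and its `lower_is_better` flag are illustrative assumptions:

```python
from typing import Tuple


def delta(value: float, baseline: float,
          lower_is_better: bool = False) -> Tuple[float, bool]:
    """Return (value - baseline rounded to 4 places, whether it is an improvement)."""
    d = round(value - baseline, 4)
    improved = d < 0 if lower_is_better else d > 0
    return d, improved


# Values taken from the two result rows above.
f1_delta, f1_improved = delta(0.7954, 0.7800)
rank_delta, rank_improved = delta(1.8725, 4.8571, lower_is_better=True)
print(f1_delta, f1_improved)
print(rank_delta, rank_improved)
```

Both rows come out as improvements for MindMap once the direction of each metric is taken into account.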

What To Try In 7 Days

Build a small domain KG (or reuse EMCKG/CMCKG) with key entities and relations.

Implement evidence subgraph extraction (paths + 1-hop neighbors) for typical queries.

Prompt your LLM to convert subgraphs to short evidence sentences and ask for a mind-map style rationale before the final answer.
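For the second step in the list above, one way to extract "paths + 1-hop neighbors" is a breadth-first search between two query entities, followed by collecting every triple that touches a node on that path. The sketch below uses only the standard library. The triples, the function names, and the choice of a single shortest path over undirected edges are assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Invented example triples.
TRIPLES: List[Triple] = [
    ("fatigue", "symptom_of", "anemia"),
    ("anemia", "treated_with", "iron supplement"),
    ("iron supplement", "may_cause", "nausea"),
    ("cough", "symptom_of", "bronchitis"),
]


def build_adjacency(triples: List[Triple]) -> Dict[str, List[str]]:
    """Map each entity to its neighbors, treating edges as undirected."""
    adj: Dict[str, List[str]] = defaultdict(list)
    for head, _, tail in triples:
        adj[head].append(tail)
        adj[tail].append(head)
    return adj


def shortest_path(adj: Dict[str, List[str]], start: str, goal: str) -> List[str]:
    """Breadth-first search; returns the node list, or [] if unreachable."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return []


def evidence_subgraph(triples: List[Triple], start: str, goal: str) -> List[Triple]:
    """Path triples plus 1-hop neighbors: any triple touching a path node."""
    on_path = set(shortest_path(build_adjacency(triples), start, goal))
    return [t for t in triples if t[0] in on_path or t[2] in on_path]


for triple in evidence_subgraph(TRIPLES, "fatigue", "iron supplement"):
    print(triple)
```

The resulting triples can then be turned into short evidence sentences and placed in the prompt ahead of the request for a mind-map style rationale.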

Agent Features

Planning
graph-of-thought prompting
Tool Use
KG retrieval, document retrieval
Frameworks
LangChain-style prompting

Optimization Features

Token Efficiency
pruning and sampling of subgraphs to control prompt size
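
The summary does not say how pruning and sampling are done, so the following is only one plausible approach: shuffle the evidence sentences with a fixed seed and keep those that fit under a token budget. The whitespace-based token estimate and the names `estimate_tokens` and `sample_within_budget` are assumptions; a real system would count tokens with the target model's tokenizer.

```python
import random
from typing import List


def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption): count whitespace-separated words."""
    return len(text.split())


def sample_within_budget(sentences: List[str], max_tokens: int,
                         seed: int = 0) -> List[str]:
    """Randomly sample evidence sentences without exceeding max_tokens."""
    rng = random.Random(seed)  # fixed seed keeps prompts reproducible
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    kept: List[str] = []
    used = 0
    for sentence in shuffled:
        cost = estimate_tokens(sentence)
        if used + cost > max_tokens:
            continue  # skip sentences that would overflow the budget
        kept.append(sentence)
        used += cost
    return kept


evidence = [
    "fatigue symptom of anemia.",
    "anemia treated with iron supplement.",
    "iron supplement may cause nausea.",
]
print(sample_within_budget(evidence, max_tokens=10))
```

Random sampling treats all evidence as equally useful; ranking sentences by relevance to the query before truncating is an obvious alternative if a relevance score is available.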

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

EMCKG and CMCKG construction details referenced in appendix and data sources in text

Risks & Boundaries

Limitations

Performance depends on KG coverage and quality; incorrect KG facts can mislead the model unless its own knowledge counterbalances them.

Method increases prompt size and complexity; prompt tokens can grow when aggregating many subgraphs.

When Not To Use

When no domain KG exists and building one is infeasible.

For latency-sensitive services where extra retrieval and aggregation steps add unacceptable delay.

Failure Modes

Over-reliance on incorrect KG triples leads to wrong answers.

LLM may still hallucinate connections not supported by any evidence graph.
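
One way to make the second failure mode visible is to compare the edges of the generated mind map against the retrieved evidence and flag any edge with no supporting triple. The sketch below assumes the LLM output has already been parsed into `(source, target)` pairs, which is a separate problem not covered here; the function name and the example data are hypothetical. Because MindMap deliberately lets the model add its own knowledge, a flagged edge is a candidate for review rather than a confirmed error.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)
Edge = Tuple[str, str]         # (source, target) parsed from the mind map


def unsupported_edges(mind_map_edges: List[Edge],
                      evidence_triples: List[Triple]) -> List[Edge]:
    """Return mind-map edges that no evidence triple backs (either direction)."""
    supported = set()
    for head, _, tail in evidence_triples:
        supported.add((head, tail))
        supported.add((tail, head))
    return [edge for edge in mind_map_edges if edge not in supported]


# Invented example data.
evidence = [
    ("fatigue", "symptom_of", "anemia"),
    ("anemia", "treated_with", "iron supplement"),
]
mind_map = [
    ("fatigue", "anemia"),
    ("anemia", "iron supplement"),
    ("fatigue", "thyroid disorder"),  # not present in the evidence
]
print(unsupported_edges(mind_map, evidence))
```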

Core Entities

Models

gpt-3.5-turbo, gpt-4, Tree-of-Thought (TOT)

Metrics

BERTScore, GPT-4 Ranking, Accuracy, Hallucination Quantify

Datasets

GenMedGPT-5k, CMCQA, ExplainCPE

Benchmarks

Hallucination Quantify (introduced)

Context Entities

Datasets

EMCKG, CMCKG