MindMap: prompt LLMs with knowledge-graph evidence to produce explicit graph-style reasoning and reduce hallucination

Overview

Decision Snapshot: Ready For Pilot

The method is practical for retrieval-heavy applications and improves factuality in experiments, but relies on curated KG coverage and careful prompt engineering.

Citations: 11

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Yilin Wen, Zifeng Wang, Jimeng Sun

Why It Matters For Business

MindMap makes LLM outputs more factual and inspectable by forcing the model to reason over KG-derived evidence graphs; this reduces hallucination risk in knowledge-heavy apps like medical assistants and increases trust.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

MindMap is a prompting pipeline that turns retrieved knowledge-graph subgraphs into concise evidence sentences, asks an LLM to merge them into a reasoning graph (a "mind map"), and then prompts the LLM to reason over that graph. On three medical QA datasets the method improves factuality and reduces hallucination versus plain retrieval-augmented prompts and chain/tree-of-thought baselines. The system is robust when the KG contains mismatched facts because the LLM is prompted to combine its own knowledge with retrieved KG evidence.

Problem Statement

Pretrained LLMs can generate fluent but sometimes wrong answers. They are also hard to update with new facts, and their reasoning is opaque. The paper asks: can we prompt fixed LLMs with structured KG evidence so that the model builds an explicit, inspectable graph-of-thought and uses both KG facts and its implicit knowledge to improve factuality and explainability?

Main Contribution

Introduce MindMap, a three-step prompting pipeline: (1) mine evidence sub-graphs from a KG, (2) convert and aggregate sub-graphs into reasoning graphs, (3) prompt LLMs to build a mind map and answer with graph-backed rationale.

Show that prompting LLMs to reason over aggregated KG sub-graphs yields better factuality and fewer hallucinations than standard document- or KG-based retrieval prompts and tree/chain-of-thought baselines on medical QA.
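
The three-step pipeline in the first contribution can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary describes the LLM itself converting and merging sub-graphs, whereas here a string template and simple deduplication stand in for that step. The function names, the prompt wording, and the `llm` callable are all assumptions made for the example.

```python
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def triples_to_sentences(subgraph: Iterable[Triple]) -> List[str]:
    """Step 2a (simplified): verbalize each KG triple as a short evidence sentence."""
    return [f"{h} {r.replace('_', ' ')} {t}." for h, r, t in subgraph]


def aggregate(subgraphs: Iterable[Iterable[Triple]]) -> List[str]:
    """Step 2b (simplified): merge evidence from several sub-graphs,
    dropping duplicates while keeping first-seen order."""
    seen = {}
    for sg in subgraphs:
        for sentence in triples_to_sentences(sg):
            seen.setdefault(sentence, None)
    return list(seen)


def build_prompt(question: str, evidence: List[str]) -> str:
    """Step 3: ask for a mind map first, then a graph-backed answer.
    The wording is hypothetical, not the paper's prompt."""
    lines = "\n".join(f"- {s}" for s in evidence)
    return (
        f"Question: {question}\n"
        f"Knowledge-graph evidence:\n{lines}\n"
        "First draw a mind map linking the evidence, adding your own knowledge "
        "where the evidence is incomplete. Then give the final answer and cite "
        "the evidence lines you used."
    )


def answer(question: str, subgraphs, llm: Callable[[str], str]) -> str:
    """Run steps 2 and 3; `llm` is any function mapping a prompt to a completion."""
    return llm(build_prompt(question, aggregate(subgraphs)))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client; step 1 (mining the sub-graphs) is assumed to have happened before `answer` is called.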

Key Findings

MindMap improves semantic match (BERTScore F1) on a clinical QA set (GenMedGPT-5k) compared to GPT-3.5 and retrieval-based baselines.

Numbers: BERTScore F1: MindMap 0.7954 vs GPT-3.5 0.7800 (Table 2)

Practical Use: If you need slightly better semantic alignment with reference medical answers, prompt LLMs with KG-derived mind maps rather than raw retrieved docs.

Evidence Ref: Table 2

MindMap is judged more factual by GPT-4 (used as the rater) and shows fewer hallucinations than GPT-3.5 on GenMedGPT-5k.

Numbers: GPT-4 rank avg 1.8725 (MindMap) vs 4.8571 (GPT-3.5), lower is better; Hallucination Quantify score 0.6070 (MindMap) vs 0.5563 (GPT-3.5) (Table 2)

Practical Use: Use MindMap to reduce hallucination risk in knowledge-heavy prompts; the mind map helps the model cite and cross-check evidence.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BERTScore F1 (GenMedGPT-5k)	0.7954 (MindMap)	0.7800 (GPT-3.5)	+0.0154	GenMedGPT-5k	Table 2 shows BERTScore F1 per method	Table 2
GPT-4 ranking (average; lower is better)	1.8725 (MindMap)	4.8571 (GPT-3.5)	-2.9846	GenMedGPT-5k	Table 2 GPT-4 ranking column	Table 2
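
The Delta column above is simply Value minus Baseline, and whether a delta counts as an improvement depends on the metric's direction. A small sketch of that arithmetic, using only the numbers from the table (the helper names are illustrative):

```python
# (metric, MindMap value, GPT-3.5 baseline, higher_is_better)
rows = [
    ("BERTScore F1 (GenMedGPT-5k)", 0.7954, 0.7800, True),
    ("GPT-4 ranking (average)", 1.8725, 4.8571, False),
]


def delta(value: float, baseline: float) -> float:
    """Signed difference, rounded to the table's four decimal places."""
    return round(value - baseline, 4)


def improved(value: float, baseline: float, higher_is_better: bool) -> bool:
    """A positive delta helps only when higher is better, and vice versa."""
    d = value - baseline
    return d > 0 if higher_is_better else d < 0


for name, value, baseline, hib in rows:
    print(f"{name}: delta={delta(value, baseline):+.4f}, "
          f"improved={improved(value, baseline, hib)}")
```

Both rows come out as improvements: +0.0154 on a higher-is-better metric and -2.9846 on a lower-is-better one.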

What To Try In 7 Days

Build a small domain KG (or reuse EMCKG/CMCKG) with key entities and relations.

Implement evidence subgraph extraction (paths + 1-hop neighbors) for typical queries.

Prompt your LLM to convert subgraphs to short evidence sentences and ask for a mind-map style rationale before the final answer.
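
The extraction step in the list above (paths plus 1-hop neighbors) might be prototyped along these lines. This is a sketch under stated assumptions: the toy triples are invented for illustration, the KG is held in memory as a list of triples, edges are walked in both directions, and a plain breadth-first search stands in for whatever path-mining procedure the authors use.

```python
from collections import deque

# Toy KG as (head, relation, tail) triples; contents are illustrative only.
KG = [
    ("fatigue", "symptom_of", "anemia"),
    ("pale skin", "symptom_of", "anemia"),
    ("anemia", "needs_test", "complete blood count"),
    ("anemia", "treated_with", "iron supplements"),
    ("fatigue", "symptom_of", "hypothyroidism"),
]


def build_adjacency(triples):
    """Index each triple under both endpoints so edges can be walked either way."""
    adj = {}
    for h, r, t in triples:
        adj.setdefault(h, []).append((h, r, t))
        adj.setdefault(t, []).append((h, r, t))
    return adj


def shortest_path(adj, src, dst, max_hops=3):
    """Breadth-first search; returns the triples along one shortest path, or []."""
    queue = deque([(src, [])])
    visited = {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return path
        if len(path) == max_hops:
            continue  # do not expand beyond the hop limit
        for h, r, t in adj.get(node, []):
            nxt = t if h == node else h
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [(h, r, t)]))
    return []


def evidence_subgraph(triples, query_entities, max_hops=3):
    """Union of pairwise paths between query entities and their 1-hop neighbors."""
    adj = build_adjacency(triples)
    ents = list(query_entities)
    found = []
    for i, a in enumerate(ents):
        for b in ents[i + 1:]:
            found += shortest_path(adj, a, b, max_hops)
    for e in ents:
        found += adj.get(e, [])  # 1-hop neighbors
    return list(dict.fromkeys(found))  # dedupe, keep first-seen order
```

For example, `evidence_subgraph(KG, ["fatigue", "pale skin"])` returns the two triples linking both symptoms to anemia plus the fatigue-hypothyroidism neighbor edge. The resulting triples can then be verbalized into evidence sentences and placed in a prompt that asks for a mind-map rationale before the final answer.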

Agent Features

Planning

graph-of-thought prompting

Tool Use

KG retrieval, document retrieval

Frameworks

LangChain-style prompting

Optimization Features

Token Efficiency

pruning and sampling of subgraphs to control prompt size
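
One way to picture pruning and sampling is shown below. The summary does not say how the authors score or sample sub-graphs, so everything here is an assumption: the relevance scores are supplied by the caller, token cost is a crude whitespace count rather than a real tokenizer, and the function names are hypothetical.

```python
import random


def estimate_tokens(text: str) -> int:
    """Crude whitespace-based estimate; a real tokenizer will give other counts."""
    return len(text.split())


def sample_subgraphs(subgraphs, k, seed=0):
    """Randomly keep at most k sub-graphs (seeded so prompts are reproducible)."""
    subgraphs = list(subgraphs)
    if len(subgraphs) <= k:
        return subgraphs
    return random.Random(seed).sample(subgraphs, k)


def prune_to_budget(scored_sentences, token_budget):
    """Greedily keep the highest-scoring evidence sentences that fit the budget.

    scored_sentences: iterable of (relevance_score, sentence) pairs.
    Returns (kept_sentences, tokens_used).
    """
    kept, used = [], 0
    ranked = sorted(scored_sentences, key=lambda pair: pair[0], reverse=True)
    for _, sentence in ranked:
        cost = estimate_tokens(sentence)
        if used + cost <= token_budget:
            kept.append(sentence)
            used += cost
    return kept, used
```

Sampling caps how many sub-graphs are verbalized at all, while pruning caps the final prompt size; applying sampling first keeps the expensive verbalization step bounded.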

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/wylwilling/MindMap

Data URLs

No direct data URL is listed. EMCKG and CMCKG construction details are referenced in the paper's appendix, and the data sources are named in the text.

Risks & Boundaries

Limitations

Performance depends on KG coverage and quality; bad KG facts can mislead if not counterbalanced.

Method increases prompt size and complexity; prompt tokens can grow when aggregating many subgraphs.

When Not To Use

When no domain KG exists and building one is infeasible.

For latency-sensitive services where extra retrieval and aggregation steps add unacceptable delay.

Failure Modes

Over-reliance on incorrect KG triples leads to wrong answers.

LLM may still hallucinate connections not supported by any evidence graph.
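
The second failure mode can at least be surfaced for review. The sketch below is a hypothetical guard that is not described in the source: it assumes the mind map has already been parsed into entity pairs and flags any pair that no retrieved triple connects. Because the method deliberately lets the model add its own knowledge, a flagged edge is a candidate for inspection, not necessarily an error.

```python
def unsupported_edges(mind_map_edges, evidence_triples):
    """Return mind-map edges whose endpoint pair never appears in the evidence.

    mind_map_edges: iterable of (entity_a, entity_b) pairs parsed from the LLM output.
    evidence_triples: iterable of (head, relation, tail) triples given to the LLM.
    Matching is case-insensitive and ignores edge direction.
    """
    supported = {
        frozenset((h.lower(), t.lower())) for h, _, t in evidence_triples
    }
    return [
        (a, b)
        for a, b in mind_map_edges
        if frozenset((a.lower(), b.lower())) not in supported
    ]
```

Exact string matching is a deliberate simplification; entity aliases or paraphrases in the model output would need normalization before this check is meaningful.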

Core Entities

Models

gpt-3.5-turbo, gpt-4, Tree-of-Thought (TOT)

Metrics

BERTScore, GPT-4 Ranking, Accuracy, Hallucination Quantify

Datasets

GenMedGPT-5k, CMCQA, ExplainCPE

Benchmarks

Hallucination Quantify (introduced)

Context Entities

Datasets

EMCKG, CMCKG
