MindMap: prompt LLMs with knowledge-graph evidence to produce explicit graph-style reasoning and reduce hallucination

August 17, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

11

Authors

Yilin Wen, Zifeng Wang, Jimeng Sun

Links

Abstract / PDF

Why It Matters For Business

MindMap makes LLM outputs more factual and inspectable by forcing the model to reason over KG-derived evidence graphs; this reduces hallucination risk in knowledge-heavy apps like medical assistants and increases trust.

Summary TLDR

MindMap is a prompting pipeline that turns retrieved knowledge-graph subgraphs into concise evidence sentences, asks an LLM to merge them into a reasoning graph (a "mind map"), and then prompts the LLM to reason over that graph. On three medical QA datasets the method improves factuality and reduces hallucination versus plain retrieval-augmented prompts and chain/tree-of-thought baselines. The system is robust when the KG contains mismatched facts because the LLM is prompted to combine its own knowledge with retrieved KG evidence.
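The middle of that pipeline (subgraph to evidence sentences to mind-map prompt) can be sketched as follows. This is a minimal illustration, not the authors' code: the `Triple` representation, the arrow-style sentence format, the prompt wording, and both function names are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical representation of one KG edge: (head, relation, tail).
Triple = Tuple[str, str, str]


def path_to_sentence(path: List[Triple]) -> str:
    """Render a chain of triples as one arrow-style evidence sentence."""
    if not path:
        return ""
    parts = [path[0][0]]
    for _head, relation, tail in path:
        parts.append(f"-[{relation}]->")
        parts.append(tail)
    return " ".join(parts)


def build_mindmap_prompt(question: str, evidence: List[str]) -> str:
    """Assemble a prompt that asks for a reasoning graph before the answer.

    The wording is illustrative only; it is not the paper's template.
    """
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(evidence, start=1))
    return (
        "Question: " + question + "\n\n"
        "Knowledge-graph evidence:\n" + numbered + "\n\n"
        "First merge the evidence into a reasoning graph (a mind map), "
        "citing evidence numbers for each link. "
        "Then combine it with what you already know and give the final answer."
    )


# Toy path invented for the example.
path = [
    ("fatigue", "symptom_of", "anemia"),
    ("anemia", "needs_test", "complete blood count"),
]
sentence = path_to_sentence(path)
prompt = build_mindmap_prompt(
    "What test should a patient with fatigue get?", [sentence]
)
print(prompt)
```

Numbering the evidence lines gives the model something concrete to cite, which is one simple way to make the resulting rationale easier to inspect.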

Problem Statement

Pretrained LLMs can generate fluent but sometimes wrong answers, are hard to update with new facts, and reason in ways that are opaque to users. The paper asks: can we prompt fixed LLMs with structured KG evidence so that the model builds an explicit, inspectable graph-of-thought and uses both KG facts and its implicit knowledge to improve factuality and explainability?

Main Contribution

Introduce MindMap, a three-step prompting pipeline: (1) mine evidence sub-graphs from a KG, (2) convert and aggregate sub-graphs into reasoning graphs, (3) prompt LLMs to build a mind map and answer with graph-backed rationale.

Show that prompting LLMs to reason over aggregated KG sub-graphs yields better factuality and fewer hallucinations than standard document- or KG-based retrieval prompts and tree/chain-of-thought baselines on medical QA.

Propose a simple hallucination quantification method and run ablations that isolate path-based vs neighbor-based evidence and demonstrate complementarity.

Release code to reproduce experiments and analyses.

Key Findings

MindMap improves semantic match (BERTScore F1) on a clinical QA set (GenMedGPT-5k) compared with GPT-3.5 and the retrieval baselines.

Numbers: BERTScore F1: MindMap 0.7954 vs GPT-3.5 0.7800 (Table 2)

MindMap is judged more factual by GPT-4 raters and hallucinates less on GenMedGPT-5k.

Numbers: GPT-4 average rank 1.8725 (lower is better); hallucination quantification score 0.6070 vs 0.5563 for GPT-3.5, where higher means less hallucination (Table 2)

MindMap increases accuracy on ExplainCPE, a multiple-choice test where the retrieved KG evidence is noisy or mismatched, over GPT-3.5 and most retrievers.

Numbers: Accuracy: MindMap 61.7% vs GPT-3.5 52.2% and KG Retriever 42.0% (Table 6)

Combining both path-based and neighbor-based evidence beats using either alone and reduces hallucinations.

Numbers: GenMedGPT-5k BERTScore F1: MindMap 0.7960 vs path-only 0.7002 and neighbor-only 0.7072; the hallucination quantification score also improved (Table 8)

Results

BERTScore F1 (GenMedGPT-5k)

Value: 0.7954 (MindMap)

Baseline: 0.7800 (GPT-3.5)

GPT-4 ranking (average; lower is better)

Value: 1.8725 (MindMap)

Baseline: 4.8571 (GPT-3.5)

Hallucination Quantify (higher = less hallucination)

Value: 0.6070 (MindMap)

Baseline: 0.5563 (GPT-3.5)

BERTScore F1 (CMCQA)

Value: 0.9367 (MindMap)

Baseline: 0.9372 (GPT-3.5)

Accuracy (ExplainCPE)

Value: 61.7% (MindMap)

Baseline: 52.2% (GPT-3.5); 72.0% (GPT-4)

Ablation: BERTScore F1 (path-only / neighbor-only)

Value: path-only 0.7002; neighbor-only 0.7072; MindMap 0.7960

Baseline: path-only / neighbor-only
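To make the size of each gap easier to read, the short script below computes MindMap minus GPT-3.5 for the higher-is-better metrics listed above. The numbers are copied from this Results section; the `delta` helper and the dictionary layout are only for illustration.

```python
# (MindMap value, GPT-3.5 baseline) pairs copied from the Results section.
# All four metrics are higher-is-better; GPT-4 ranking is omitted because
# it is lower-is-better.
reported = {
    "BERTScore F1 (GenMedGPT-5k)": (0.7954, 0.7800),
    "Hallucination Quantify": (0.6070, 0.5563),
    "BERTScore F1 (CMCQA)": (0.9367, 0.9372),
    "Accuracy % (ExplainCPE)": (61.7, 52.2),
}


def delta(value: float, baseline: float) -> float:
    """Raw difference, rounded to 4 decimals to hide float noise."""
    return round(value - baseline, 4)


deltas = {name: delta(v, b) for name, (v, b) in reported.items()}
for name, d in deltas.items():
    print(f"{name}: {d:+.4f}")
```

The CMCQA difference is slightly negative, so on that dataset BERTScore alone does not separate the two systems. These are raw differences and say nothing about statistical significance.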

Who Should Care

What To Try In 7 Days

Build a small domain KG (or reuse EMCKG/CMCKG) with key entities and relations.

Implement evidence subgraph extraction (paths + 1-hop neighbors) for typical queries.

Prompt your LLM to convert subgraphs to short evidence sentences and ask for a mind-map style rationale before the final answer.
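The extraction step (paths plus 1-hop neighbors) can be prototyped on a toy graph before touching a real KG. The sketch below assumes a directed graph stored as a list of triples and uses breadth-first search as a stand-in for path exploration; the paper's actual exploration procedure is not reproduced here, and every function name and the toy triples are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail), assumed layout
Adjacency = Dict[str, List[Tuple[str, str]]]


def build_adjacency(triples: List[Triple]) -> Adjacency:
    """Index outgoing edges by head entity."""
    adj: Adjacency = {}
    for head, rel, tail in triples:
        adj.setdefault(head, []).append((rel, tail))
        adj.setdefault(tail, [])  # make sure sink nodes are present too
    return adj


def shortest_path(
    adj: Adjacency, source: str, target: str, max_hops: int = 3
) -> Optional[List[Triple]]:
    """Breadth-first search for a path of at most max_hops edges."""
    if source not in adj:
        return None
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        if len(path) >= max_hops:
            continue
        for rel, nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None


def one_hop_neighbors(adj: Adjacency, entity: str) -> List[Triple]:
    """All outgoing edges of an entity, returned as triples."""
    return [(entity, rel, nxt) for rel, nxt in adj.get(entity, [])]


# Toy KG invented for the example.
triples = [
    ("fatigue", "symptom_of", "anemia"),
    ("pale skin", "symptom_of", "anemia"),
    ("anemia", "needs_test", "complete blood count"),
    ("anemia", "treated_with", "iron supplement"),
]
adj = build_adjacency(triples)
path = shortest_path(adj, "fatigue", "complete blood count")
neighbors = one_hop_neighbors(adj, "anemia")
print(path)
print(neighbors)
```

Capping `max_hops` keeps the search cheap and the evidence short; for a large KG a graph database or a graph library would likely replace these helpers.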

Agent Features

Planning

  • graph-of-thought prompting

Tool Use

  • KG retrieval
  • document retrieval

Frameworks

  • LangChain-style prompting

Optimization Features

Token Efficiency

  • pruning and sampling of subgraphs to control prompt size
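One way to picture pruning and sampling is a budgeted selection over evidence sentences. The sketch below is an assumption about how such a step could look, not the paper's procedure: it deduplicates, shuffles with a fixed seed, and keeps sentences while a crude word-count budget allows. `approx_tokens` is a placeholder for a real tokenizer.

```python
import random
from typing import List


def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())


def prune_evidence(sentences: List[str], budget: int, seed: int = 0) -> List[str]:
    """Deduplicate, sample in a seeded random order, and stay within budget."""
    unique = list(dict.fromkeys(sentences))  # drops duplicates, keeps first occurrence
    rng = random.Random(seed)  # local RNG so results are reproducible
    rng.shuffle(unique)
    kept: List[str] = []
    used = 0
    for sentence in unique:
        cost = approx_tokens(sentence)
        if used + cost <= budget:
            kept.append(sentence)
            used += cost
    return kept


# Toy evidence sentences invented for the example.
evidence = [
    "fatigue -[symptom_of]-> anemia",
    "pale skin -[symptom_of]-> anemia",
    "fatigue -[symptom_of]-> anemia",  # duplicate on purpose
    "anemia -[needs_test]-> complete blood count",
    "anemia -[treated_with]-> iron supplement",
]
kept = prune_evidence(evidence, budget=10)
print(kept)
```

Random sampling treats all evidence as equally useful; ranking sentences by relevance to the question before applying the budget would be a natural refinement.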

Reproducibility

Data Urls

  • EMCKG and CMCKG construction details referenced in appendix and data sources in text

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Performance depends on KG coverage and quality; incorrect KG facts can mislead the model unless its own knowledge or other evidence counterbalances them.
  • Method increases prompt size and complexity; prompt tokens can grow when aggregating many subgraphs.
  • Interpretability of generated mind maps depends on how clearly the LLM links nodes to evidence; they are not formal proofs.

When Not To Use

  • When no domain KG exists and building one is infeasible.
  • For latency-sensitive services where extra retrieval and aggregation steps add unacceptable delay.
  • When absolute clinical-grade reliability is required without human oversight.

Failure Modes

  • Over-reliance on incorrect KG triples leads to wrong answers.
  • LLM may still hallucinate connections not supported by any evidence graph.
  • Prompt templates or exemplar choices may bias the LLM toward certain diagnoses or actions.

Core Entities

Models

  • gpt-3.5-turbo
  • gpt-4
  • Tree-of-Thought (TOT)

Metrics

  • BERTScore
  • GPT-4 Ranking
  • Accuracy
  • Hallucination Quantify

Datasets

  • GenMedGPT-5k
  • CMCQA
  • ExplainCPE

Benchmarks

  • Hallucination Quantify (introduced)

Context Entities

Datasets

  • EMCKG
  • CMCKG