Overview
The method is practical for retrieval-heavy applications and improves factuality in experiments, but relies on curated KG coverage and careful prompt engineering.
Citations: 11
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 6/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
MindMap makes LLM outputs more factual and inspectable by forcing the model to reason over KG-derived evidence graphs; this reduces hallucination risk in knowledge-heavy apps like medical assistants and increases trust.
Summary TLDR
MindMap is a prompting pipeline that turns retrieved knowledge-graph subgraphs into concise evidence sentences, asks an LLM to merge them into a reasoning graph (a "mind map"), and then prompts the LLM to reason over that graph. On three medical QA datasets the method improves factuality and reduces hallucination versus plain retrieval-augmented prompts and chain/tree-of-thought baselines. The system is reported to stay robust when the KG contains mismatched facts, because the LLM is prompted to combine its own knowledge with the retrieved KG evidence rather than rely on the KG alone.
Problem Statement
Pretrained LLMs can generate fluent but sometimes wrong answers, are hard to update with new facts, and reason opaquely. The paper asks: can we prompt fixed LLMs with structured KG evidence so that the model builds an explicit, inspectable graph of thought and uses both KG facts and its implicit knowledge to improve factuality and explainability?
Main Contribution
Introduce MindMap, a three-step prompting pipeline: (1) mine evidence sub-graphs from a KG, (2) convert and aggregate sub-graphs into reasoning graphs, (3) prompt LLMs to build a mind map and answer with graph-backed rationale.
Show that prompting LLMs to reason over aggregated KG sub-graphs yields better factuality and fewer hallucinations than standard document- or KG-based retrieval prompts and tree/chain-of-thought baselines on medical QA.
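Steps (2) and (3) of the pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `(head, relation, tail)` triple representation, the function names `triples_to_sentences` and `build_mindmap_prompt`, and the prompt wording are all assumptions made for the example.

```python
from typing import Iterable

# Assumed representation of one KG fact: (head, relation, tail).
Triple = tuple[str, str, str]


def triples_to_sentences(triples: Iterable[Triple]) -> list[str]:
    """Render each triple as a short evidence sentence, skipping duplicates."""
    seen: set[Triple] = set()
    sentences: list[str] = []
    for head, relation, tail in triples:
        if (head, relation, tail) in seen:
            continue
        seen.add((head, relation, tail))
        sentences.append(f"{head} {relation.replace('_', ' ')} {tail}.")
    return sentences


def build_mindmap_prompt(question: str, subgraphs: list[list[Triple]]) -> str:
    """Aggregate evidence subgraphs and ask for a mind map before the answer."""
    merged = [triple for subgraph in subgraphs for triple in subgraph]
    evidence = "\n".join(f"- {s}" for s in triples_to_sentences(merged))
    # Hypothetical prompt wording; the paper's actual templates may differ.
    return (
        f"Question: {question}\n"
        f"Knowledge-graph evidence:\n{evidence}\n"
        "First merge the evidence into a reasoning graph (a mind map), "
        "combining it with your own knowledge where the evidence is incomplete. "
        "Then give the final answer and cite the evidence lines you used."
    )
```

Calling `build_mindmap_prompt` with two overlapping toy subgraphs yields a prompt in which each shared fact appears only once, which keeps the aggregated evidence compact before it reaches the LLM.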
Key Findings
MindMap improves semantic match (BERTScore F1) on a clinical QA set (GenMedGPT-5k) compared with GPT-3.5 and retrieval-based baselines.
GPT-4 raters judge MindMap's answers to be more factual and to contain fewer hallucinations on GenMedGPT-5k.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| BERTScore F1 (GenMedGPT-5k) | 0.7954 (MindMap) | 0.7800 (GPT-3.5) | +0.0154 | GenMedGPT-5k | Table 2 shows BERTScore F1 per method | Table 2 |
| GPT-4 ranking (average; lower is better) | 1.8725 (MindMap) | 4.8571 (GPT-3.5) | -2.9846 | GenMedGPT-5k | Table 2 GPT-4 ranking column | Table 2 |
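
The Delta column is the MindMap value minus the GPT-3.5 baseline. The short check below recomputes it from the table values; the dictionary layout and variable names are only illustrative.

```python
# Values copied from the Results table: (MindMap, GPT-3.5 baseline).
rows = {
    "BERTScore F1": (0.7954, 0.7800),   # higher is better
    "GPT-4 ranking": (1.8725, 4.8571),  # lower is better
}

# Delta = method value minus baseline value, rounded to four decimals.
deltas = {
    name: round(method - baseline, 4)
    for name, (method, baseline) in rows.items()
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

Because a lower GPT-4 ranking is better, the negative delta on that row is an improvement, while the positive delta on BERTScore F1 is also an improvement.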
What To Try In 7 Days
Build a small domain KG (or reuse EMCKG/CMCKG) with key entities and relations.
Implement evidence subgraph extraction (paths + 1-hop neighbors) for typical queries.
Prompt your LLM to convert subgraphs to short evidence sentences and ask for a mind-map style rationale before the final answer.
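The extraction step (paths + 1-hop neighbors) could be prototyped as below. This is a sketch under stated assumptions: the KG is a small in-memory list of triples, paths are found with an undirected breadth-first search, and the toy facts and all function names (`build_adjacency`, `shortest_path`, `evidence_subgraph`) are hypothetical. The source does not specify the path-search strategy.

```python
from collections import deque

Triple = tuple[str, str, str]  # assumed (head, relation, tail) form

# Toy KG for illustration only; replace with your domain KG.
KG: list[Triple] = [
    ("fatigue", "symptom_of", "anemia"),
    ("pallor", "symptom_of", "anemia"),
    ("anemia", "diagnosed_by", "complete blood count"),
    ("anemia", "treated_with", "iron supplement"),
]


def build_adjacency(triples: list[Triple]) -> dict[str, list[Triple]]:
    """Index triples by both endpoints so traversal ignores edge direction."""
    adj: dict[str, list[Triple]] = {}
    for head, relation, tail in triples:
        adj.setdefault(head, []).append((head, relation, tail))
        adj.setdefault(tail, []).append((head, relation, tail))
    return adj


def shortest_path(adj: dict[str, list[Triple]], source: str, target: str) -> list[Triple]:
    """Return the triples along one shortest path, or [] if none exists."""
    parents = {source: None}  # node -> (previous node, connecting triple)
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for triple in adj.get(node, []):
            neighbor = triple[2] if triple[0] == node else triple[0]
            if neighbor not in parents:
                parents[neighbor] = (node, triple)
                queue.append(neighbor)
    if target not in parents:
        return []
    path: list[Triple] = []
    node = target
    while parents[node] is not None:
        node, triple = parents[node]
        path.append(triple)
    return path[::-1]


def evidence_subgraph(triples: list[Triple], entities: list[str]) -> list[Triple]:
    """Collect path triples between query entities plus their 1-hop neighbors."""
    adj = build_adjacency(triples)
    evidence: list[Triple] = []
    for i, entity in enumerate(entities):
        for other in entities[i + 1:]:
            evidence.extend(shortest_path(adj, entity, other))
        evidence.extend(adj.get(entity, []))
    return list(dict.fromkeys(evidence))  # de-duplicate, keep first-seen order
```

For example, `evidence_subgraph(KG, ["fatigue", "pallor"])` collects the two `symptom_of` facts that connect both symptoms through `anemia`. On a real KG you would likely cap path length and neighbor count to keep the prompt small.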
Agent Features
Planning
Tool Use
Frameworks
Optimization Features
Token Efficiency
Risks & Boundaries
Limitations
Performance depends on KG coverage and quality; incorrect KG facts can mislead the model unless the LLM's own knowledge counterbalances them.
Method increases prompt size and complexity; prompt tokens can grow when aggregating many subgraphs.
When Not To Use
When no domain KG exists and building one is infeasible.
For latency-sensitive services where extra retrieval and aggregation steps add unacceptable delay.
Failure Modes
Over-reliance on incorrect KG triples leads to wrong answers.
LLM may still hallucinate connections not supported by any evidence graph.
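One way to surface the second failure mode is to compare the edges in the LLM's mind map against the retrieved evidence. The sketch below assumes the mind map has already been parsed into `(head, relation, tail)` triples (the parsing step is not shown), and the helper names `normalize` and `unsupported_edges` are hypothetical.

```python
Triple = tuple[str, str, str]  # assumed (head, relation, tail) form


def normalize(triple: Triple) -> Triple:
    """Lowercase and trim each part so trivial formatting differences match."""
    head, relation, tail = (part.strip().lower() for part in triple)
    return (head, relation, tail)


def unsupported_edges(mind_map_edges: list[Triple], evidence: list[Triple]) -> list[Triple]:
    """Return mind-map edges that do not appear in the retrieved evidence."""
    supported = {normalize(triple) for triple in evidence}
    return [edge for edge in mind_map_edges if normalize(edge) not in supported]
```

Because the method deliberately lets the LLM add its own knowledge, an unsupported edge is not necessarily wrong; a check like this only flags which links lack KG backing so that they can be reviewed.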

