Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
11
Why It Matters For Business
MindMap makes LLM outputs more factual and easier to inspect by having the model reason over evidence graphs derived from a knowledge graph (KG). In knowledge-heavy applications such as medical assistants, this lowers hallucination risk and increases trust.
Summary TLDR
MindMap is a prompting pipeline that turns retrieved knowledge-graph subgraphs into concise evidence sentences, asks an LLM to merge them into a reasoning graph (a "mind map"), and then prompts the LLM to reason over that graph. On three medical QA datasets, the method improves factuality and reduces hallucination compared with plain retrieval-augmented prompts and with chain-of-thought and tree-of-thought baselines. The system stays robust when the KG contains mismatched facts, because the LLM is prompted to combine its own knowledge with the retrieved KG evidence.
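The conversion from subgraph to evidence sentence to prompt can be sketched as follows. This is a minimal illustration, not the authors' code: the `Triple` schema, the arrow-style sentence format, the function names, and the prompt wording are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical schema: (head entity, relation, tail entity).
Triple = Tuple[str, str, str]


def path_to_sentence(path: List[Triple]) -> str:
    """Render a chain of connected triples as one arrow-joined evidence sentence."""
    if not path:
        return ""
    parts = [path[0][0]]
    for _head, relation, tail in path:
        parts.append(f"-[{relation}]->")
        parts.append(tail)
    return " ".join(parts)


def build_prompt(question: str, evidence: List[str]) -> str:
    """Assemble a prompt that asks for a mind map first and an answer second.

    The instruction wording is illustrative only.
    """
    lines = ["You are given evidence retrieved from a knowledge graph."]
    for i, sentence in enumerate(evidence, start=1):
        lines.append(f"Evidence {i}: {sentence}")
    lines += [
        "Step 1: Merge the evidence into a single reasoning graph (a mind map).",
        "Step 2: Reason over that graph, adding your own knowledge where the "
        "evidence is incomplete or inconsistent.",
        "Step 3: Give the final answer and cite the evidence numbers you used.",
        f"Question: {question}",
    ]
    return "\n".join(lines)
```

Keeping the evidence numbered makes it possible to ask the model to cite which items support each node of its mind map, which is what makes the rationale inspectable.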
Problem Statement
Pretrained LLMs can generate fluent but sometimes wrong answers, are hard to update with new facts, and reason in ways that are opaque. The paper asks: can we prompt fixed LLMs with structured KG evidence so that the model builds an explicit, inspectable graph-of-thought and uses both KG facts and its implicit knowledge to improve factuality and explainability?
Main Contribution
Introduce MindMap, a three-step prompting pipeline: (1) mine evidence sub-graphs from a KG, (2) convert and aggregate sub-graphs into reasoning graphs, (3) prompt LLMs to build a mind map and answer with graph-backed rationale.
Show that prompting LLMs to reason over aggregated KG sub-graphs yields better factuality and fewer hallucinations than standard document- or KG-based retrieval prompts and tree/chain-of-thought baselines on medical QA.
Propose a simple hallucination quantification method and run ablations that isolate path-based and neighbor-based evidence, showing that the two are complementary.
Release code to reproduce experiments and analyses.
Key Findings
MindMap improves semantic match (BERTScore F1) on a clinical QA set (GenMedGPT-5k) compared to GPT-3.5 and other retrievers.
GPT-4 raters judge MindMap's answers to be more factual on GenMedGPT-5k, and the answers contain fewer hallucinations.
MindMap increases accuracy on a multiple-choice test with noisy or mismatched evidence (ExplainCPE) over GPT-3.5 and most retrievers.
Combining both path-based and neighbor-based evidence beats using either alone and reduces hallucinations.
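For reference, BERTScore F1, the semantic-match metric named in these findings, is commonly defined as below. This is the standard greedy-matching form of the metric without importance weighting, not a formula taken from this summary. Here $x$ denotes the reference tokens and $\hat{x}$ the candidate tokens, each represented by a unit-normalized contextual embedding.

```latex
R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j,
\qquad
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j,
\qquad
F_{\mathrm{BERT}} = \frac{2\, P_{\mathrm{BERT}}\, R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}
```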
Results
BERTScore F1 (GenMedGPT-5k)
GPT-4 ranking (average; lower is better)
Hallucination Quantify (higher = less hallucination)
BERTScore F1 (CMCQA)
Accuracy
Ablation: BERTScore F1 (path-only / neighbor-only)
Who Should Care
What To Try In 7 Days
Build a small domain KG (or reuse EMCKG/CMCKG) with key entities and relations.
Implement evidence subgraph extraction (paths + 1-hop neighbors) for typical queries.
Prompt your LLM to convert subgraphs to short evidence sentences and ask for a mind-map style rationale before the final answer.
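The extraction step above (paths plus 1-hop neighbors) might be prototyped like this. It is a sketch under stated assumptions: the KG is a small in-memory list of triples, edges are traversed in both directions, and the helper names `build_index`, `one_hop`, and `shortest_path` are hypothetical rather than taken from the MindMap code.

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional, Tuple

# Hypothetical schema: (head entity, relation, tail entity).
Triple = Tuple[str, str, str]


def build_index(triples: List[Triple]) -> Dict[str, List[Triple]]:
    """Index every triple under both of its endpoints."""
    index: Dict[str, List[Triple]] = defaultdict(list)
    for head, relation, tail in triples:
        index[head].append((head, relation, tail))
        index[tail].append((head, relation, tail))
    return index


def one_hop(index: Dict[str, List[Triple]], entity: str) -> List[Triple]:
    """Neighbor-based evidence: all triples that touch the entity."""
    return list(index.get(entity, []))


def shortest_path(
    index: Dict[str, List[Triple]], source: str, target: str, max_hops: int = 3
) -> Optional[List[Triple]]:
    """Path-based evidence: breadth-first search for a shortest triple chain."""
    if source == target:
        return []
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop limit
        for triple in index.get(node, []):
            head, _relation, tail = triple
            nxt = tail if head == node else head
            if nxt in seen:
                continue
            new_path = path + [triple]
            if nxt == target:
                return new_path
            seen.add(nxt)
            queue.append((nxt, new_path))
    return None  # no path within max_hops
```

For a real KG, the same two queries would typically be issued against a graph database rather than an in-memory index; the hop limit matters in either case because it bounds how much evidence reaches the prompt.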
Agent Features
Planning
- graph-of-thought prompting
Tool Use
- KG retrieval
- document retrieval
Frameworks
- LangChain-style prompting
Optimization Features
Token Efficiency
- pruning and sampling of subgraphs to control prompt size
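One way to keep the prompt within a budget is to deduplicate evidence sentences and then sample them until a token limit is reached. The sketch below is an assumption about how such pruning could work, not the paper's procedure; `approx_tokens` is a crude whitespace count standing in for a real tokenizer, and `prune_evidence` is a hypothetical helper.

```python
import random
from typing import List


def approx_tokens(text: str) -> int:
    """Crude token estimate by whitespace splitting; a real tokenizer will differ."""
    return len(text.split())


def prune_evidence(sentences: List[str], max_tokens: int, seed: int = 0) -> List[str]:
    """Drop duplicates, shuffle with a fixed seed, and keep what fits the budget."""
    unique = list(dict.fromkeys(sentences))  # order-preserving deduplication
    rng = random.Random(seed)
    rng.shuffle(unique)
    kept: List[str] = []
    used = 0
    for sentence in unique:
        cost = approx_tokens(sentence)
        if used + cost > max_tokens:
            continue  # skip sentences that would exceed the budget
        kept.append(sentence)
        used += cost
    return kept
```

Uniform sampling is the simplest choice; ranking sentences by relevance to the query before truncating is a natural alternative when a scoring function is available.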
Reproducibility
Data Urls
- Construction details for EMCKG and CMCKG are given in the paper's appendix; the underlying data sources are cited in the text
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Performance depends on KG coverage and quality; incorrect KG facts can mislead the model unless its own knowledge counterbalances them.
- Method increases prompt size and complexity; prompt tokens can grow when aggregating many subgraphs.
- Interpretability of generated mind maps depends on how clearly the LLM links nodes to evidence; they are not formal proofs.
When Not To Use
- When no domain KG exists and building one is infeasible.
- For latency-sensitive services where extra retrieval and aggregation steps add unacceptable delay.
- When absolute clinical-grade reliability is required without human oversight.
Failure Modes
- Over-reliance on incorrect KG triples leads to wrong answers.
- LLM may still hallucinate connections not supported by any evidence graph.
- Prompt templates or exemplar choices may bias the LLM toward certain diagnoses or actions.
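A lightweight guard against the second failure mode is to compare the edges of the generated mind map with the retrieved evidence and flag any edge that no triple supports. The sketch below assumes the mind map has already been parsed into triples, which is itself a nontrivial step not covered here; `unsupported_edges` is a hypothetical helper and matches entity pairs only, ignoring relation names and direction.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical schema: (head entity, relation, tail entity).
Triple = Tuple[str, str, str]


def _pair(head: str, tail: str) -> FrozenSet[str]:
    """Direction-insensitive, case-insensitive key for an entity pair."""
    return frozenset((head.strip().lower(), tail.strip().lower()))


def unsupported_edges(mind_map: List[Triple], evidence: List[Triple]) -> List[Triple]:
    """Return mind-map edges whose entity pair appears in no evidence triple."""
    supported = {_pair(head, tail) for head, _relation, tail in evidence}
    return [edge for edge in mind_map if _pair(edge[0], edge[2]) not in supported]
```

Because MindMap deliberately lets the LLM add its own knowledge, a flagged edge is not necessarily wrong; it only marks a claim that should be reviewed rather than traced back to the KG.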
Core Entities
Models
- gpt-3.5-turbo
- gpt-4
- Tree-of-Thought (TOT)
Metrics
- BERTScore
- GPT-4 Ranking
- Accuracy
- Hallucination Quantify
Datasets
- GenMedGPT-5k
- CMCQA
- ExplainCPE
Benchmarks
- Hallucination Quantify (introduced)
Context Entities
Datasets
- EMCKG
- CMCKG

