Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
3
Why It Matters For Business
An automated, confidence-scored medical knowledge graph lets smaller LLMs deliver near state-of-the-art medical QA, reducing compute cost and enabling more interpretable, up-to-date answers.
Summary TLDR
AMG-RAG builds and keeps a Medical Knowledge Graph (MKG) up to date using autonomous LLM agents plus search tools (PubMed, Wikipedia). The MKG stores entities, confidence-scored edges, and short summaries, and is used inside a RAG + Chain-of-Thought pipeline. On MEDQA and MedMCQA, an 8B backbone achieves F1 74.1% and accuracy 66.34%, matching or beating much larger models. PubMed-derived MKGs outperform Wiki-derived ones. The system and examples are published on GitHub.
Problem Statement
LLM knowledge goes out of date quickly in fast-moving medicine, and vector-only retrieval struggles with multi-hop, relational queries. Hand-curated medical knowledge graphs are costly to build and quickly become stale. The paper aims to automate graph creation and continuous updates, add confidence scores, and tie structured graphs to RAG plus reasoning for more current, explainable medical QA.
Main Contribution
An autonomous pipeline where LLM agents extract entities and infer relationships from live searches to build and update a Medical Knowledge Graph (MKG).
A confidence-scored MKG that attaches numeric reliability to edges and summaries to reduce noisy or misleading retrievals.
A graph-conditioned RAG+Chain-of-Thought (CoT) inference pipeline that traverses the MKG with thresholding to produce interpretable, multi-hop medical answers.
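To make the confidence-scored graph and thresholded traversal concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the Edge class, the traverse function, the 0.6 threshold, the two-hop limit, and the toy edges with their scores are all assumptions chosen for illustration. A real system would read edges from a graph database rather than an in-memory list.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str
    target: str
    confidence: float  # hypothetical reliability score in [0, 1]


def build_index(edges):
    """Group edges by source entity for quick neighbor lookup."""
    index = {}
    for edge in edges:
        index.setdefault(edge.source, []).append(edge)
    return index


def traverse(index, start, threshold=0.6, max_hops=2):
    """Breadth-first traversal that skips edges below the confidence threshold.

    Returns the edges that were followed, in visit order, so they can be
    passed to the LLM as evidence. Edges into already-visited nodes are skipped.
    """
    visited = {start}
    queue = deque([(start, 0)])
    evidence = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # hop budget used up; do not expand this node
        for edge in index.get(node, []):
            if edge.confidence < threshold or edge.target in visited:
                continue
            visited.add(edge.target)
            evidence.append(edge)
            queue.append((edge.target, hops + 1))
    return evidence


# Toy edges with made-up scores, for illustration only.
EDGES = [
    Edge("metformin", "treats", "type 2 diabetes", 0.95),
    Edge("metformin", "may_cause", "lactic acidosis", 0.70),
    Edge("metformin", "treats", "common cold", 0.10),  # spurious, low score
    Edge("type 2 diabetes", "risk_factor_for", "neuropathy", 0.85),
    Edge("neuropathy", "associated_with", "foot ulcer", 0.80),
]

evidence = traverse(build_index(EDGES), "metformin")
for e in evidence:
    print(f"{e.source} -[{e.relation}, {e.confidence:.2f}]-> {e.target}")
```

In this toy graph the low-confidence "common cold" edge is filtered out, and "foot ulcer" is not reached because it lies three hops from the start node. Raising the threshold trades coverage for reliability, which is the knob the thresholded traversal exposes.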
Key Findings
AMG-RAG (8B) reaches F1 74.1% on MEDQA.
AMG-RAG achieves 66.34% accuracy on MedMCQA, slightly above Meditron-70B.
PubMed-based MKG outperforms Wiki-based MKG on MEDQA.
Removing MKG or CoT cuts MEDQA accuracy by ~6–7 points.
The automatically built MKG is large and was validated by experts.
Results
MEDQA F1
Accuracy
Who Should Care
What To Try In 7 Days
Prototype a small MKG for one specialty: ingest PubMed abstracts, load into Neo4j, and connect to an 8B LLM via RAG.
Add simple confidence scores and a traversal threshold to filter low-reliability edges.
Run ablations: compare with/without MKG and with/without chain-of-thought to measure impact.
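The ablation step above can be organized as a small harness that scores every on/off combination of the MKG and chain-of-thought. This is a sketch under assumptions: run_ablation, the answer_fn signature, and the question format are hypothetical, and dummy_answer_fn is a stand-in that ignores both flags so the harness can run without a model. Swap in a function that calls your own pipeline.

```python
from itertools import product


def run_ablation(answer_fn, questions):
    """Return accuracy for each (use_mkg, use_cot) combination.

    answer_fn(question, use_mkg, use_cot) must return the chosen option key.
    """
    results = {}
    for use_mkg, use_cot in product([True, False], repeat=2):
        correct = sum(
            answer_fn(q, use_mkg=use_mkg, use_cot=use_cot) == q["answer"]
            for q in questions
        )
        results[(use_mkg, use_cot)] = correct / len(questions)
    return results


def dummy_answer_fn(question, use_mkg, use_cot):
    # Stand-in so the harness runs end to end: always answers "A".
    # Replace with a call into your RAG pipeline that honors both flags.
    return "A"


# Placeholder items; real runs would load MEDQA or MedMCQA questions.
QUESTIONS = [
    {"question": "placeholder 1", "answer": "A"},
    {"question": "placeholder 2", "answer": "B"},
    {"question": "placeholder 3", "answer": "A"},
    {"question": "placeholder 4", "answer": "C"},
]

results = run_ablation(dummy_answer_fn, QUESTIONS)
for (use_mkg, use_cot), acc in results.items():
    print(f"MKG={use_mkg!s:5} CoT={use_cot!s:5} accuracy={acc:.2f}")
```

Because the stand-in ignores the flags, all four rows report the same accuracy. With a real pipeline, the gap between the full configuration and each ablated one is the number to compare against the reported drop.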
Agent Features
Memory
- Retrieval memory (Chroma vector store)
- Long-term structured memory (Neo4j MKG)
Planning
- Multi-step reasoning via CoT
- Adaptive graph traversal (BFS/DFS)
Tool Use
- PubMedSearch
- WikiSearch
- Chroma vector DB
- Neo4j graph DB
Frameworks
- RAG
- Chain-of-Thought
Is Agentic
true
Architectures
- LLM-driven agents
- RAG + Chain-of-Thought
- Graph-conditioned retrieval
Collaboration
- Single-agent autonomous pipelines with external tools
Optimization Features
Infra Optimization
- Use Neo4j for efficient graph traversal; Chroma for vector search
System Optimization
- Pre-build MKG to reduce online search latency
Reproducibility
Data Urls
- MEDQA (public)
- MedMCQA (public)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Relies on external search tools, which add latency during MKG creation.
- Needs structured access to formal clinical guidelines for deployment in care settings.
- Quality depends on retrieval sources; PubMed outperformed Wikipedia in tests.
- Automated relation inference can miss clinical nuances; expert review required.
When Not To Use
- For time-critical, real-time triage where any added latency is unacceptable.
- In non-medical domains without domain-specific MKGs (untested).
- As a sole decision maker for clinical treatment without human oversight.
Failure Modes
- Noisy retrievals produce incorrect edges that propagate through reasoning.
- Miscalibrated confidence scores lead to false trust in spurious relations.
- Missing or sparse MKG coverage for niche topics causes fallback to weaker evidence.
- Over-reliance on LLM-inferred relationships without expert checks.
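One way to watch for the miscalibration failure mode is to compare edge confidence scores against a sample of expert verdicts. The sketch below computes a standard expected calibration error (ECE) over equal-width bins. The source does not describe such a check, so the function name, the bin count, and the assumption that experts label sampled edges as correct (1) or incorrect (0) are all illustrative.

```python
def expected_calibration_error(confidences, labels, n_bins=5):
    """Weighted average gap between mean confidence and observed accuracy.

    confidences: edge scores in [0, 1]
    labels: 1 if an expert judged the edge correct, else 0
    """
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, labels):
        idx = min(int(conf * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((conf, label))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        frac_correct = sum(l for _, l in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - frac_correct)
    return ece


# Well-calibrated toy sample: confident edges are right, unconfident ones are wrong.
calibrated = expected_calibration_error([1.0, 1.0, 0.0, 0.0], [1, 1, 0, 0])

# Overconfident toy sample: high scores on edges the expert rejected.
overconfident = expected_calibration_error([0.9, 0.9], [0, 0])

print(f"calibrated ECE:    {calibrated:.2f}")
print(f"overconfident ECE: {overconfident:.2f}")
```

A value near zero means the scores track expert judgment; a large value suggests the traversal threshold is filtering on numbers that do not reflect reliability.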
Core Entities
Models
- GPT-4o-mini (~8B)
Metrics
- Accuracy
- F1
Datasets
- MEDQA
- MedMCQA
Benchmarks
- MEDQA
- MedMCQA
Context Entities
Models
- Med-Gemini
- GPT-4
- Meditron-70B
- Flan-PaLM

