Overview
Combines a dynamic, confidence-weighted Medical Knowledge Graph (MKG) with retrieval-augmented generation (RAG) and chain-of-thought (CoT) reasoning. Benchmark and ablation results show clear gains, but expert validation and integration of authoritative guidelines remain necessary before clinical use.
Citations: 3
Evidence Strength: 0.70
Confidence: 0.83
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
An automated, confidence-scored medical knowledge graph lets smaller LLMs deliver near state-of-the-art medical QA, reducing compute cost and enabling more interpretable, up-to-date answers.
Summary TLDR
AMG-RAG builds a Medical Knowledge Graph (MKG) and keeps it up to date using autonomous LLM agents plus search tools (PubMed, Wikipedia). The MKG stores entities, confidence-scored edges, and short summaries, and is used inside a RAG + Chain-of-Thought pipeline. With an 8B backbone, the system achieves an F1 of 74.1% on MEDQA and an accuracy of 66.34% on MedMCQA, matching or beating much larger models. PubMed-derived MKGs outperform Wiki-derived ones. The system and examples are published on GitHub.
Problem Statement
LLMs become outdated quickly in fast-moving medicine, and vector-only retrieval struggles with multi-hop, relational queries. Hand-curated medical knowledge graphs are costly to build and quickly go stale. The paper aims to automate graph creation and continuous updates, add confidence scores, and tie structured graphs to RAG plus reasoning for more current, explainable medical QA.
Main Contribution
An autonomous pipeline where LLM agents extract entities and infer relationships from live searches to build and update a Medical Knowledge Graph (MKG).
A confidence-scored MKG that attaches numeric reliability to edges and summaries to reduce noisy or misleading retrievals.
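A minimal sketch of what a confidence-scored MKG could look like as a data structure. The class names, fields, and the update policy (keep the higher-confidence version of an edge) are illustrative assumptions, not the paper's implementation; the entity names are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    source: str
    relation: str
    target: str
    confidence: float  # reliability score, assumed to lie in [0, 1]
    summary: str = ""  # short supporting summary for the relation


@dataclass
class MedicalKnowledgeGraph:
    # Edges keyed by (source, relation, target) so re-extraction updates in place.
    edges: dict[tuple[str, str, str], Edge] = field(default_factory=dict)

    def upsert(self, edge: Edge) -> None:
        """Insert an edge, or replace it if the new one is at least as confident."""
        if not 0.0 <= edge.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        key = (edge.source, edge.relation, edge.target)
        current = self.edges.get(key)
        # Hypothetical policy: the higher-confidence version of an edge wins.
        if current is None or edge.confidence >= current.confidence:
            self.edges[key] = edge

    def neighbors(self, entity: str, min_confidence: float = 0.0) -> list[Edge]:
        """Return outgoing edges of `entity` at or above `min_confidence`."""
        return [
            e for e in self.edges.values()
            if e.source == entity and e.confidence >= min_confidence
        ]


mkg = MedicalKnowledgeGraph()
mkg.upsert(Edge("drug_a", "treats", "condition_b", 0.6, "first extraction"))
mkg.upsert(Edge("drug_a", "treats", "condition_b", 0.9, "stronger evidence"))
mkg.upsert(Edge("drug_a", "treats", "condition_b", 0.3, "weaker evidence"))
```

Keying edges by their triple means a fresh search run can revise an existing relation rather than duplicate it, which is one simple way to support continuous updates.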
Key Findings
AMG-RAG (8B) reaches F1 74.1% on MEDQA.
AMG-RAG achieves 66.34% accuracy on MedMCQA, slightly above Meditron-70B.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| MEDQA F1 | 74.1% | — | — | MEDQA test | Reported F1 for AMG-RAG (8B) using MKG+CoT | Abstract, Table 1 |
| MEDQA accuracy (PubMed-MKG) | 73.92% | Wiki-MKG 70.62% | +3.30 pp | MEDQA (ablation) | Comparison of PubMed vs Wiki MKG | Table 3 |
What To Try In 7 Days
Prototype a small MKG for one specialty: ingest PubMed abstracts, load into Neo4j, and connect to an 8B LLM via RAG.
Add simple confidence scores and a traversal threshold to filter low-reliability edges.
Run ablations: compare with/without MKG and with/without chain-of-thought to measure impact.
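The confidence threshold in the second step can be sketched as a filtered breadth-first traversal. This is a plain-Python illustration under assumed conventions (edges as `(source, relation, target, confidence)` tuples, a `min_confidence` cutoff, a `max_hops` limit); a real prototype would issue the equivalent query against the graph database, and the entity names below are placeholders.

```python
from collections import deque


def traverse(edges, start, min_confidence=0.5, max_hops=2):
    """Collect edges reachable from `start` within `max_hops`,
    ignoring any edge whose confidence is below `min_confidence`."""
    # Build an adjacency map from the edges that pass the threshold.
    adjacency = {}
    for source, relation, target, confidence in edges:
        if confidence >= min_confidence:
            adjacency.setdefault(source, []).append((relation, target, confidence))

    visited = {start}
    queue = deque([(start, 0)])
    kept = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for relation, target, confidence in adjacency.get(node, []):
            kept.append((node, relation, target, confidence))
            if target not in visited:
                visited.add(target)
                queue.append((target, hops + 1))
    return kept


toy_edges = [
    ("symptom_a", "suggests", "condition_b", 0.9),
    ("condition_b", "treated_by", "drug_c", 0.8),
    ("condition_b", "treated_by", "drug_d", 0.4),  # below threshold, dropped
]
context = traverse(toy_edges, "symptom_a", min_confidence=0.5, max_hops=2)
```

The surviving edges can then be rendered as text and placed in the LLM prompt as retrieved context. Raising `min_confidence` trades recall for fewer noisy relations, which is worth sweeping during the ablation runs.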
Agent Features
Is Agentic
Yes
Risks & Boundaries
Limitations
Relies on external search tools, which add latency during MKG creation.
Needs structured access to formal clinical guidelines for deployment in care settings.
When Not To Use
For time-critical, real-time triage where any added latency is unacceptable.
In non-medical domains without domain-specific MKGs (untested).
Failure Modes
Noisy retrievals produce incorrect edges that propagate through reasoning.
Miscalibrated confidence scores lead to false trust in spurious relations.
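One way to check for the miscalibration failure mode is to compare edge confidence scores against a sample of expert-verified edges. The sketch below computes expected calibration error (ECE), a standard calibration measure; applying it to MKG edges, and the availability of a labeled sample, are assumptions made here rather than something the paper describes.

```python
def expected_calibration_error(confidences, labels, n_bins=10):
    """ECE: bin-size-weighted average gap between mean confidence and
    observed accuracy. `labels` are True where an expert judged the edge correct."""
    if len(confidences) != len(labels):
        raise ValueError("confidences and labels must have the same length")
    if not confidences:
        raise ValueError("need at least one labeled edge")

    # Assign each edge to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, labels):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, bool(correct)))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, correct in bucket if correct) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece


# Four edges all scored 0.9, but only three were judged correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A value near zero means the scores track observed correctness; a large value suggests that any traversal threshold built on those scores should be revisited.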

