Automated agent-driven medical knowledge graphs improve medical QA and rival much larger models

February 18, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

Combines a dynamic, confidence-weighted MKG with RAG+CoT; benchmark and ablation results show clear gains, but expert validation and authoritative guideline integration remain necessary for clinical use.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.83

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Mohammad Reza Rezaei, Reza Saadati Fard, Jayson L. Parker, Rahul G. Krishnan, Milad Lankarany

Links

Abstract / PDF / Code / Data

Why It Matters For Business

An automated, confidence-scored medical knowledge graph lets smaller LLMs deliver near state-of-the-art medical QA, reducing compute cost and enabling more interpretable, up-to-date answers.

Who Should Care

Summary TLDR

AMG-RAG builds a Medical Knowledge Graph (MKG) and keeps it up to date using autonomous LLM agents plus search tools (PubMed, Wikipedia). The MKG stores entities, confidence-scored edges, and short summaries, and is used inside a RAG + Chain-of-Thought pipeline. With an 8B backbone, the system achieves an F1 of 74.1% on MEDQA and an accuracy of 66.34% on MedMCQA, matching or beating much larger models. PubMed-derived MKGs outperform Wiki-derived ones. The system and examples are published on GitHub.

Problem Statement

LLMs quickly become outdated in fast-moving medicine, and vector-only retrieval struggles with multi-hop, relational queries. Hand-curated medical knowledge graphs are costly to build and go stale. The paper aims to automate graph creation and continuous updates, add confidence scores, and tie structured graphs to RAG + reasoning for more current, explainable medical QA.

Main Contribution

An autonomous pipeline where LLM agents extract entities and infer relationships from live searches to build and update a Medical Knowledge Graph (MKG).

A confidence-scored MKG that attaches numeric reliability to edges and summaries to reduce noisy or misleading retrievals.
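As a rough illustration of the two contributions above, the following is a minimal in-memory sketch of a graph whose entities carry short summaries and whose edges carry a numeric confidence. The class names, the `[0, 1]` confidence range, and the example values are assumptions made for illustration; they are not taken from the paper or its code, which uses Neo4j for storage.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    summary: str  # short description, e.g. distilled from search results


@dataclass
class Edge:
    source: str
    relation: str
    target: str
    confidence: float  # assumed here to lie in [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


@dataclass
class MedicalKnowledgeGraph:
    entities: dict[str, Entity] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)

    def add_entity(self, name: str, summary: str) -> None:
        # Re-adding an entity overwrites its summary, mimicking an update.
        self.entities[name] = Entity(name, summary)

    def add_edge(self, source: str, relation: str, target: str,
                 confidence: float) -> None:
        for name in (source, target):
            if name not in self.entities:
                raise KeyError(f"unknown entity: {name}")
        self.edges.append(Edge(source, relation, target, confidence))

    def neighbors(self, name: str, min_confidence: float = 0.0) -> list[Edge]:
        # Outgoing edges whose confidence meets the threshold.
        return [e for e in self.edges
                if e.source == name and e.confidence >= min_confidence]


# Illustrative values only; the confidence score is made up.
mkg = MedicalKnowledgeGraph()
mkg.add_entity("metformin", "Oral biguanide used for type 2 diabetes.")
mkg.add_entity("type 2 diabetes", "Chronic condition with insulin resistance.")
mkg.add_edge("metformin", "treats", "type 2 diabetes", 0.95)
```

Keeping the confidence on the edge rather than on the entity lets a retriever filter individual relations without discarding the entity itself.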

Key Findings

AMG-RAG (8B) reaches F1 74.1% on MEDQA.

Numbers: F1 = 74.1% (MEDQA)

Practical Use: You can match much larger models on MEDQA by adding a dynamic MKG plus CoT reasoning to a small LLM.

Evidence Ref: Abstract, Fig. 1, Table 1

AMG-RAG achieves 66.34% accuracy on MedMCQA, slightly above Meditron-70B.

Numbers: Acc = 66.34% vs Meditron 66.0% (MedMCQA)

Practical Use: A compact (8B) system with MKG + retrieval can compete with 70B models on medical multiple-choice tasks.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
MEDQA F1 | 74.1% | - | - | MEDQA test | Reported F1 for AMG-RAG (8B) using MKG+CoT | Abstract, Table 1
Accuracy | 73.92% | Wiki-MKG 70.62% | +3.3 pp | MEDQA (ablation) | Comparison of PubMed vs Wiki MKG | Table 3
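
The deltas quoted on this page are plain differences in percentage points. A small check of that arithmetic, using only figures reported above (the helper name `delta_pp` is ours):

```python
def delta_pp(value: float, baseline: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(value - baseline, 2)


# PubMed-MKG vs Wiki-MKG accuracy on MEDQA (ablation).
pubmed_vs_wiki = delta_pp(73.92, 70.62)

# AMG-RAG vs Meditron-70B accuracy on MedMCQA.
amg_vs_meditron = delta_pp(66.34, 66.0)

print(f"PubMed vs Wiki MKG: {pubmed_vs_wiki:+.2f} pp")
print(f"AMG-RAG vs Meditron-70B: {amg_vs_meditron:+.2f} pp")
```

The MedMCQA margin is roughly a tenth of the ablation margin, which is why the finding is worded as "slightly above".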

What To Try In 7 Days

Prototype a small MKG for one specialty: ingest PubMed abstracts, load into Neo4j, and connect to an 8B LLM via RAG.

Add simple confidence scores and a traversal threshold to filter low-reliability edges.

Run ablations: compare with/without MKG and with/without chain-of-thought to measure impact.
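
The confidence threshold in the second item can be sketched as a breadth-first traversal that skips low-confidence edges and then turns the surviving triples into retrieval context. This is a toy version under stated assumptions: the adjacency-list layout, the 0.5 default threshold, the two-hop limit, and the confidence values are all hypothetical, and a real prototype would query Neo4j instead of a Python dict.

```python
from collections import deque

# Hypothetical toy graph: node -> list of (relation, neighbor, confidence).
GRAPH = {
    "chest pain": [
        ("symptom_of", "myocardial infarction", 0.9),
        ("symptom_of", "costochondritis", 0.4),
    ],
    "myocardial infarction": [("treated_with", "aspirin", 0.85)],
    "costochondritis": [("treated_with", "ibuprofen", 0.8)],
}


def traverse(graph, start, min_confidence=0.5, max_hops=2):
    """Breadth-first traversal that ignores edges below min_confidence."""
    visited = {start}
    queue = deque([(start, 0)])
    kept = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for relation, neighbor, conf in graph.get(node, []):
            if conf < min_confidence:
                continue  # drop low-reliability edge and everything behind it
            kept.append((node, relation, neighbor, conf))
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return kept


def to_context(triples):
    """Format kept triples as plain-text context for a RAG prompt."""
    return "\n".join(
        f"{s} --{r}--> {t} (confidence {c:.2f})" for s, r, t, c in triples
    )


triples = traverse(GRAPH, "chest pain", min_confidence=0.5)
print(to_context(triples))
```

Because the 0.4 edge is filtered at the first hop, nothing reachable only through it enters the context. Re-running with `min_confidence=0.0` gives a quick with/without-filter comparison for the ablation step.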

Agent Features

Memory
Retrieval memory (Chroma vector store); Long-term structured memory (Neo4j MKG)
Planning
Multi-step reasoning via CoT; Adaptive graph traversal (BFS/DFS)
Tool Use
PubMedSearch, WikiSearch, Chroma vector DB, Neo4j graph DB
Frameworks
RAG, Chain-of-Thought
Is Agentic

Yes

Architectures
LLM-driven agents; RAG + Chain-of-Thought; Graph-conditioned retrieval
Collaboration
Single-agent autonomous pipelines with external tools

Optimization Features

Infra Optimization
Use Neo4j for efficient graph traversal; Chroma for vector search
System Optimization
Pre-build MKG to reduce online search latency
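
One way to picture the pre-build optimization is a store that is filled offline and only falls back to a live search on a miss. The sketch below is an assumption-laden stand-in: `PrebuiltStore` and `fake_search` are hypothetical names, and the search function is a placeholder rather than a real PubMed or Wikipedia client.

```python
class PrebuiltStore:
    """Serves pre-fetched results; falls back to live search on a miss."""

    def __init__(self, search_fn):
        self._search_fn = search_fn
        self._store = {}
        self.live_calls = 0  # counts searches made at query time

    def prebuild(self, terms):
        # Offline step: pay the search latency once, ahead of time.
        for term in terms:
            self._store[term] = self._search_fn(term)

    def lookup(self, term):
        if term not in self._store:
            # Online miss: search now and remember the result.
            self.live_calls += 1
            self._store[term] = self._search_fn(term)
        return self._store[term]


def fake_search(term):
    # Hypothetical placeholder for a slow external search call.
    return [f"abstract mentioning {term}"]


store = PrebuiltStore(fake_search)
store.prebuild(["sepsis", "metformin"])
store.lookup("sepsis")  # served from the pre-built store
store.lookup("gout")    # miss: triggers one live search
store.lookup("gout")    # now cached
print(store.live_calls)
```

Tracking `live_calls` gives a direct measure of how much online search the pre-built graph is actually avoiding.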

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

MEDQA (public), MedMCQA (public)

Risks & Boundaries

Limitations

Relies on external search tools, which add latency during MKG creation.

Needs structured access to formal clinical guidelines for deployment in care settings.

When Not To Use

For time-critical, real-time triage where any added latency is unacceptable.

In non-medical domains without domain-specific MKGs (untested).

Failure Modes

Noisy retrievals produce incorrect edges that propagate through reasoning.

Miscalibrated confidence scores lead to false trust in spurious relations.

Core Entities

Models

GPT4o-mini (∼8B)

Metrics

Accuracy, F1

Datasets

MEDQA, MedMCQA

Benchmarks

MEDQA, MedMCQA

Context Entities

Models

Med-Gemini, GPT-4, Meditron-70B, Flan-PaLM