Automated agent-driven medical knowledge graphs improve medical QA and rival much larger models

Overview

Decision Snapshot: Needs Validation

Combines a dynamic, confidence-weighted MKG with RAG+CoT; benchmark and ablation results show clear gains, but expert validation and authoritative guideline integration remain necessary for clinical use.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.83

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Mohammad Reza Rezaei, Reza Saadati Fard, Jayson L. Parker, Rahul G. Krishnan, Milad Lankarany

Links

Abstract / PDF / Code / Data

Why It Matters For Business

An automated, confidence-scored medical knowledge graph lets smaller LLMs deliver near state-of-the-art medical QA, reducing compute cost and enabling more interpretable, up-to-date answers.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

AMG-RAG builds a Medical Knowledge Graph (MKG) and keeps it up to date using autonomous LLM agents plus search tools (PubMed, Wikipedia). The MKG stores entities, confidence-scored edges, and short summaries, and is used inside a RAG + Chain-of-Thought pipeline. With an 8B backbone, the system reaches an F1 of 74.1% on MEDQA and an accuracy of 66.34% on MedMCQA, matching or beating much larger models. PubMed-derived MKGs outperform Wiki-derived ones. The system and examples are published on GitHub.

Problem Statement

LLM knowledge goes out of date quickly in fast-moving medicine, and vector-only retrieval struggles with multi-hop, relational queries. Hand-curated medical knowledge graphs are costly to build and quickly become stale. The paper aims to automate graph creation and continuous updates, add confidence scores, and tie structured graphs to RAG + reasoning for more current, explainable medical QA.

Main Contribution

An autonomous pipeline where LLM agents extract entities and infer relationships from live searches to build and update a Medical Knowledge Graph (MKG).

A confidence-scored MKG that attaches numeric reliability to edges and summaries to reduce noisy or misleading retrievals.
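To make the idea of a confidence-scored graph concrete, here is a minimal in-memory sketch. It is not the authors' schema: the `Edge` and `MedicalKnowledgeGraph` names, the rule of keeping the higher-confidence version of a repeated edge, and the example triples and scores are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    source: str
    relation: str
    target: str
    confidence: float  # reliability score in [0, 1]
    summary: str = ""  # short evidence summary attached to the edge


@dataclass
class MedicalKnowledgeGraph:
    # Keyed by (source, relation, target) so repeated extractions update in place.
    edges: dict = field(default_factory=dict)

    def upsert(self, edge: Edge) -> None:
        """Insert an edge, or keep the higher-confidence version if it exists."""
        if not 0.0 <= edge.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        key = (edge.source, edge.relation, edge.target)
        current = self.edges.get(key)
        if current is None or edge.confidence > current.confidence:
            self.edges[key] = edge

    def neighbors(self, entity: str, min_confidence: float = 0.0) -> list:
        """Return outgoing edges from `entity` at or above the threshold."""
        return [
            e for e in self.edges.values()
            if e.source == entity and e.confidence >= min_confidence
        ]


# Illustrative triples with made-up confidence scores (hypothetical data).
mkg = MedicalKnowledgeGraph()
mkg.upsert(Edge("metformin", "treats", "type 2 diabetes", 0.9, "First-line therapy."))
mkg.upsert(Edge("metformin", "treats", "type 2 diabetes", 0.6))  # lower score, ignored
mkg.upsert(Edge("metformin", "causes", "weight gain", 0.2))      # low-confidence edge
reliable = mkg.neighbors("metformin", min_confidence=0.5)
```

Filtering at read time, as `neighbors` does, keeps low-confidence edges in the graph for later review while stopping them from reaching the retrieval step.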

Key Findings

AMG-RAG (8B) reaches F1 74.1% on MEDQA.

Numbers: F1 = 74.1% (MEDQA)

Practical Use: You can match much larger models on MEDQA by adding a dynamic MKG plus CoT reasoning to a small LLM.

Evidence Ref: Abstract, Fig. 1, Table 1

AMG-RAG achieves 66.34% accuracy on MedMCQA, slightly above Meditron-70B.

Numbers: Acc = 66.34% vs Meditron-70B 66.0% (MedMCQA)

Practical Use: A compact (8B) system with MKG + retrieval can compete with 70B models on medical multiple-choice tasks.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
MEDQA F1	74.1%	—	—	MEDQA test	Reported F1 for AMG-RAG (8B) using MKG+CoT	Abstract, Table 1
Accuracy (PubMed-MKG)	73.92%	Wiki-MKG 70.62%	+3.3 pp	MEDQA (ablation)	Comparison of PubMed vs Wiki MKG	Table 3
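The Delta column reports an absolute difference in percentage points, which is easy to confuse with a relative gain. The short sketch below computes both from the two accuracies in the ablation row, assuming 73.92% is the PubMed-derived MKG result and 70.62% the Wiki-derived baseline; the helper names are illustrative.

```python
def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(value_pct - baseline_pct, 2)


def relative_gain_pct(value_pct: float, baseline_pct: float) -> float:
    """Relative improvement over the baseline, as a percentage of the baseline."""
    return round(100.0 * (value_pct - baseline_pct) / baseline_pct, 2)


pubmed_acc, wiki_acc = 73.92, 70.62  # figures taken from the ablation row
print(delta_pp(pubmed_acc, wiki_acc))           # absolute gain in pp
print(relative_gain_pct(pubmed_acc, wiki_acc))  # gain relative to the baseline
```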

What To Try In 7 Days

Prototype a small MKG for one specialty: ingest PubMed abstracts, load into Neo4j, and connect to an 8B LLM via RAG.

Add simple confidence scores and a traversal threshold to filter low-reliability edges.

Run ablations: compare with/without MKG and with/without chain-of-thought to measure impact.
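The confidence threshold from the second step can be prototyped before any graph database is involved. The following is a minimal sketch, assuming the graph is a plain dictionary that maps each entity to `(neighbor, confidence)` pairs; the function name, the default threshold of 0.5, the hop limit, and the toy graph are assumptions for illustration, not values from the paper.

```python
from collections import deque


def bfs_with_threshold(graph, start, min_confidence=0.5, max_hops=2):
    """Breadth-first traversal that skips edges below `min_confidence`.

    `graph` maps an entity to a list of (neighbor, confidence) pairs.
    Returns {entity: hop_count} for every entity reached within `max_hops`.
    """
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor, confidence in graph.get(node, []):
            if confidence < min_confidence or neighbor in hops:
                continue  # filter low-reliability edges and visited nodes
            hops[neighbor] = hops[node] + 1
            queue.append(neighbor)
    return hops


# Toy graph with made-up confidence scores (hypothetical data).
toy_graph = {
    "chest pain": [("myocardial infarction", 0.8), ("anxiety", 0.3)],
    "myocardial infarction": [("troponin test", 0.9)],
    "troponin test": [("cardiology referral", 0.7)],
}
reached = bfs_with_threshold(toy_graph, "chest pain", min_confidence=0.5, max_hops=2)
```

Here the 0.3 edge to "anxiety" is filtered out by the threshold, and "cardiology referral" is not reached because it lies three hops from the start. Sweeping `min_confidence` during the ablation runs is one way to see how much the filter contributes.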

Agent Features

Memory

Retrieval memory (Chroma vector store)

Long-term structured memory (Neo4j MKG)

Planning

Multi-step reasoning via CoT

Adaptive graph traversal (BFS/DFS)

Tool Use

PubMedSearch, WikiSearch, Chroma vector DB, Neo4j graph DB

Frameworks

RAG, Chain-of-Thought

Is Agentic

Yes

Architectures

LLM-driven agents

RAG + Chain-of-Thought

Graph-conditioned retrieval

Collaboration

Single-agent autonomous pipelines with external tools

Optimization Features

Infra Optimization

Use Neo4j for efficient graph traversal; Chroma for vector search

System Optimization

Pre-build MKG to reduce online search latency
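One way to picture the pre-build optimization is a store that is filled offline and only falls back to a slow search on a miss. This is a sketch under assumptions: `PrebuiltMKG`, `fake_search`, and the entity names are hypothetical, and the stand-in search function replaces what would be a real PubMed or Wikipedia call.

```python
class PrebuiltMKG:
    def __init__(self, search_fn):
        self._search_fn = search_fn  # slow external lookup (hypothetical)
        self._summaries = {}
        self.online_searches = 0     # counts searches issued at query time

    def prebuild(self, entities):
        """Offline step: resolve every known entity before serving queries."""
        for entity in entities:
            self._summaries[entity] = self._search_fn(entity)

    def lookup(self, entity):
        """Online step: serve from the pre-built store, search only on a miss."""
        if entity not in self._summaries:
            self.online_searches += 1
            self._summaries[entity] = self._search_fn(entity)
        return self._summaries[entity]


def fake_search(entity):
    # Stand-in for a slow PubMed or Wikipedia call; returns a placeholder summary.
    return f"summary for {entity}"


store = PrebuiltMKG(fake_search)
store.prebuild(["aspirin", "warfarin"])
store.lookup("aspirin")  # served from the pre-built store
store.lookup("heparin")  # miss: triggers one online search
```

Tracking `online_searches` gives a simple measure of how much query-time latency the pre-built graph avoids for a given workload.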

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/MrRezaeiUofT/AMG-RAG

Data URLs

MEDQA (public), MedMCQA (public)

Risks & Boundaries

Limitations

Relies on external search tools, which add latency during MKG creation.

Needs structured access to formal clinical guidelines for deployment in care settings.

When Not To Use

For time-critical, real-time triage where any added latency is unacceptable.

In non-medical domains without domain-specific MKGs (untested).

Failure Modes

Noisy retrievals produce incorrect edges that propagate through reasoning.

Miscalibrated confidence scores lead to false trust in spurious relations.
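Miscalibration can be checked by comparing edge confidences against expert verdicts on a sample of edges. The sketch below computes expected calibration error (ECE), a standard metric for this; it is not described in the source, and the sample confidences and verdicts are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    if not confidences:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece


# Hypothetical audit: edge confidences paired with expert verdicts.
edge_confidences = [0.9, 0.9, 0.9, 0.9]
expert_says_correct = [True, True, False, False]
gap = expected_calibration_error(edge_confidences, expert_says_correct)
```

In this toy audit the graph claims 0.9 confidence for edges that experts accept only half the time, so the gap is large. A value near zero would indicate that the scores can be read roughly as probabilities.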

Core Entities

Models

GPT4o-mini (∼8B)

Metrics

Accuracy, F1

Datasets

MEDQA, MedMCQA

Benchmarks

MEDQA, MedMCQA

Context Entities

Models

Med-Gemini, GPT-4, Meditron-70B, Flan-PaLM
