Automated agent-driven medical knowledge graphs improve medical QA and rival much larger models

February 18, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

3

Authors

Mohammad Reza Rezaei, Reza Saadati Fard, Jayson L. Parker, Rahul G. Krishnan, Milad Lankarany

Why It Matters For Business

An automated, confidence-scored medical knowledge graph lets smaller LLMs deliver near state-of-the-art medical QA, reducing compute cost and enabling more interpretable, up-to-date answers.

Summary TLDR

AMG-RAG builds a Medical Knowledge Graph (MKG) and keeps it up to date using autonomous LLM agents plus search tools (PubMed, Wikipedia). The MKG stores entities, confidence-scored edges, and short summaries, and is used inside a RAG + Chain-of-Thought pipeline. With an 8B backbone, the system reaches F1 74.1% on MEDQA and 66.34% accuracy on MedMCQA, matching or beating much larger models. PubMed-derived MKGs outperform Wiki-derived ones. The system and examples are published on GitHub.

Problem Statement

LLMs fall out of date quickly in fast-moving medicine, and vector-only retrieval struggles with multi-hop, relational queries. Hand-curated medical knowledge graphs are costly to build and go stale. The paper aims to automate graph creation and continuous updates, add confidence scores, and tie structured graphs to RAG + reasoning for more current, explainable medical QA.

Main Contribution

An autonomous pipeline where LLM agents extract entities and infer relationships from live searches to build and update a Medical Knowledge Graph (MKG).

A confidence-scored MKG that attaches numeric reliability to edges and summaries to reduce noisy or misleading retrievals.

A graph-conditioned RAG+Chain-of-Thought (CoT) inference pipeline that traverses the MKG with thresholding to produce interpretable, multi-hop medical answers.
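The thresholded traversal described above can be illustrated with a minimal sketch. This is not the paper's implementation: the toy graph, the edge values, and the names `MKG`, `traverse`, `to_context`, `min_conf`, and `max_hops` are all assumptions made for illustration, and a plain breadth-first walk stands in for whatever traversal policy the authors use.

```python
from collections import deque

# Hypothetical toy MKG: node -> list of (neighbor, relation, confidence) edges.
# Entities and confidence values are invented for illustration only.
MKG = {
    "metformin": [
        ("type 2 diabetes", "treats", 0.95),
        ("lactic acidosis", "may_cause", 0.60),
    ],
    "type 2 diabetes": [
        ("insulin resistance", "associated_with", 0.90),
        ("gout", "associated_with", 0.30),
    ],
    "insulin resistance": [],
    "lactic acidosis": [],
    "gout": [],
}


def traverse(graph, start, min_conf=0.5, max_hops=2):
    """Breadth-first walk that only follows edges with confidence >= min_conf."""
    facts = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # hop budget exhausted for this branch
        for neighbor, relation, conf in graph.get(node, []):
            if conf < min_conf:
                continue  # drop low-reliability edges before they reach the prompt
            facts.append((node, relation, neighbor, conf))
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return facts


def to_context(facts):
    """Render retained edges as plain-text lines suitable for an LLM prompt."""
    return "\n".join(f"{h} --{r}--> {t} (confidence {c:.2f})" for h, r, t, c in facts)


facts = traverse(MKG, "metformin", min_conf=0.5, max_hops=2)
print(to_context(facts))
```

Raising `min_conf` trades coverage for reliability: at 0.7 the sketch keeps only the two strongest edges, which is the kind of knob the thresholding step exposes.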

Key Findings

AMG-RAG (8B) reaches F1 74.1% on MEDQA.

Numbers: F1 = 74.1% (MEDQA)

AMG-RAG achieves 66.34% accuracy on MedMCQA, slightly above Meditron-70B.

Numbers: Acc = 66.34% vs Meditron-70B 66.0% (MedMCQA)

PubMed-based MKG outperforms Wiki-based MKG on MEDQA.

Numbers: PubMed-MKG acc = 73.92% vs Wiki-MKG acc = 70.62%

Removing MKG or CoT cuts MEDQA accuracy by about 7 points.

Numbers: No-MKG: 67.16%; No-MKG&CoT: 66.69% (from 73.92%)

The automatically built MKG is large and was validated by experts.

Numbers: ≈76,681 nodes, 354,299 edges; LLM expert scores ~8.8–8.9/10
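As a quick check of the ablation finding above, the point drops can be computed directly from the reported MEDQA accuracies. The dictionary keys are labels chosen here for readability, not names from the paper.

```python
# Reported MEDQA accuracies (percent) from the ablation finding.
full_system = 73.92
ablations = {
    "without_mkg": 67.16,
    "without_mkg_and_cot": 66.69,
}

# Absolute drop in percentage points relative to the full system.
drops = {name: round(full_system - acc, 2) for name, acc in ablations.items()}
print(drops)  # {'without_mkg': 6.76, 'without_mkg_and_cot': 7.23}
```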

Results

  • MEDQA F1: 74.1%
  • MEDQA accuracy (PubMed-MKG): 73.92% (baseline: Wiki-MKG, 70.62%)
  • MedMCQA accuracy: 66.34% (baseline: Meditron-70B, 66.0%)
  • MEDQA accuracy (No-MKG ablation): 67.16% (baseline: PubMed-MKG, 73.92%)

Who Should Care

What To Try In 7 Days

Prototype a small MKG for one specialty: ingest PubMed abstracts, load into Neo4j, and connect to an 8B LLM via RAG.

Add simple confidence scores and a traversal threshold to filter low-reliability edges.

Run ablations: compare with/without MKG and with/without chain-of-thought to measure impact.
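The ablation step above could be organized as a small harness that toggles the MKG and chain-of-thought flags and scores each configuration. This is a sketch under stated assumptions: `EVAL_SET`, `accuracy`, `run_ablations`, and the `use_mkg` / `use_cot` flags are hypothetical names, and `always_a` is a stand-in for a real pipeline call, not a model.

```python
from itertools import product

# Hypothetical evaluation set of (question, gold answer) pairs.
EVAL_SET = [("q1", "A"), ("q2", "C"), ("q3", "B"), ("q4", "D")]


def accuracy(answer_fn, eval_set, **flags):
    """Fraction of questions where answer_fn matches the gold answer."""
    correct = sum(answer_fn(question, **flags) == gold for question, gold in eval_set)
    return correct / len(eval_set)


def run_ablations(answer_fn, eval_set):
    """Score every on/off combination of the MKG and chain-of-thought flags."""
    results = {}
    for use_mkg, use_cot in product([True, False], repeat=2):
        label = f"mkg={use_mkg}, cot={use_cot}"
        results[label] = accuracy(answer_fn, eval_set, use_mkg=use_mkg, use_cot=use_cot)
    return results


def always_a(question, use_mkg, use_cot):
    """Placeholder pipeline that ignores its inputs; swap in the real RAG + LLM call."""
    return "A"


results = run_ablations(always_a, EVAL_SET)
for label, acc in results.items():
    print(f"{label}: {acc:.2%}")
```

Because the placeholder ignores both flags, every configuration scores the same here; differences only appear once `always_a` is replaced by a function that actually consults the graph and prompts the model.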

Agent Features

Memory

  • Retrieval memory (Chroma vector store)
  • Long-term structured memory (Neo4j MKG)

Planning

  • Multi-step reasoning via CoT
  • Adaptive graph traversal (BFS/DFS)

Tool Use

  • PubMedSearch
  • WikiSearch
  • Chroma vector DB
  • Neo4j graph DB

Frameworks

  • RAG
  • Chain-of-Thought

Is Agentic

true

Architectures

  • LLM-driven agents
  • RAG + Chain-of-Thought
  • Graph-conditioned retrieval

Collaboration

  • Single-agent autonomous pipelines with external tools

Optimization Features

Infra Optimization

  • Use Neo4j for efficient graph traversal; Chroma for vector search

System Optimization

  • Pre-build MKG to reduce online search latency
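One way to picture the pre-build optimization is a store that is filled offline and only falls back to live search on a miss. The class `PrebuiltMKG`, its methods, and the stub `fake_search` are hypothetical names for this sketch; the real system would call its PubMed or Wikipedia tools where the stub is.

```python
class PrebuiltMKG:
    """Edge store filled ahead of time; live search runs only on a cache miss."""

    def __init__(self, search_fn):
        self._edges = {}
        self._search_fn = search_fn
        self.online_calls = 0  # number of query-time fallbacks to live search

    def prebuild(self, terms):
        """Offline step: fetch and store edges for the terms expected at query time."""
        for term in terms:
            self._edges[term] = self._search_fn(term)

    def lookup(self, term):
        """Query-time step: serve from the store, searching live only if needed."""
        if term not in self._edges:
            self.online_calls += 1
            self._edges[term] = self._search_fn(term)
        return self._edges[term]


def fake_search(term):
    """Stub standing in for a slow external search tool."""
    return [(term, "related_to", "placeholder entity", 0.8)]


mkg = PrebuiltMKG(fake_search)
mkg.prebuild(["asthma", "hypertension"])
mkg.lookup("asthma")  # served from the pre-built store
mkg.lookup("copd")    # miss: triggers one live search, then cached
print(mkg.online_calls)
```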

Reproducibility

Data Urls

  • MEDQA (public)
  • MedMCQA (public)

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Relies on external search tools, which add latency during MKG creation.
  • Needs structured access to formal clinical guidelines for deployment in care settings.
  • Quality depends on retrieval sources; PubMed outperformed Wikipedia in tests.
  • Automated relation inference can miss clinical nuances; expert review required.

When Not To Use

  • For time-critical, real-time triage where any added latency is unacceptable.
  • In non-medical domains without domain-specific MKGs (untested).
  • As a sole decision maker for clinical treatment without human oversight.

Failure Modes

  • Noisy retrievals produce incorrect edges that propagate through reasoning.
  • Miscalibrated confidence scores lead to false trust in spurious relations.
  • Missing or sparse MKG coverage for niche topics causes fallback to weaker evidence.
  • Over-reliance on LLM-inferred relationships without expert checks.
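The miscalibration failure mode can be monitored by comparing edge confidence scores against expert judgments on a sample of edges. The sketch below uses expected calibration error (ECE), a standard metric chosen here for illustration rather than something the paper reports; the function name, the binning scheme, and the sample inputs are assumptions.

```python
def expected_calibration_error(confidences, labels, n_bins=5):
    """Weighted average gap between mean confidence and observed correctness per bin.

    confidences: edge confidence scores in [0, 1]
    labels: 1 if an expert judged the edge correct, else 0
    """
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, labels):
        idx = min(int(conf * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        bins[idx].append((conf, label))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        frac_correct = sum(l for _, l in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - frac_correct)
    return ece


# Invented sample: high-confidence edges that experts mostly rejected.
sample_conf = [0.9, 0.9, 0.9, 0.9]
sample_labels = [1, 0, 0, 0]
print(expected_calibration_error(sample_conf, sample_labels))
```

A value near 0 means the scores track expert agreement; a large value on high-confidence bins is the warning sign for the false-trust failure listed above.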

Core Entities

Models

  • GPT-4o-mini (~8B)

Metrics

  • Accuracy
  • F1

Datasets

  • MEDQA
  • MedMCQA

Benchmarks

  • MEDQA
  • MedMCQA

Context Entities

Models

  • Med-Gemini
  • GPT-4
  • Meditron-70B
  • Flan-PaLM