Survey of LLM-based medical agents: architectures, applications, and safety gaps

February 16, 2025 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

8

Authors

Wenxuan Wang, Zizhan Ma, Zheng Wang, Chenghan Wu, Jiaming Ji, Wenting Chen, Xiang Li, Yixuan Yuan

Links

Abstract / PDF

Why It Matters For Business

LLM agents can cut clinician workload and improve documentation and training, but current models need workflow-style validation, bias checks, and human oversight before clinical deployment.

Summary TLDR

This is a systematic survey of 60 papers (2022–2024) about using large language model (LLM) agents in medicine. It breaks agent systems into profiles, planning, medical reasoning, and external tools. The survey catalogs clinical uses (decision support, documentation, training, operations), reviews benchmarks (static QA, workflow simulations, automated evaluation), and flags hard problems: hallucinations, multimodal fusion, bias, privacy, and system integration. Practical takeaway: LLM agents show promise but are not yet ready for unsupervised clinical use; rigorous, workflow-style evaluation and human oversight are needed.

Problem Statement

LLM-based agents promise to assist with clinical tasks, but medical settings demand stronger safety guarantees, multimodal handling, auditability, and cross-department integration than current LLMs and benchmarks provide. The survey asks: how well do agent architectures, planning, reasoning, evaluation, and deployment practices meet those medical requirements?

Main Contribution

Structured taxonomy of LLM-based medical agent components: profile, clinical planning, medical reasoning, and external capacity.

Catalog of application scenarios and representative systems across diagnosis, documentation, training, and service optimization.

Review of evaluation approaches and benchmarks, highlighting gaps between static QA tests and workflow-style clinical evaluation.

Identification of key technical and ethical challenges and concrete future directions (hallucination control, multimodal fusion, real-time correction, integration with physical systems).

Key Findings

Surveyed literature size and scope.

Numbers: 60 studies reviewed (from ~300 initial hits, 80 shortlisted).

Static QA benchmarks are abundant but limited for clinical workflows.

Numbers: MedMCQA contains 194,000 questions across 2,400 topics.

Workflow benchmarks provide richer clinical tests.

Numbers: MedChain includes 12,163 cases and 7,338 medical images.

Bias and poor precision exist in current medical LLMs.

Numbers: BiasMedQA found precision can fall below 80%, with some models near 50%.

Results

Survey coverage

Value: 60 studies reviewed

Baseline: initial search ~300 papers

Size of MedMCQA dataset

Value: 194,000 questions across 2,400 topics

BiasMedQA precision range reported

Value: Precision often <80%, some models ~50%

Who Should Care

What To Try In 7 Days

Run small workflow-style tests using AgentClinic or MedChain cases to mimic real tasks.

Audit candidate models on local patient samples for bias and precision.

Prototype a retrieval-augmented pipeline with EHR access and guideline checks.
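The bias and precision audit suggested above can be sketched in a few lines. This is a minimal illustration, not a method from the survey: the record layout (group, predicted positive, actually positive), the function names, the toy data, and the 0.8 flagging threshold are all assumptions chosen for the example. The threshold simply echoes the 80% figure reported for BiasMedQA.

```python
from collections import defaultdict

def precision_by_group(records):
    """Return {group: precision} for binary predictions.

    Each record is a (group, predicted_positive, actually_positive) tuple.
    Groups with no positive predictions map to None (precision undefined).
    """
    tp = defaultdict(int)  # true positives per group
    fp = defaultdict(int)  # false positives per group
    groups = set()
    for group, predicted, actual in records:
        groups.add(group)
        if predicted and actual:
            tp[group] += 1
        elif predicted and not actual:
            fp[group] += 1
    result = {}
    for g in groups:
        denom = tp[g] + fp[g]
        result[g] = tp[g] / denom if denom else None
    return result

def flag_low_precision(per_group, threshold=0.8):
    """List groups whose precision is defined and below the threshold."""
    return sorted(g for g, p in per_group.items() if p is not None and p < threshold)

# Hypothetical toy data; a real audit would use de-identified local samples.
sample = [
    ("group_a", True, True),
    ("group_a", True, True),
    ("group_a", True, False),
    ("group_a", False, True),
    ("group_b", True, True),
    ("group_b", True, False),
    ("group_b", False, False),
]

per_group = precision_by_group(sample)
flagged = flag_low_precision(per_group)
print(per_group)
print(flagged)
```

Comparing precision across groups, rather than reporting one aggregate number, is what makes this a bias check: a model can look acceptable overall while performing much worse for an underrepresented group.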

Agent Features

Memory

  • Long-term Memory
  • Retrieval Memory
  • Experience Base

Planning

  • Task Decomposition
  • Adaptive Planning
  • Multi-Agent Collaboration
  • Iterative Self-Evolution

Tool Use

  • Knowledge Graphs
  • Medical Calculators
  • EHR Interfaces
  • Image Analysis
  • Retrieval-augmented Generation
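
As a rough illustration of the tool-use pattern listed above, the sketch below shows an agent-side dispatcher that routes a structured tool request to a registered medical calculator. Everything here is hypothetical: the registry, the `call_tool` request format, and the choice of BMI as the example calculator are assumptions for the example, not interfaces described in the survey.

```python
from typing import Callable, Dict

# Hypothetical registry mapping tool names to plain Python functions.
TOOLS: Dict[str, Callable[..., float]] = {}

def register_tool(name: str):
    """Decorator that adds a function to the tool registry under `name`."""
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator

@register_tool("bmi")
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight_kg and height_m must be positive")
    return weight_kg / height_m ** 2

def call_tool(request: dict) -> dict:
    """Dispatch a request of the form {"tool": name, "args": {...}}.

    Errors are returned as data rather than raised, so the calling agent
    can see the failure and decide how to proceed.
    """
    name = request.get("tool")
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        value = TOOLS[name](**request.get("args", {}))
    except (TypeError, ValueError) as exc:
        return {"ok": False, "error": str(exc)}
    return {"ok": True, "tool": name, "value": value}

result = call_tool({"tool": "bmi", "args": {"weight_kg": 70, "height_m": 1.75}})
print(result)
```

Delegating arithmetic to a deterministic function, instead of asking the model to compute it in text, removes one source of numeric error and leaves a structured record of which tool was called with which arguments.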

Frameworks

  • ReAct
  • Chain-of-Thought
  • Tree-of-Thought
  • MDAgents
  • Polaris
  • Agent Hospital

Is Agentic

true

Architectures

  • Single-Agent
  • Sequential Task Chain
  • Collaborative Experts
  • Iterative Evolution

Collaboration

  • Multi-Agent Coordination
  • Agent Communication
  • Departmental Organization

Optimization Features

System Optimization

  • Federated learning for cross-site adaptability

Training Optimization

  • RL
  • Simulation-driven iterative training (Agent Hospital)

Inference Optimization

  • Inference-time scaling to allow longer reasoning

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Covers literature up to early 2024; rapid developments after that may not be included.
  • Search focused on English-language publications in major databases.
  • Survey paper: no new empirical experiments or unified benchmark comparisons presented.

When Not To Use

  • Do not deploy agent outputs as autonomous clinical decisions without human oversight.
  • Avoid using current agents for high-stakes care where errors risk patient harm.
  • Do not rely solely on static QA scores to assess clinical readiness.
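
One way to enforce the human-oversight rule above in software is a review gate: agent output is held as pending and cannot be released until a named clinician approves it. The sketch below is a minimal illustration under that assumption. The `ReviewGate` class, its status values, and the sample text are hypothetical and do not come from any system covered in the survey.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Suggestion:
    text: str
    status: str = "pending_review"  # "pending_review", "approved", or "rejected"
    reviewer: Optional[str] = None

class ReviewGate:
    """Holds agent suggestions until a human reviewer approves or rejects them."""

    def __init__(self) -> None:
        self._items: Dict[int, Suggestion] = {}
        self._next_id = 1

    def submit(self, text: str) -> int:
        """Store an agent suggestion as pending and return its id."""
        item_id = self._next_id
        self._next_id += 1
        self._items[item_id] = Suggestion(text=text)
        return item_id

    def review(self, item_id: int, reviewer: str, approve: bool) -> None:
        """Record a human decision together with who made it."""
        item = self._items[item_id]
        item.status = "approved" if approve else "rejected"
        item.reviewer = reviewer

    def release(self, item_id: int) -> str:
        """Return the suggestion text only if a reviewer approved it."""
        item = self._items[item_id]
        if item.status != "approved":
            raise PermissionError(f"suggestion {item_id} is {item.status}")
        return item.text

gate = ReviewGate()
first = gate.submit("Hypothetical draft note for clinician review")
gate.review(first, reviewer="clinician_01", approve=True)
print(gate.release(first))
```

The important property is that release fails closed: anything unreviewed or rejected raises an error instead of reaching the record, and the reviewer's identity is stored for audit.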

Failure Modes

  • Hallucinations producing incorrect clinical facts or recommendations.
  • Dataset and algorithmic bias reducing accuracy for underrepresented groups.
  • Integration failures across departments or EHR systems causing incorrect data flow.
  • Privacy leaks from training or inference when handling sensitive records.

Core Entities

Models

  • MedAide
  • MDAgents
  • Agent Hospital
  • Polaris
  • Rx Strategist
  • ColaCare
  • ClinicalLab
  • AI Hospital
  • MedChain
  • AgentClinic

Metrics

  • Accuracy
  • Precision
  • Recall
  • BLEU
  • ROUGE
  • BERTScore
  • LLM-based evaluation (ChatCoach)

Datasets

  • MedQA
  • MedMCQA
  • PubMedQA
  • MMLU
  • MIMIC-III
  • MIMIC-IV
  • MVME

Benchmarks

  • MedChain
  • AI Hospital
  • AgentClinic
  • ClinicalLab
  • MedHallBench
  • HaluEval
  • BiasMedQA
  • AI-SCE
  • RJUA-SPs

Context Entities

Models

  • DeepSeek R1 (reinforcement-learned reasoning)
  • o1 / inference-time scaling examples

Metrics

  • Workflow-based simulation metrics
  • Semantic similarity metrics
  • LLM-based scoring

Datasets

  • BiasMedQA
  • MedHallBench
  • HaluEval

Benchmarks

  • MedChain
  • AgentClinic
  • ClinicalLab