Survey of LLM-based medical agents: architectures, applications, and safety gaps

February 16, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

Survey synthesizes many prototype systems and benchmarks but few large-scale clinical deployments; evidence is mixed and leans toward promising prototypes rather than production-ready solutions.

Citations: 8

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/3

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 60%

Authors

Wenxuan Wang, Zizhan Ma, Zheng Wang, Chenghan Wu, Jiaming Ji, Wenting Chen, Xiang Li, Yixuan Yuan

Why It Matters For Business

LLM agents can cut clinician workload and improve documentation and training, but current models need workflow-style validation, bias checks, and human oversight before clinical deployment.

Summary TLDR

This is a systematic survey of 60 papers (2022–2024) about using large language model (LLM) agents in medicine. It breaks agent systems into profiles, planning, medical reasoning, and external tools. The survey catalogs clinical uses (decision support, documentation, training, operations), reviews benchmarks (static QA, workflow simulations, automated evaluation), and flags hard problems: hallucinations, multimodal fusion, bias, privacy, and system integration. Practical takeaway: LLM agents show promise but are not yet ready for unsupervised clinical use; rigorous, workflow-style evaluation and human oversight are needed.

Problem Statement

LLM-based agents promise to assist with clinical tasks, but medical settings demand more safety, multimodal handling, auditability, and cross-department integration than current LLMs and benchmarks provide. The survey asks: how well do agent architectures, planning, reasoning, evaluation, and deployment practices meet those medical requirements?

Main Contribution

Structured taxonomy of LLM-based medical agent components: profile, clinical planning, medical reasoning, and external capacity.

Catalog of application scenarios and representative systems across diagnosis, documentation, training, and service optimization.

Key Findings

Surveyed literature size and scope.

Numbers: 60 studies reviewed (from ~300 initial hits, 80 shortlisted).

Practical Use: The field is active but early-stage; build pilots and tests, not full replacement deployments.

Evidence Ref: Paper §1 and Appendix A.1

Static QA benchmarks are abundant but limited for clinical workflows.

Numbers: MedMCQA contains 194,000 questions across 2,400 topics.

Practical Use: Don't rely only on QA accuracy for clinical readiness; test agents on sequence-based workflow benchmarks.

Evidence Ref: Paper §5.1
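
To make the gap between static QA and workflow evaluation concrete, here is a minimal Python sketch. The stage names, the `Episode` type, the scoring rule, and the toy data are all illustrative assumptions; they are not the format used by AgentClinic, MedChain, or any other benchmark in the survey. The sketch contrasts final-answer accuracy with a stricter workflow score that credits a case only when every stage was handled correctly.

```python
from dataclasses import dataclass

# Hypothetical stage names; real workflow benchmarks define their own.
STAGES = ("history", "exam", "tests", "diagnosis")


@dataclass
class Episode:
    """One simulated case: per-stage correctness flags, in STAGES order."""
    step_correct: tuple

    @property
    def final_correct(self) -> bool:
        # The last stage plays the role of the static-QA answer.
        return self.step_correct[-1]

    @property
    def workflow_correct(self) -> bool:
        # Credit the episode only when every stage succeeded.
        return all(self.step_correct)


def final_answer_accuracy(episodes):
    return sum(e.final_correct for e in episodes) / len(episodes)


def workflow_accuracy(episodes):
    return sum(e.workflow_correct for e in episodes) / len(episodes)


# Toy data, invented for illustration only.
episodes = [
    Episode((True, True, True, True)),
    Episode((True, False, True, True)),   # right diagnosis, wrong exam step
    Episode((False, True, False, True)),  # right diagnosis, poor work-up
    Episode((True, True, True, False)),
]

qa_score = final_answer_accuracy(episodes)
wf_score = workflow_accuracy(episodes)
print(f"final-answer accuracy: {qa_score:.2f}")
print(f"workflow accuracy:     {wf_score:.2f}")
```

On this toy data the final-answer score is three times the workflow score, which is the kind of gap a QA-only evaluation would hide. All-or-nothing scoring is only one possible choice; partial credit per stage is an equally reasonable design.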

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
--- | --- | --- | --- | --- | --- | ---
Survey coverage | 60 studies reviewed | Initial search ~300 papers | Final inclusion 60 | literature selection 2022–2024 | - | Paper §1 and Appendix A.1
Size of MedMCQA dataset | 194,000 questions across 2,400 topics | - | - | MedMCQA | - | Paper §5.1
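
The survey-coverage figures imply a simple selection funnel. The short Python calculation below works out the rates from the counts reported above; the variable names are illustrative, and because the initial count is given only as "~300", the first two rates are approximate.

```python
# Counts reported in the summary; the initial figure is approximate (~300).
initial_hits = 300
shortlisted = 80
included = 60

shortlist_rate = shortlisted / initial_hits   # share of hits that were shortlisted
inclusion_rate = included / initial_hits      # share of hits in the final set
retention_rate = included / shortlisted       # share of the shortlist that was kept

print(f"shortlisted: {shortlist_rate:.1%}")
print(f"included:    {inclusion_rate:.1%}")
print(f"retained:    {retention_rate:.1%}")
```

Roughly one in five initial hits made it into the final set, and three quarters of the shortlist was retained.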

What To Try In 7 Days

Run small workflow-style tests using AgentClinic or MedChain cases to mimic real tasks.

Audit candidate models on local patient samples for bias and precision.

Prototype a retrieval-augmented pipeline with EHR access and guideline checks.
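
The bias-and-precision audit in the list above could start from something like the following Python sketch. It assumes each local sample has been reduced to a (group, model_flagged, truly_positive) triple; the function names, group labels, and records are hypothetical and invented for illustration. It reports precision per group and the largest gap between groups.

```python
from collections import defaultdict


def precision_by_group(records):
    """Return {group: precision} from (group, predicted, actual) triples.

    Precision = true positives / predicted positives. Groups with no
    positive predictions map to None rather than a misleading 0.0.
    """
    tp = defaultdict(int)
    predicted_pos = defaultdict(int)
    groups = set()
    for group, predicted, actual in records:
        groups.add(group)
        if predicted:
            predicted_pos[group] += 1
            if actual:
                tp[group] += 1
    return {
        g: (tp[g] / predicted_pos[g] if predicted_pos[g] else None)
        for g in sorted(groups)
    }


def max_precision_gap(per_group):
    """Largest difference in precision between any two scored groups."""
    scores = [p for p in per_group.values() if p is not None]
    return max(scores) - min(scores) if scores else 0.0


# Toy records, invented for illustration: (group, model_flagged, truly_positive)
records = [
    ("A", True, True), ("A", True, True), ("A", True, False), ("A", False, False),
    ("B", True, True), ("B", True, False), ("B", True, False), ("B", False, True),
]

per_group = precision_by_group(records)
gap = max_precision_gap(per_group)
print(per_group)
print(f"max precision gap: {gap:.2f}")
```

A real audit would need far larger samples per group and confidence intervals before drawing conclusions; the sketch only shows the bookkeeping.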

Agent Features

Memory
Long-term Memory, Retrieval Memory, Experience Base
Planning
Task Decomposition, Adaptive Planning, Multi-Agent Collaboration, Iterative Self-Evolution
Tool Use
Knowledge Graphs, Medical Calculators, EHR Interfaces, Image Analysis, Retrieval-augmented Generation
Frameworks
ReAct, Chain-of-Thought, Tree-of-Thought, MDAgents, Polaris, Agent Hospital
Is Agentic

Yes

Architectures
Single-Agent, Sequential Task Chain, Collaborative Experts, Iterative Evolution
Collaboration
Multi-Agent Coordination, Agent Communication, Departmental Organization
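
The features above name ReAct-style frameworks and medical calculators as tools. As a rough illustration of how the two fit together, the Python sketch below runs a thought, action, observation loop around a single BMI calculator. Everything here is an assumption made for the example: `scripted_policy` stands in for an LLM call, and the tool registry, action format, and patient values are hypothetical rather than taken from any system in the survey.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / (height_m ** 2)


# Tool registry: the agent may only call tools listed here.
TOOLS = {"bmi": bmi}


def scripted_policy(question, observations):
    """Stand-in for an LLM call (assumption): returns (thought, action).

    An action is either ("call", tool_name, kwargs) or ("finish", answer).
    """
    if not observations:
        return ("I need the BMI before answering.",
                ("call", "bmi", {"weight_kg": 81.0, "height_m": 1.80}))
    value = observations[-1]
    return ("I have the BMI; I can answer.",
            ("finish", f"BMI is {value:.1f}"))


def run_agent(question, policy, max_steps=5):
    """Alternate thought -> action -> observation until the policy finishes."""
    observations, trace = [], []
    for _ in range(max_steps):
        thought, action = policy(question, observations)
        trace.append(thought)
        if action[0] == "finish":
            return action[1], trace
        _, tool_name, kwargs = action
        observations.append(TOOLS[tool_name](**kwargs))
    raise RuntimeError("agent did not finish within max_steps")


answer, trace = run_agent("What is the patient's BMI?", scripted_policy)
print(answer)
for step in trace:
    print("thought:", step)
```

Delegating arithmetic to a deterministic tool, rather than letting the model compute it in free text, is one way to keep calculated values out of the hallucination failure mode. The step cap keeps a misbehaving policy from looping indefinitely.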

Optimization Features

System Optimization
Federated learning for cross-site adaptability
Training Optimization
RL, Simulation-driven iterative training (Agent Hospital)
Inference Optimization
Inference-time scaling to allow longer reasoning

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Covers literature up to early 2024; rapid developments after that may not be included.

Search focused on English-language publications in major databases.

When Not To Use

Do not deploy agent outputs as autonomous clinical decisions without human oversight.

Avoid using current agents for high-stakes care where errors risk patient harm.
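
One way to enforce the two rules above in software is a routing gate that never lets an agent output act on its own. The Python sketch below is a minimal illustration under stated assumptions: the `AgentOutput` fields, the 0.9 threshold, and the queue names are hypothetical, and a model's self-reported confidence is not assumed to be calibrated.

```python
from dataclasses import dataclass


@dataclass
class AgentOutput:
    text: str
    confidence: float   # hypothetical self-reported score in [0, 1]
    high_stakes: bool   # set by the calling workflow, not by the agent


def route(output: AgentOutput, threshold: float = 0.9) -> str:
    """Decide who sees the output next; nothing is auto-applied to care.

    - high-stakes outputs always go to a clinician, regardless of confidence
    - low-confidence outputs go to a clinician
    - the rest are shown as drafts that still need clinician sign-off
    """
    if output.high_stakes or output.confidence < threshold:
        return "clinician_review"
    return "draft_for_signoff"


note = AgentOutput("Draft discharge summary", confidence=0.95, high_stakes=False)
dose = AgentOutput("Suggested dose change", confidence=0.99, high_stakes=True)
vague = AgentOutput("Possible diagnosis", confidence=0.40, high_stakes=False)

for item in (note, dose, vague):
    print(item.text, "->", route(item))
```

Both branches end with a human: the gate only decides how much scrutiny an output gets, never whether it gets any.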

Failure Modes

Hallucinations producing incorrect clinical facts or recommendations.

Dataset and algorithmic bias reducing accuracy for underrepresented groups.

Core Entities

Models

MedAide, MDAgents, Agent Hospital, Polaris, Rx Strategist, ColaCare, ClinicalLab, AI Hospital, MedChain, AgentClinic

Metrics

Accuracy, Precision, Recall, BLEU, ROUGE, BertScore, LLM-based evaluation (ChatCoach)

Datasets

MedQA, MedMCQA, PubMedQA, MMLU, MIMIC-III, MIMIC-IV, MVME

Benchmarks

MedChain, AI Hospital, AgentClinic, ClinicalLab, MedHallBench, HaluEval, BiasMedQA, AI-SCE, RJUA-SPs

Context Entities

Models

DeepSeek R1 (reinforcement-learned reasoning), O1 / inference-time scaling examples

Metrics

Workflow-based simulation metrics, Semantic similarity metrics, LLM-based scoring

Datasets

BiasMedQA, MedHallBench, HaluEval

Benchmarks

MedChain, AgentClinic, ClinicalLab