Survey of LLM-based medical agents: architectures, applications, and safety gaps

Overview

Decision Snapshot: Needs Validation

The survey synthesizes many prototype systems and benchmarks but few large-scale clinical deployments; the evidence is mixed and points to promising prototypes rather than production-ready solutions.

Citations: 8

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/3

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 60%

Authors

Wenxuan Wang, Zizhan Ma, Zheng Wang, Chenghan Wu, Jiaming Ji, Wenting Chen, Xiang Li, Yixuan Yuan

Why It Matters For Business

LLM agents can cut clinician workload and improve documentation and training, but current models need workflow-style validation, bias checks, and human oversight before clinical deployment.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, CEO, Founder

Summary TLDR

This is a systematic survey of 60 papers (2022–2024) about using large language model (LLM) agents in medicine. It breaks agent systems into profiles, planning, medical reasoning, and external tools. The survey catalogs clinical uses (decision support, documentation, training, operations), reviews benchmarks (static QA, workflow simulations, automated evaluation), and flags hard problems: hallucinations, multimodal fusion, bias, privacy, and system integration. Practical takeaway: LLM agents show promise but are not yet ready for unsupervised clinical use; rigorous, workflow-style evaluation and human oversight are needed.

Problem Statement

LLM-based agents promise to assist with clinical tasks, but medical settings demand more safety, multimodal handling, auditability, and cross-department integration than current LLMs and benchmarks provide. The survey asks: how well do agent architectures, planning, reasoning, evaluation, and deployment practices meet those medical requirements?

Main Contribution

Structured taxonomy of LLM-based medical agent components: profile, clinical planning, medical reasoning, and external capacity.

Catalog of application scenarios and representative systems across diagnosis, documentation, training, and service optimization.

Key Findings

Surveyed literature size and scope.

Numbers: 60 studies reviewed (from ~300 initial hits, 80 shortlisted).

Practical Use: The field is active but early-stage; build pilots and tests, not full replacement deployments.

Evidence Ref: Paper §1 and Appendix A.1

Static QA benchmarks are abundant but limited for clinical workflows.

Numbers: MedMCQA contains 194,000 questions across 2,400 topics.

Practical Use: Don't rely only on QA accuracy for clinical readiness; test agents on sequence-based workflow benchmarks.

Evidence Ref: Paper §5.1
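
The gap between static QA accuracy and workflow readiness can be illustrated with a minimal sketch. The case structure, step names, and scoring rule here are hypothetical and are not taken from MedChain, AgentClinic, or any other benchmark in the survey; they only show why per-item accuracy can overstate how well an agent completes a multi-step task.

```python
from dataclasses import dataclass


@dataclass
class WorkflowCase:
    """Hypothetical multi-step case: each step has one expected action."""
    expected_steps: list[str]


def qa_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Static QA score: fraction of independent items answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def workflow_completion(case: WorkflowCase, agent_steps: list[str]) -> float:
    """Sequence score: credit only the prefix of steps done correctly in order.

    An early mistake forfeits the remaining steps, on the assumption that
    errors propagate through a clinical workflow.
    """
    done = 0
    for expected, actual in zip(case.expected_steps, agent_steps):
        if expected != actual:
            break
        done += 1
    return done / len(case.expected_steps)


# Made-up case: the agent orders imaging where labs were expected.
case = WorkflowCase(["triage", "order_labs", "interpret_labs", "diagnose"])
agent = ["triage", "order_imaging", "interpret_labs", "diagnose"]

per_step = qa_accuracy(agent, case.expected_steps)  # 3 of 4 steps match
in_order = workflow_completion(case, agent)         # only the first step counts
print(per_step, in_order)
```

Under these assumptions the same agent trace scores 0.75 when each step is graded independently and 0.25 when steps must be correct in sequence.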

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Survey coverage	60 studies reviewed	Initial search ~300 papers	Final inclusion 60	literature selection 2022–2024	Paper §1 and Appendix A.1	—
Size of MedMCQA dataset	194,000 questions across 2,400 topics	—	—	MedMCQA	Paper §5.1	—

What To Try In 7 Days

Run small workflow-style tests using AgentClinic or MedChain cases to mimic real tasks.

Audit candidate models on local patient samples for bias and precision.

Prototype a retrieval-augmented pipeline with EHR access and guideline checks.
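
The bias audit in the list above could start from something as small as per-group precision. This is a sketch under stated assumptions: the record layout `(group, predicted_positive, actually_positive)`, the group labels, and the sample values are all invented for illustration and are not real patient data or a method described in the survey.

```python
from collections import defaultdict


def precision_by_group(records):
    """Return {group: precision} for binary predictions.

    Each record is (group, predicted_positive, actually_positive).
    Groups with no positive predictions map to None (precision undefined).
    """
    tp = defaultdict(int)
    fp = defaultdict(int)
    groups = set()
    for group, pred, truth in records:
        groups.add(group)
        if pred and truth:
            tp[group] += 1
        elif pred and not truth:
            fp[group] += 1
    result = {}
    for g in sorted(groups):
        flagged = tp[g] + fp[g]
        result[g] = tp[g] / flagged if flagged else None
    return result


# Made-up records for illustration only.
sample = [
    ("group_a", True, True),
    ("group_a", True, True),
    ("group_a", True, False),
    ("group_a", False, False),
    ("group_b", True, True),
    ("group_b", True, False),
    ("group_b", False, True),
]
audit = precision_by_group(sample)
defined = [v for v in audit.values() if v is not None]
gap = max(defined) - min(defined)  # spread between best and worst group
print(audit, gap)
```

A large gap between groups is a signal to investigate further, not proof of bias on its own; small local samples make per-group estimates noisy.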

Agent Features

Memory

Long-term Memory, Retrieval Memory, Experience Base

Planning

Task Decomposition, Adaptive Planning, Multi-Agent Collaboration, Iterative Self-Evolution

Tool Use

Knowledge Graphs, Medical Calculators, EHR Interfaces, Image Analysis, Retrieval-augmented Generation

Frameworks

ReAct, Chain-of-Thought, Tree-of-Thought, MDAgents, Polaris, Agent Hospital

Is Agentic

Yes

Architectures

Single-Agent, Sequential Task Chain, Collaborative Experts, Iterative Evolution

Collaboration

Multi-Agent Coordination, Agent Communication, Departmental Organization

Optimization Features

System Optimization

Federated learning for cross-site adaptability

Training Optimization

RL, Simulation-driven iterative training (Agent Hospital)

Inference Optimization

Inference-time scaling to allow longer reasoning

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Covers literature up to early 2024; rapid developments after that may not be included.

Search focused on English-language publications in major databases.

When Not To Use

Do not deploy agent outputs as autonomous clinical decisions without human oversight.

Avoid using current agents for high-stakes care where errors risk patient harm.
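
One way to make the two rules above concrete is a routing gate that never auto-applies an agent's output. This is a minimal sketch, not a design from the survey: the `AgentOutput` fields, the route names, and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AgentOutput:
    text: str
    confidence: float  # hypothetical self-reported score in [0, 1]
    high_stakes: bool  # hypothetical flag set by the calling workflow


def route(output: AgentOutput, min_confidence: float = 0.9) -> str:
    """Decide who handles the output; no branch applies it automatically."""
    if output.high_stakes:
        return "clinician_only"    # agent output is kept out of the decision
    if output.confidence < min_confidence:
        return "clinician_review"  # flagged as low confidence
    return "clinician_signoff"     # still needs explicit human approval


print(route(AgentOutput("Draft discharge note", 0.95, high_stakes=False)))
print(route(AgentOutput("Dose adjustment", 0.99, high_stakes=True)))
```

Every branch ends with a clinician; the gate only changes how much scrutiny the output receives.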

Failure Modes

Hallucinations producing incorrect clinical facts or recommendations.

Dataset and algorithmic bias reducing accuracy for underrepresented groups.

Core Entities

Models

MedAide, MDAgents, Agent Hospital, Polaris, Rx Strategist, ColaCare, ClinicalLab, AI Hospital, MedChain, AgentClinic

Metrics

Accuracy, Precision, Recall, BLEU, ROUGE, BERTScore, LLM-based evaluation (ChatCoach)

Datasets

MedQA, MedMCQA, PubMedQA, MMLU, MIMIC-III, MIMIC-IV, MVME

Benchmarks

MedChain, AI Hospital, AgentClinic, ClinicalLab, MedHallBench, HaluEval, BiasMedQA, AI-SCE, RJUA-SPs

Context Entities

Models

DeepSeek R1 (reinforcement-learned reasoning), O1 / inference-time scaling examples

Metrics

Workflow-based simulation metrics, Semantic similarity metrics, LLM-based scoring

Datasets

BiasMedQA, MedHallBench, HaluEval

Benchmarks

MedChain, AgentClinic, ClinicalLab
