Overview
The survey synthesizes many prototype systems and benchmarks but few large-scale clinical deployments; the evidence is mixed and favors promising prototypes over production-ready solutions.
Citations: 8
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/3
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
LLM agents can cut clinician workload and improve documentation and training, but current models need workflow-style validation, bias checks, and human oversight before clinical deployment.
Summary TLDR
This is a systematic survey of 60 papers (2022–2024) on large language model (LLM) agents in medicine. It breaks agent systems into four components: profile, clinical planning, medical reasoning, and external capacity (tools). The survey catalogs clinical uses (decision support, documentation, training, operations), reviews benchmarks (static QA, workflow simulations, automated evaluation), and flags hard problems: hallucinations, multimodal fusion, bias, privacy, and system integration. Practical takeaway: LLM agents show promise but are not yet ready for unsupervised clinical use; rigorous, workflow-style evaluation and human oversight are needed.
Problem Statement
LLM-based agents promise to assist with clinical tasks, but medical settings demand higher safety, multimodal handling, auditability, and cross-department integration than current LLMs and benchmarks provide. The survey asks: how well do agent architectures, planning, reasoning, evaluation, and deployment practices meet those medical requirements?
Main Contribution
Structured taxonomy of LLM-based medical agent components: profile, clinical planning, medical reasoning, and external capacity.
Catalog of application scenarios and representative systems across diagnosis, documentation, training, and service optimization.
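To make the four-part taxonomy concrete, here is a minimal sketch of how profile, clinical planning, medical reasoning, and external capacity might be wired together. Every name in it (MedicalAgent, demo_agent, guideline_lookup) is a hypothetical illustration and not an interface defined by the survey; the lambdas are toy stand-ins for model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MedicalAgent:
    """Hypothetical container mirroring the survey's four components."""
    profile: str                                   # role description
    plan: Callable[[str], List[str]]               # clinical planning: task -> ordered steps
    reason: Callable[[str, Dict[str, str]], str]   # medical reasoning over a step plus evidence
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)  # external capacity

    def run(self, task: str) -> List[str]:
        notes = []
        for step in self.plan(task):
            # Query every registered tool for this step and key the evidence by tool name.
            evidence = {name: tool(step) for name, tool in self.tools.items()}
            notes.append(self.reason(step, evidence))
        return notes


def demo_agent() -> MedicalAgent:
    # Toy stand-ins only; no real clinical logic or model call is involved.
    return MedicalAgent(
        profile="documentation assistant",
        plan=lambda task: [f"collect history for {task}", f"draft note for {task}"],
        reason=lambda step, ev: f"{step} | evidence: {sorted(ev)}",
        tools={"guideline_lookup": lambda query: "no guideline matched"},
    )
```

Calling demo_agent().run("chest pain") returns one note per planned step. Keeping the four components as separate fields is a design assumption that makes it easy to swap a planner or tool set when testing.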
Key Findings
Surveyed literature size and scope.
Static QA benchmarks are abundant but limited for clinical workflows.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Survey coverage | 60 studies reviewed | Initial search ~300 papers | Final inclusion 60 | literature selection 2022–2024 | Paper §1 and Appendix A.1 | — |
| Size of MedMCQA dataset | 194,000 questions across 2,400 topics | — | — | MedMCQA | Paper §5.1 | — |
What To Try In 7 Days
Run small workflow-style tests using AgentClinic or MedChain cases to mimic real tasks.
Audit candidate models on local patient samples for bias and precision.
Prototype a retrieval-augmented pipeline with EHR access and guideline checks.
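The bias and precision audit suggested above could start as a per-subgroup precision table. The sketch below is one possible shape for it, assuming each local sample has been reduced to a (subgroup, predicted_positive, actually_positive) triple; the record layout and the function names precision_by_group and precision_gap are assumptions for illustration, not something the survey prescribes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Assumed record layout: (subgroup, predicted_positive, actually_positive)
Record = Tuple[str, bool, bool]


def precision_by_group(records: Iterable[Record]) -> Dict[str, Optional[float]]:
    """Precision per subgroup; None when the model flagged nobody in that group."""
    true_pos = defaultdict(int)
    false_pos = defaultdict(int)
    groups = set()
    for group, predicted, actual in records:
        groups.add(group)
        if predicted and actual:
            true_pos[group] += 1
        elif predicted and not actual:
            false_pos[group] += 1
    result: Dict[str, Optional[float]] = {}
    for group in sorted(groups):
        flagged = true_pos[group] + false_pos[group]
        result[group] = true_pos[group] / flagged if flagged else None
    return result


def precision_gap(per_group: Dict[str, Optional[float]]) -> float:
    """Largest difference in precision between subgroups that have a defined value."""
    values = [v for v in per_group.values() if v is not None]
    return max(values) - min(values) if values else 0.0
```

A large gap between subgroups is a signal to investigate further, not a verdict; small local samples will make these estimates noisy.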
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Optimization Features
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Covers literature up to early 2024; rapid developments after that may not be included.
Search focused on English-language publications in major databases.
When Not To Use
Do not deploy agent outputs as autonomous clinical decisions without human oversight.
Avoid using current agents for high-stakes care where errors risk patient harm.
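The two rules above can be expressed as a routing gate that has no auto-approve path. This is a minimal sketch under stated assumptions: the high_stakes flag is a hypothetical field set upstream (for example by task type), and Route, AgentOutput, and route_output are illustrative names, not part of any surveyed system.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    # Deliberately no "auto-approve" member: every output is reviewed or blocked.
    CLINICIAN_REVIEW = "clinician_review"
    BLOCKED = "blocked"


@dataclass(frozen=True)
class AgentOutput:
    text: str
    high_stakes: bool  # hypothetical flag assigned upstream, e.g. by task type


def route_output(output: AgentOutput) -> Route:
    """Block high-stakes outputs; send everything else to a clinician."""
    if output.high_stakes:
        return Route.BLOCKED
    return Route.CLINICIAN_REVIEW
```

Encoding the policy in the type, so that no route bypasses a human, is a design choice that makes the oversight requirement hard to skip by accident.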
Failure Modes
Hallucinations producing incorrect clinical facts or recommendations.
Dataset and algorithmic bias reducing accuracy for underrepresented groups.

