Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
8
Why It Matters For Business
LLM agents can cut clinician workload and improve documentation and training, but current models need workflow-style validation, bias checks, and human oversight before clinical deployment.
Summary TLDR
This is a systematic survey of 60 papers (2022–2024) about using large language model (LLM) agents in medicine. It breaks agent systems into profiles, planning, medical reasoning, and external tools. The survey catalogs clinical uses (decision support, documentation, training, operations), reviews benchmarks (static QA, workflow simulations, automated evaluation), and flags hard problems: hallucinations, multimodal fusion, bias, privacy, and system integration. Practical takeaway: LLM agents show promise but are not yet ready for unsupervised clinical use; rigorous, workflow-style evaluation and human oversight are needed.
Problem Statement
LLM-based agents promise to assist with clinical tasks, but medical settings demand higher safety, multimodal handling, auditability, and cross-department integration than current LLMs and benchmarks provide. The survey asks: how well do current agent architectures, planning, reasoning, evaluation, and deployment practices meet those medical requirements?
Main Contribution
Structured taxonomy of LLM-based medical agent components: profile, clinical planning, medical reasoning, and external capacity.
Catalog of application scenarios and representative systems across diagnosis, documentation, training, and service optimization.
Review of evaluation approaches and benchmarks, highlighting gaps between static QA tests and workflow-style clinical evaluation.
Identification of key technical and ethical challenges and concrete future directions (hallucination control, multimodal fusion, real-time correction, integration with physical systems).
Key Findings
The survey covers 60 papers published between 2022 and 2024.
Static QA benchmarks are abundant but limited for clinical workflows.
Workflow benchmarks provide richer clinical tests.
Current medical LLMs show bias and poor precision.
Results
Survey coverage: 60 papers (2022–2024)
Size of MedMCQA dataset
BiasMedQA precision range reported
Who Should Care
What To Try In 7 Days
Run small workflow-style tests using AgentClinic or MedChain cases to mimic real tasks.
Audit candidate models on local patient samples for bias and precision.
Prototype a retrieval-augmented pipeline with EHR access and guideline checks.
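The audit step above can be sketched as a per-subgroup precision and recall check. This is a minimal sketch, not a method from the survey: the record layout `(group, y_true, y_pred)`, the function name `audit_by_group`, and the sample data are all hypothetical, and a real audit would use de-identified local cases with clinician-verified labels.

```python
from collections import defaultdict

def audit_by_group(records):
    """Compute precision and recall per subgroup.

    records: iterable of (group, y_true, y_pred) with boolean labels.
    This layout is an assumption made for illustration only.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for group, y_true, y_pred in records:
        c = counts[group]  # creates the entry even for true negatives
        if y_pred and y_true:
            c["tp"] += 1
        elif y_pred and not y_true:
            c["fp"] += 1
        elif y_true and not y_pred:
            c["fn"] += 1

    report = {}
    for group, c in counts.items():
        predicted_pos = c["tp"] + c["fp"]
        actual_pos = c["tp"] + c["fn"]
        report[group] = {
            # None marks an undefined ratio (no positives to divide by)
            "precision": c["tp"] / predicted_pos if predicted_pos else None,
            "recall": c["tp"] / actual_pos if actual_pos else None,
        }
    return report

# Made-up sample: two subgroups with different error patterns.
sample = [
    ("group_a", True, True),
    ("group_a", False, True),
    ("group_a", True, False),
    ("group_b", True, True),
    ("group_b", True, True),
    ("group_b", False, False),
]
report = audit_by_group(sample)
for group, scores in sorted(report.items()):
    print(group, scores)
```

A large gap between subgroups on either metric is a signal to investigate before going further, not a verdict on its own; small samples make these ratios noisy.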
Agent Features
Memory
- Long-term Memory
- Retrieval Memory
- Experience Base
Planning
- Task Decomposition
- Adaptive Planning
- Multi-Agent Collaboration
- Iterative Self-Evolution
Tool Use
- Knowledge Graphs
- Medical Calculators
- EHR Interfaces
- Image Analysis
- Retrieval-augmented Generation
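As an illustration of the tool-use idea, an agent can hand arithmetic to a deterministic calculator instead of generating the number itself. The sketch below is an assumption-laden example, not an interface from any surveyed system: the `TOOLS` registry and `call_tool` dispatcher are hypothetical names, and body mass index (weight in kilograms divided by height in metres squared) stands in for a medical calculator.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2

# Hypothetical registry mapping tool names to plain Python callables.
TOOLS = {"bmi": bmi}

def call_tool(name: str, **kwargs):
    """Dispatch a tool request by name; unknown names fail loudly."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)

# The agent would emit a structured request like this one.
result = call_tool("bmi", weight_kg=70.0, height_m=1.75)
print(round(result, 1))
```

Keeping calculators as ordinary functions makes each one testable in isolation and keeps the numeric result out of the model's free-text generation.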
Frameworks
- ReAct
- Chain-of-Thought
- Tree-of-Thought
- MDAgents
- Polaris
- Agent Hospital
Is Agentic
true
Architectures
- Single-Agent
- Sequential Task Chain
- Collaborative Experts
- Iterative Evolution
Collaboration
- Multi-Agent Coordination
- Agent Communication
- Departmental Organization
Optimization Features
System Optimization
- Federated learning for cross-site adaptability
Training Optimization
- Reinforcement learning (RL)
- Simulation-driven iterative training (Agent Hospital)
Inference Optimization
- Inference-time scaling to allow longer reasoning
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Covers literature up to early 2024; rapid developments after that may not be included.
- Search focused on English-language publications in major databases.
- Survey paper: no new empirical experiments or unified benchmark comparisons presented.
When Not To Use
- Do not deploy agent outputs as autonomous clinical decisions without human oversight.
- Avoid using current agents for high-stakes care where errors risk patient harm.
- Do not rely solely on static QA scores to assess clinical readiness.
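The human-oversight rule above can be enforced in code rather than left to convention. The following is a minimal sketch under stated assumptions: the `Draft` type, the `approved_by` field, and the `sign_off` and `release` functions are hypothetical names for illustration, and a real deployment would also need authentication and an audit trail.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Draft:
    """An agent-generated output awaiting review (hypothetical type)."""
    text: str
    approved_by: Optional[str] = None  # clinician identifier, if signed off

def sign_off(draft: Draft, clinician_id: str) -> Draft:
    """Return a copy of the draft marked as approved by a clinician."""
    if not clinician_id:
        raise ValueError("clinician_id is required")
    return Draft(text=draft.text, approved_by=clinician_id)

def release(draft: Draft) -> str:
    """Release the text only if a clinician has signed off."""
    if not draft.approved_by:
        raise PermissionError("agent output requires clinician sign-off")
    return draft.text

draft = Draft(text="Suggested discharge summary (agent-generated).")
approved = sign_off(draft, clinician_id="clinician-001")
print(release(approved))
```

Because `release` raises instead of returning unapproved text, a caller cannot skip review by accident; the unsigned draft stays unchanged for the record.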
Failure Modes
- Hallucinations producing incorrect clinical facts or recommendations.
- Dataset and algorithmic bias reducing accuracy for underrepresented groups.
- Integration failures across departments or EHR systems causing incorrect data flow.
- Privacy leaks from training or inference when handling sensitive records.
Core Entities
Models
- MedAide
- MDAgents
- Agent Hospital
- Polaris
- Rx Strategist
- ColaCare
- ClinicalLab
- AI Hospital
- MedChain
- AgentClinic
Metrics
- Accuracy
- Precision
- Recall
- BLEU
- ROUGE
- BERTScore
- LLM-based evaluation (ChatCoach)
Datasets
- MedQA
- MedMCQA
- PubMedQA
- MMLU
- MIMIC-III
- MIMIC-IV
- MVME
Benchmarks
- MedChain
- AI Hospital
- AgentClinic
- ClinicalLab
- MedHallBench
- HaluEval
- BiasMedQA
- AI-SCE
- RJUA-SPs
Context Entities
Models
- DeepSeek R1 (reinforcement-learned reasoning)
- OpenAI o1 (inference-time scaling example)
Metrics
- Workflow-based simulation metrics
- Semantic similarity metrics
- LLM-based scoring
Datasets
- BiasMedQA
- MedHallBench
- HaluEval
Benchmarks
- MedChain
- AgentClinic
- ClinicalLab

