Overview
Production Readiness
0.3
Novelty Score
0.7
Cost Impact Score
0.4
Citation Count
18
Why It Matters For Business
Static medical QA overstates real-world performance. Interactive, multimodal tests reveal gaps in data gathering, tool use, and bias handling that directly affect safety and product trust.
Summary TLDR
AgentClinic is an open-source benchmark that turns static medical QA into interactive clinical simulations. It runs doctor, patient, measurement, and moderator agents to test multimodal dialogue, image understanding, and tool use (retrieval, notebook memory, chain-of-thought) across specialist cases and seven languages. Strong models such as Claude-3.5 beat others on average, but accuracy drops relative to static QA, tools help unevenly across models, and both bias and limited interaction time materially harm diagnostic outcomes and patient trust. The platform is designed for evaluation and red-teaming, not deployment.
Problem Statement
Standard medical benchmarks give all facts up front and test single-turn QA. Real clinical work is sequential: you ask questions, order tests, read images, and manage bias and trust. We need an evaluation that simulates those steps, supports multimodal inputs, tool use, and multilingual and specialist cases, and measures patient-centered outcomes.
Main Contribution
AgentClinic: an open-source interactive benchmark that runs doctor, patient, measurement and moderator language agents to simulate clinical workflows.
Benchmark suite: 120 multimodal NEJM cases, 215 MedQA-derived OSCE cases, 200 MIMIC-IV-based cases, 260 specialist cases, and 749 multilingual cases.
Tool and bias experiments: integrated tools (adaptive retrieval, notebook memory, CoT variants) and 23 bias scenarios to measure diagnostic accuracy and patient-perception metrics.
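The four-agent workflow described above can be sketched as a simple turn loop. This is a minimal illustration, not AgentClinic's actual code: the `Case` fields, the function names, the keyword-matching patient, the fallback reply strings, and the `max_turns` default are all assumptions, and plain Python functions stand in for the LLM-backed agents.

```python
from dataclasses import dataclass

@dataclass
class Case:
    symptoms: dict       # facts the patient agent may reveal when asked
    test_results: dict   # readings the measurement agent may return
    diagnosis: str       # ground truth, visible only to the moderator

def patient_agent(case, question):
    # Hypothetical stand-in: reveal a symptom only if the question mentions it.
    for key, answer in case.symptoms.items():
        if key in question.lower():
            return answer
    return "I'm not sure."

def measurement_agent(case, test_name):
    # Hypothetical stand-in: return the stored reading for a requested test.
    return case.test_results.get(test_name.lower(), "No result available.")

def moderator_agent(case, proposed):
    # Hypothetical stand-in: exact string match instead of an LLM judge.
    return proposed.strip().lower() == case.diagnosis.lower()

def run_episode(case, doctor_policy, max_turns=20):
    """Let the doctor ask, test, or diagnose until the turn budget runs out."""
    history = []
    for _ in range(max_turns):
        action, payload = doctor_policy(history)
        if action == "ask":
            reply = patient_agent(case, payload)
        elif action == "test":
            reply = measurement_agent(case, payload)
        elif action == "diagnose":
            return moderator_agent(case, payload)
        else:
            raise ValueError(f"unknown action: {action}")
        history.append((action, payload, reply))
    return False  # no diagnosis given within the turn budget

def scripted_doctor(history):
    # Toy policy: one question, one test, then a diagnosis.
    script = [
        ("ask", "Do you have a fever?"),
        ("test", "chest x-ray"),
        ("diagnose", "Pneumonia"),
    ]
    return script[min(len(history), len(script) - 1)]

toy_case = Case(
    symptoms={"fever": "Yes, for three days."},
    test_results={"chest x-ray": "Right lower lobe consolidation"},
    diagnosis="pneumonia",
)
full_result = run_episode(toy_case, scripted_doctor)
short_result = run_episode(toy_case, scripted_doctor, max_turns=2)
```

With the full budget the scripted doctor reaches a diagnosis; with `max_turns=2` it runs out of turns first, which is the kind of interaction-limit effect the benchmark is meant to expose.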
Key Findings
Interactive, sequential format is harder than static QA.
Claude-3.5 outperformed most other models on AgentClinic-MedQA.
Tool effects vary strongly across models; a persistent notebook substantially helps some models.
Bias reduces accuracy and strongly affects patient perceptions.
Less interaction time markedly harms diagnosis.
Results
Accuracy
Who Should Care
What To Try In 7 Days
Run your model on a few AgentClinic cases to compare static vs interactive accuracy.
Measure model gains from notebook memory and adaptive retrieval per model.
Inject common cognitive/implicit biases and record changes in accuracy and simulated patient trust.
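One way to organize these three experiments is to log one record per case and condition, then compare each condition against a baseline for the same model. The sketch below assumes a hypothetical record layout (`model`, `condition`, `correct`) and hypothetical condition names; the toy records are placeholders for illustration only, not benchmark results.

```python
from collections import defaultdict

def accuracy(outcomes):
    # Fraction of correct diagnoses; 0.0 for an empty list.
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def summarize(records, baseline="baseline"):
    """Return per-(model, condition) accuracy and deltas versus the baseline."""
    grouped = defaultdict(list)
    for r in records:
        grouped[(r["model"], r["condition"])].append(r["correct"])
    acc = {key: accuracy(vals) for key, vals in grouped.items()}
    deltas = {}
    for (model, condition), value in acc.items():
        if condition != baseline and (model, baseline) in acc:
            deltas[(model, condition)] = value - acc[(model, baseline)]
    return acc, deltas

def make_records(model, condition, outcomes):
    return [{"model": model, "condition": condition, "correct": c} for c in outcomes]

# Placeholder outcomes for a hypothetical model, NOT real results.
records = (
    make_records("model-a", "baseline", [True, False, False, False])
    + make_records("model-a", "notebook", [True, True, False, False])
    + make_records("model-a", "biased", [False, False, False, False])
)
acc, deltas = summarize(records)
for key in sorted(deltas):
    print(key, f"{deltas[key]:+.2f}")
```

Keeping the comparison per model matters because tool effects differ across models, so a pooled average could hide a tool that helps one model and hurts another.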
Agent Features
Memory
- notebook (persistent experiential memory)
Planning
- sequential decision making (ask/order/diagnose)
Tool Use
- adaptive retrieval
- notebook memory
- Chain-of-Thought
- web/textbook search
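As a rough illustration of the notebook tool, the sketch below keeps a bounded list of notes that persists across cases and is prepended to the doctor agent's prompt. The `Notebook` class, its `max_notes` cap, and the prompt wording are assumptions made for this example; the source does not describe how AgentClinic implements its notebook.

```python
class Notebook:
    """Hypothetical persistent memory shared across consecutive cases."""

    def __init__(self, max_notes=10):
        self.max_notes = max_notes
        self.notes = []

    def write(self, note):
        # Append the note and keep only the most recent max_notes entries.
        self.notes.append(note)
        self.notes = self.notes[-self.max_notes:]

    def as_prompt(self):
        # Render the notes as a text block, or an empty string if none exist.
        if not self.notes:
            return ""
        lines = "\n".join(f"- {n}" for n in self.notes)
        return "Notes from earlier cases:\n" + lines

def build_doctor_prompt(notebook, case_intro):
    # Prepend remembered notes so they carry over into the next case.
    prefix = notebook.as_prompt()
    return f"{prefix}\n\n{case_intro}" if prefix else case_intro

notebook = Notebook(max_notes=2)
notebook.write("Ask about medication history early.")
notebook.write("Order imaging before committing to a diagnosis.")
notebook.write("Re-check the initial hypothesis after each test.")
prompt = build_doctor_prompt(notebook, "New patient: 54-year-old with chest pain.")
print(prompt)
```

Capping the number of notes is one simple way to keep the prompt within a context budget; other eviction or summarization strategies would work equally well in this sketch.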
Frameworks
- OSCE template
- agent moderator for evaluation
Is Agentic
true
Architectures
- multi-agent simulation
- OSCE-style agent pipeline
Collaboration
- multi-agent debate
- MedAgents task delegation
Reproducibility
Code Urls
Data Urls
- MIMIC-IV (restricted access)
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Simulated patients and measurement readers are LLM-based and may not match real human behavior.
- The moderator uses an LLM judge, which may introduce bias if the judge favors model-generated text.
- Training data for proprietary models is unknown, so data leakage from benchmark sources is possible.
- Environment omits roles like nurses, relatives, and operational constraints (beds, scheduling).
When Not To Use
- As evidence of clinical safety for live patient care without human oversight.
- For legal or regulatory certification of medical devices.
- To assume real patient trust or compliance; simulated agents are imperfect proxies.
Failure Modes
- Model fixation on initial hypothesis (anchoring) leading to missed diagnoses.
- Hallucinated or incomplete measurement results when instrument agent details are missing.
- Tool misuse degrading performance (some tools reduced accuracy for certain models).
- Cross-model communication mismatches when different LLMs act as patient and doctor.
Core Entities
Models
- Claude-3.5-Sonnet
- GPT-4
- GPT-4o
- GPT-3.5
- Mixtral-8x7B
- Llama-3-70B
- Llama-2-70B-chat
- MedLlama3-8B
- OpenBioLLM-70B
- PMC-Llama-7B
- Meditron-70B
- o1-preview
Metrics
- Accuracy
- patient confidence
- patient compliance
- consultation rating
Datasets
- AgentClinic-MedQA
- AgentClinic-MIMIC-IV
- AgentClinic-NEJM
- AgentClinic-Spec
- AgentClinic-Lang
- MedQA
- MIMIC-IV
- NEJM
- MedMCQA
- USMLE
Benchmarks
- AgentClinic

