Overview
AgentClinic offers a useful, realistic stress test for clinical agents. Its results are informative but rest on simulated patients and some proprietary models, so treat them as evaluation signals, not proof of deployment readiness.
Citations: 18
Evidence Strength: 0.70
Confidence: 0.87
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/6
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 30%
Novelty: 70%
Why It Matters For Business
Static medical QA overstates real-world performance. Interactive, multimodal tests reveal gaps in data gathering, tool use, and bias handling that directly affect safety and product trust.
Summary TLDR
AgentClinic is an open-source benchmark that turns static medical QA into interactive clinical simulations. It runs doctor, patient, measurement, and moderator agents to test multimodal dialogue, image understanding, tool use (retrieval, notebook memory, chain-of-thought), specialist cases, and cases in seven languages. Claude-3.5 leads the other models on average, but accuracy drops relative to static QA, tools help unevenly across models, and both injected biases and limited interaction time materially harm diagnostic outcomes and simulated patient trust. The platform is designed for evaluation and red-teaming, not deployment.
Problem Statement
Standard medical benchmarks give all facts up front and test single-turn QA. Real clinical work is sequential: you ask questions, order tests, read images, and manage bias and trust. We need an evaluation that simulates those steps, supports multimodal inputs, tool use, multilingual and specialist cases, and measures patient-centered outcomes.
Main Contribution
AgentClinic: an open-source interactive benchmark that runs doctor, patient, measurement and moderator language agents to simulate clinical workflows.
Benchmark suite: 120 multimodal NEJM cases, 215 MedQA-derived OSCE cases, 200 MIMIC-IV-based cases, 260 specialist cases, and 749 multilingual cases.
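The four-agent setup described above can be sketched as a simple turn loop. This is a minimal illustration, not AgentClinic's actual code: the `Case` fields, the `TEST:` and `DIAGNOSIS:` action prefixes, and `run_case` are all hypothetical names, and the patient, measurement, and moderator agents are replaced by plain lookups and a string match so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    """Hypothetical case record; field names are illustrative only."""
    symptoms: Dict[str, str]      # what the patient stand-in may reveal
    test_results: Dict[str, str]  # what the measurement stand-in may return
    diagnosis: str                # ground truth, seen only by the moderator


# A doctor policy maps the dialogue so far to its next action string.
DoctorPolicy = Callable[[List[str]], str]


def run_case(case: Case, doctor: DoctorPolicy, max_turns: int = 20) -> bool:
    """Run one simulated encounter; return True if the diagnosis matches."""
    history: List[str] = []
    for _ in range(max_turns):
        action = doctor(history)
        history.append(f"doctor: {action}")
        if action.startswith("DIAGNOSIS:"):
            # Moderator stand-in: exact string match instead of an LLM judge.
            guess = action[len("DIAGNOSIS:"):].strip().lower()
            return guess == case.diagnosis.lower()
        if action.startswith("TEST:"):
            # Measurement stand-in: look the test up in the case record.
            name = action[len("TEST:"):].strip().lower()
            reply = case.test_results.get(name, "no result available")
            history.append(f"measurement: {reply}")
        else:
            # Patient stand-in: reveal only symptoms named in the question.
            hits = [v for k, v in case.symptoms.items() if k in action.lower()]
            history.append("patient: " + (" ".join(hits) or "I'm not sure."))
    return False  # ran out of turns without committing to a diagnosis


def scripted_doctor(history: List[str]) -> str:
    """Toy doctor that follows a fixed script (stands in for a model)."""
    script = ["Do you have a cough?", "TEST: chest x-ray", "DIAGNOSIS: pneumonia"]
    turn = sum(1 for line in history if line.startswith("doctor:"))
    return script[min(turn, len(script) - 1)]


case = Case(
    symptoms={"cough": "Yes, for a week."},
    test_results={"chest x-ray": "right lower lobe consolidation"},
    diagnosis="Pneumonia",
)
print(run_case(case, scripted_doctor))
```

The `max_turns` cap stands in for the limited interaction time mentioned in the summary: a doctor policy that never commits to a diagnosis within the budget is scored as incorrect.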
Key Findings
Interactive, sequential format is harder than static QA.
Claude-3.5 outperformed most other models on AgentClinic-MedQA.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 62.1% ± 3.3 | — | — | AgentClinic-MedQA | Figure 2; D.2 | D.2 |
| Accuracy | 51.6% ± 3.3 | — | — | AgentClinic-MedQA | Figure 2; D.2 | D.2 |
What To Try In 7 Days
Run your model on a few AgentClinic cases to compare static vs interactive accuracy.
Measure the per-model accuracy gain from notebook memory and adaptive retrieval.
Inject common cognitive/implicit biases and record changes in accuracy and simulated patient trust.
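For the static-versus-interactive comparison in the first item, a small scoring helper is enough to report an explicit delta. The sketch below assumes you already have one boolean outcome per case for each condition. The function names are hypothetical, the outcome lists are placeholder data, and the binomial standard error is an assumption of this sketch, not necessarily how the paper computes its ± values.

```python
import math
from typing import Sequence, Tuple


def accuracy_with_se(outcomes: Sequence[bool]) -> Tuple[float, float]:
    """Return (accuracy, binomial standard error) for per-case outcomes."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one case")
    p = sum(outcomes) / n
    return p, math.sqrt(p * (1 - p) / n)


def paired_delta(static: Sequence[bool], interactive: Sequence[bool]) -> float:
    """Interactive minus static accuracy on the same cases, in points."""
    if len(static) != len(interactive):
        raise ValueError("both runs must cover the same cases")
    return 100 * (sum(interactive) - sum(static)) / len(static)


# Placeholder outcomes for ten cases (not real results).
static_run = [True] * 8 + [False] * 2
interactive_run = [True] * 5 + [False] * 5

static_acc, static_se = accuracy_with_se(static_run)
inter_acc, inter_se = accuracy_with_se(interactive_run)
delta = paired_delta(static_run, interactive_run)
print(f"static {static_acc:.1%}, interactive {inter_acc:.1%}, delta {delta:+.1f} pts")
```

The same helper can be reused for the other two items by swapping the conditions: tool on versus tool off, or bias injected versus no bias.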
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Risks & Boundaries
Limitations
Simulated patients and measurement readers are LLM-based and may not match real human behavior.
Moderator uses an LLM judge; potential bias if judge favors model-generated text.
When Not To Use
As evidence of clinical safety for live patient care without human oversight.
For legal or regulatory certification of medical devices.
Failure Modes
Model fixation on initial hypothesis (anchoring) leading to missed diagnoses.
Hallucinated or incomplete measurement results when instrument agent details are missing.
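One way to surface the second failure mode in your own harness is to have the measurement step return an explicit sentinel when a requested test is not in the case record, so missing data shows up in transcripts and is not filled in by a model. This is a sketch under that assumption. `read_measurement` and `NOT_AVAILABLE` are hypothetical names, and the source does not describe AgentClinic's measurement agent working this way.

```python
from typing import Dict

# Hypothetical sentinel; any unambiguous marker would do.
NOT_AVAILABLE = "RESULT NOT AVAILABLE"


def read_measurement(case_results: Dict[str, str], requested_test: str) -> str:
    """Return the recorded result, or a sentinel when none exists."""
    key = requested_test.strip().lower()
    # Normalize stored test names so lookups ignore case and padding.
    normalized = {name.strip().lower(): value for name, value in case_results.items()}
    return normalized.get(key, NOT_AVAILABLE)


recorded = {"CBC": "WBC 14.2"}
print(read_measurement(recorded, " cbc "))
print(read_measurement(recorded, "MRI"))
```

Counting how often the sentinel appears per run gives a rough measure of how often the doctor agent requests tests that the case data cannot answer.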

