AgentClinic: interactive, multimodal simulations that stress-test LLMs on realistic clinical decision making

May 13, 2024 | 7 min read

Overview

Decision Snapshot: Needs Validation

AgentClinic offers a useful, realistic stress test for clinical agents; results are informative but use simulated patients and some proprietary models, so treat outcomes as evaluation signals, not deployment proof.

Citations: 18

Evidence Strength: 0.70

Confidence: 0.87

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 70%

Authors

Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, Michael Moor

Why It Matters For Business

Static medical QA overstates real-world performance. Interactive, multimodal tests reveal gaps in data gathering, tool use, and bias handling that directly affect safety and product trust.

Who Should Care

Summary (TL;DR)

AgentClinic is an open-source benchmark that turns static medical QA into interactive clinical simulations. It runs doctor, patient, measurement, and moderator agents to test multimodal dialogue, image understanding, tool use (retrieval, notebook memory, chain-of-thought), specialist cases and seven languages. Strong models (Claude-3.5) beat others on average, but performance drops versus static QA, tools help unevenly, and biases plus limited interaction time materially harm outcomes and patient trust. The platform is designed for evaluation and red‑teaming, not deployment.

Problem Statement

Standard medical benchmarks give all facts up front and test single-turn QA. Real clinical work is sequential: you ask questions, order tests, read images, and manage bias and trust. We need an evaluation that simulates those steps, supports multimodal inputs, tool use, multilingual and specialist cases, and measures patient-centered outcomes.

Main Contribution

AgentClinic: an open-source interactive benchmark that runs doctor, patient, measurement and moderator language agents to simulate clinical workflows.

Benchmark suite: 120 multimodal NEJM cases, 215 MedQA-derived OSCE cases, 200 MIMIC-IV-based cases, 260 specialist cases, and 749 multilingual cases.
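The four-agent workflow described above can be sketched as a simple turn loop. This is a minimal illustration, not AgentClinic's code: the `Case` fields, the `ASK:`/`TEST:`/`DIAGNOSE:` action format, the `max_turns` default, and the scripted doctor are all assumptions made for the example, and the moderator here is a plain string match standing in for the LLM judge the benchmark uses.

```python
from dataclasses import dataclass


@dataclass
class Case:
    # Hypothetical case record; field names are illustrative only.
    symptoms: dict   # keyword in a question -> patient answer
    tests: dict      # test name -> result text
    diagnosis: str   # ground-truth diagnosis


def patient_agent(case, question):
    # Stand-in for the LLM patient: answers only what is asked about.
    for keyword, answer in case.symptoms.items():
        if keyword in question.lower():
            return answer
    return "I'm not sure."


def measurement_agent(case, test):
    # Stand-in for the measurement agent: returns a stored test result.
    return case.tests.get(test.lower(), "no result on file")


def moderator_agent(case, proposed):
    # Assumption: exact string match replaces the LLM judge.
    return proposed.strip().lower() == case.diagnosis.lower()


def scripted_doctor(history):
    # Stand-in for the LLM doctor: follows a fixed three-step plan.
    plan = ["ASK: do you have a fever?", "TEST: chest x-ray", "DIAGNOSE: pneumonia"]
    return plan[min(len(history), len(plan) - 1)]


def run_case(case, doctor, max_turns=10):
    """Run one simulated consultation; return (correct, history)."""
    history = []
    for _ in range(max_turns):
        action = doctor(history)
        kind, _, payload = action.partition(":")
        payload = payload.strip()
        if kind == "ASK":
            reply = patient_agent(case, payload)
        elif kind == "TEST":
            reply = measurement_agent(case, payload)
        elif kind == "DIAGNOSE":
            return moderator_agent(case, payload), history
        else:
            reply = "(unrecognised action)"
        history.append((action, reply))
    return False, history  # ran out of turns without a diagnosis


case = Case(
    symptoms={"fever": "Yes, for three days."},
    tests={"chest x-ray": "Right lower lobe consolidation"},
    diagnosis="Pneumonia",
)
correct, history = run_case(case, scripted_doctor)
```

Lowering `max_turns` is one way to mimic the limited interaction time that the summary says hurts outcomes: the doctor runs out of turns before it can commit to a diagnosis.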

Key Findings

Interactive, sequential format is harder than static QA.

Numbers: Diagnostic accuracy can fall to below a tenth of the static-QA accuracy (paper statement).

Practical Use: Don't trust static MedQA scores to predict dialogue-style clinical performance; test models in interactive settings before clinical use.

Evidence Ref: Abstract; Figure 3; Discussion

Claude-3.5 outperformed most other models on AgentClinic-MedQA.

Numbers: Claude-3.5: 62.1% vs GPT-4: 51.6% vs human physicians: 54% (MedQA configuration using a GPT-4 patient agent).

Practical Use: For agent-style clinical experiments, prefer Claude-3.5 for higher baseline accuracy, but validate on your own dataset and tasks.

Evidence Ref: Figure 2; D.2

Results

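Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy (Claude-3.5) | 62.1% ± 3.3 | not reported | not reported | AgentClinic-MedQA | Figure 2; D.2 | D.2
Accuracy (GPT-4) | 51.6% ± 3.3 | not reported | not reported | AgentClinic-MedQA | Figure 2; D.2 | D.2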
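Since the results above list no explicit deltas, the gaps can be worked out from the reported figures (Claude-3.5 62.1%, GPT-4 51.6%, human physicians 54%). The sketch below does only that arithmetic. The summary does not say what the ± 3.3 denotes, so the overlap check treats it as a symmetric interval half-width purely as an assumption; the function names are illustrative.

```python
# Figures as reported in this summary for AgentClinic-MedQA (percent).
CLAUDE_35 = 62.1
GPT_4 = 51.6
HUMAN = 54.0
HALF_WIDTH = 3.3  # Assumption: "± 3.3" read as a symmetric half-width.


def delta_pp(value, baseline):
    """Difference in percentage points, rounded to one decimal."""
    return round(value - baseline, 1)


def intervals_overlap(a, b, half_width):
    """True if [a - hw, a + hw] and [b - hw, b + hw] share any point."""
    return abs(a - b) <= 2 * half_width


claude_vs_gpt4 = delta_pp(CLAUDE_35, GPT_4)    # 10.5 points
claude_vs_human = delta_pp(CLAUDE_35, HUMAN)   # 8.1 points
gpt4_vs_human = delta_pp(GPT_4, HUMAN)         # -2.4 points
overlap = intervals_overlap(CLAUDE_35, GPT_4, HALF_WIDTH)  # False
```

Under that reading the two model intervals do not overlap, but this is not a significance test; treat it as a quick sanity check on the gap.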

What To Try In 7 Days

Run your model on a few AgentClinic cases to compare static vs interactive accuracy.

Measure model gains from notebook memory and adaptive retrieval per model.

Inject common cognitive/implicit biases and record changes in accuracy and simulated patient trust.
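A small harness along these lines could cover all three experiments. It is a sketch under stated assumptions: the bias sentences in `BIAS_PROMPTS` are invented for illustration (not the paper's wording), the condition names are arbitrary, and the per-case outcomes are made-up toy values that you would replace with results from your own runs.

```python
from typing import Optional

# Hypothetical bias snippets; the wording is illustrative, not from the paper.
BIAS_PROMPTS = {
    "recency": "You recently saw a patient with similar symptoms who had cancer.",
    "confirmation": "You are already confident in your first impression.",
}


def with_bias(system_prompt: str, bias: Optional[str]) -> str:
    """Append a bias snippet to an agent's system prompt (None = unbiased)."""
    if bias is None:
        return system_prompt
    return f"{system_prompt}\n{BIAS_PROMPTS[bias]}"


def accuracy(outcomes: list) -> float:
    """Percent of cases diagnosed correctly."""
    return 100.0 * sum(outcomes) / len(outcomes)


def report(runs: dict, baseline: str) -> dict:
    """Accuracy per condition plus its delta against the baseline condition."""
    base = accuracy(runs[baseline])
    return {
        name: {"accuracy": accuracy(o), "delta_pp": accuracy(o) - base}
        for name, o in runs.items()
    }


# Toy per-case outcomes (True = correct); placeholders, not real results.
runs = {
    "static": [True, True, True, False],
    "interactive": [True, False, True, False],
    "interactive+recency": [True, False, False, False],
}
summary = report(runs, baseline="interactive")
```

The same `report` call works for notebook or retrieval ablations: add one condition per tool setting and keep the plain interactive run as the baseline. Simulated patient trust would need its own metric alongside accuracy.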

Agent Features

Memory: notebook (persistent experiential memory)
Planning: sequential decision making (ask/order/diagnose)
Tool Use: adaptive retrieval, notebook memory, Chain-of-Thought, web/textbook search
Frameworks: OSCE template, agent moderator for evaluation
Is Agentic: Yes
Architectures: multi-agent simulation, OSCE-style agent pipeline
Collaboration: multi-agent debate, MedAgents task delegation
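As an illustration of the notebook memory feature, the sketch below keeps short notes across cases and prepends them to the doctor agent's prompt. The `Notebook` class, the `max_notes` cap, and the prompt wording are all hypothetical; the summary does not describe how AgentClinic stores or limits notes.

```python
class Notebook:
    """Hypothetical persistent notebook shared across simulated cases."""

    def __init__(self, max_notes=5):
        self.max_notes = max_notes  # Assumption: keep only the newest notes.
        self.notes = []

    def write(self, note):
        # Store the note, then drop the oldest entries beyond the cap.
        self.notes.append(note.strip())
        self.notes = self.notes[-self.max_notes:]

    def as_prompt(self):
        # Render the notes as a block the doctor agent can read.
        if not self.notes:
            return ""
        lines = "\n".join(f"- {n}" for n in self.notes)
        return "Notes from earlier cases:\n" + lines


def build_doctor_prompt(base_prompt, notebook):
    """Attach notebook contents to the doctor's system prompt, if any exist."""
    memory = notebook.as_prompt()
    return base_prompt if not memory else f"{base_prompt}\n\n{memory}"


nb = Notebook(max_notes=2)
nb.write("Ask about travel history early.")
nb.write("Order imaging before committing to a diagnosis.")
nb.write("Re-check the first hypothesis against new test results.")
prompt = build_doctor_prompt("You are a doctor.", nb)
```

Capping the notes keeps the prompt short, at the cost of forgetting older lessons; whether that trade-off helps is something to measure per model.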

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Data URLs

MIMIC-IV (restricted access)

Risks & Boundaries

Limitations

Simulated patients and measurement readers are LLM-based and may not match real human behavior.

Moderator uses an LLM judge; potential bias if judge favors model-generated text.

When Not To Use

As evidence of clinical safety for live patient care without human oversight.

For legal or regulatory certification of medical devices.

Failure Modes

Model fixation on initial hypothesis (anchoring) leading to missed diagnoses.

Hallucinated or incomplete measurement results when instrument agent details are missing.

Core Entities

Models

Claude-3.5-Sonnet, GPT-4, GPT-4o, GPT-3.5, Mixtral-8x7B, Llama-3-70B, Llama-2-70B-chat, MedLlama3-8B, OpenBioLLM-70B, PMC-Llama-7B, Meditron-70B, o1-preview

Metrics

Accuracy, patient confidence, patient compliance, consultation rating

Datasets

AgentClinic-MedQA, AgentClinic-MIMIC-IV, AgentClinic-NEJM, AgentClinic-Spec, AgentClinic-Lang, MedQA, MIMIC-IV, NEJM, MedMCQA, USMLE

Benchmarks

AgentClinic