AgentClinic: interactive, multimodal simulations that stress-test LLMs on realistic clinical decision-making

Overview

Decision Snapshot: Needs Validation

AgentClinic offers a useful, realistic stress test for clinical agents; results are informative but use simulated patients and some proprietary models, so treat outcomes as evaluation signals, not deployment proof.

Citations: 18

Evidence Strength: 0.70

Confidence: 0.87

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 70%

Authors

Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, Michael Moor

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Static medical QA overstates real-world performance. Interactive, multimodal tests reveal gaps in data gathering, tool use, and bias handling that directly affect safety and product trust.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

AgentClinic is an open-source benchmark that turns static medical QA into interactive clinical simulations. It runs doctor, patient, measurement, and moderator agents to test multimodal dialogue, image understanding, tool use (retrieval, notebook memory, chain-of-thought), specialist cases and seven languages. Strong models (Claude-3.5) beat others on average, but performance drops versus static QA, tools help unevenly, and biases plus limited interaction time materially harm outcomes and patient trust. The platform is designed for evaluation and red‑teaming, not deployment.

Problem Statement

Standard medical benchmarks give all facts up front and test single-turn QA. Real clinical work is sequential: you ask questions, order tests, read images, and manage bias and trust. We need an evaluation that simulates those steps, supports multimodal inputs, tool use, multilingual and specialist cases, and measures patient-centered outcomes.

Main Contribution

AgentClinic: an open-source interactive benchmark that runs doctor, patient, measurement and moderator language agents to simulate clinical workflows.

Benchmark suite: 120 multimodal NEJM cases, 215 MedQA-derived OSCE cases, 200 MIMIC-IV-based cases, 260 specialist cases, and 749 multilingual cases.
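
The four-agent setup described above can be sketched as a simple turn loop. This is a minimal illustration, not AgentClinic's actual code: `Case`, `run_case`, `scripted_doctor`, and the `REQUEST TEST:` / `DIAGNOSIS:` message prefixes are hypothetical names chosen here, the demo case is invented, and the moderator is reduced to a case-insensitive string match where the benchmark uses an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a doctor agent maps the latest message to its next action.
DoctorAgent = Callable[[str], str]

@dataclass
class Case:
    presentation: str             # what the patient agent may reveal
    test_results: Dict[str, str]  # what the measurement agent may reveal
    diagnosis: str                # ground truth, seen only by the moderator

def run_case(case: Case, doctor: DoctorAgent, max_turns: int = 20) -> Tuple[bool, List[str]]:
    """Run one simulated consultation; return (correct, doctor transcript)."""
    transcript: List[str] = []
    message = "A new patient has arrived."
    for _ in range(max_turns):
        action = doctor(message)
        transcript.append(action)
        if action.startswith("REQUEST TEST:"):
            # Measurement agent: return a stored result, never invent one.
            test = action.split(":", 1)[1].strip()
            message = case.test_results.get(test, f"No result on file for {test}.")
        elif action.startswith("DIAGNOSIS:"):
            # Moderator stand-in: exact match instead of an LLM judge.
            guess = action.split(":", 1)[1].strip()
            return guess.lower() == case.diagnosis.lower(), transcript
        else:
            # Patient stand-in: a real patient agent would answer the question asked.
            message = case.presentation
    return False, transcript  # ran out of turns without a diagnosis

def scripted_doctor(steps: List[str]) -> DoctorAgent:
    """Replay a fixed list of actions, ignoring the incoming message."""
    remaining = iter(steps)
    return lambda _message: next(remaining)

# Invented demo case, for illustration only.
demo_case = Case(
    presentation="Fever and productive cough for three days.",
    test_results={"chest_xray": "Right lower lobe consolidation."},
    diagnosis="Pneumonia",
)
correct, transcript = run_case(
    demo_case,
    scripted_doctor(["What brings you in?", "REQUEST TEST: chest_xray", "DIAGNOSIS: pneumonia"]),
)
```

Capping the loop with `max_turns` mirrors the limited interaction time mentioned in the summary: a doctor agent that never commits to a diagnosis is scored as incorrect.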

Key Findings

Interactive, sequential format is harder than static QA.

Numbers: Diagnostic accuracy can fall to below a tenth of the static-QA baseline accuracy (paper statement).

Practical Use: Don't trust static MedQA scores to predict dialogue-style clinical performance; test models in interactive settings before clinical use.

Evidence Ref: Abstract; Figure 3; Discussion

Claude-3.5 outperformed most other models on AgentClinic-MedQA.

Numbers: Claude-3.5: 62.1% vs GPT-4: 51.6% vs human physicians: 54% (MedQA config using GPT-4 patient agent).

Practical Use: For agent-style clinical experiments, prefer Claude-3.5 for its higher baseline accuracy, but validate on your own dataset and tasks.

Evidence Ref: Figure 2; D.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy (Claude-3.5)	62.1% ± 3.3	—	—	AgentClinic-MedQA	Figure 2; D.2	D.2
Accuracy (GPT-4)	51.6% ± 3.3	—	—	AgentClinic-MedQA	Figure 2; D.2	D.2
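
The summary reports no explicit delta for these results. As a small worked example, the gap between the two accuracies quoted in Key Findings (Claude-3.5 at 62.1%, GPT-4 at 51.6%) can be computed in percentage points. Treating GPT-4 as the baseline is an assumption made here for illustration, and `delta_pp` is a hypothetical helper name.

```python
# Accuracies (in percent) as quoted in Key Findings for AgentClinic-MedQA.
results = {"Claude-3.5": 62.1, "GPT-4": 51.6}

def delta_pp(value: float, baseline: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(value - baseline, 1)

# Assumption: GPT-4 is used as the baseline; the source does not name one.
gap = delta_pp(results["Claude-3.5"], results["GPT-4"])
print(f"Claude-3.5 vs GPT-4: {gap:+.1f} pp")
```

Since each accuracy carries an uncertainty of ± 3.3, the gap should be read alongside those error bars rather than as an exact figure.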

What To Try In 7 Days

Run your model on a few AgentClinic cases to compare static vs interactive accuracy.

Measure model gains from notebook memory and adaptive retrieval per model.

Inject common cognitive/implicit biases and record changes in accuracy and simulated patient trust.
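
The first and third experiments above both reduce to comparing accuracy between two conditions (static vs interactive, or unbiased vs biased). A minimal sketch of that comparison follows; `accuracy` and `compare` are hypothetical helpers, and the outcome lists are placeholders, not results from the paper.

```python
from statistics import mean
from typing import Dict, Sequence

def accuracy(outcomes: Sequence[bool]) -> float:
    """Fraction of cases answered correctly."""
    return mean(1.0 if ok else 0.0 for ok in outcomes)

def compare(reference: Sequence[bool], condition: Sequence[bool]) -> Dict[str, float]:
    """Compare a test condition against a reference run over the same cases."""
    ref, cond = accuracy(reference), accuracy(condition)
    return {
        "reference": ref,
        "condition": cond,
        "drop_pp": (ref - cond) * 100,                      # absolute drop, percentage points
        "retained": cond / ref if ref else float("nan"),   # share of reference accuracy kept
    }

# Placeholder per-case outcomes (True = correct diagnosis), for illustration only.
static_run = [True] * 8 + [False] * 2
interactive_run = [True] * 4 + [False] * 6

report = compare(static_run, interactive_run)
```

Running both conditions over the same case list keeps the comparison paired, so a drop reflects the change in format or injected bias rather than a change in case mix.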

Agent Features

Memory

notebook (persistent experiential memory)
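
One plausible reading of a persistent notebook is a bounded list of free-text notes that the doctor agent appends to after each case and that is prepended to later prompts. The sketch below assumes exactly that; `Notebook`, `max_notes`, and `as_prompt` are hypothetical names and are not taken from the AgentClinic code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Notebook:
    """Hypothetical persistent memory: keeps the most recent notes across cases."""
    max_notes: int = 50
    notes: List[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        """Append a note and drop the oldest ones once the cap is exceeded."""
        self.notes.append(note)
        overflow = len(self.notes) - self.max_notes
        if overflow > 0:
            del self.notes[:overflow]

    def as_prompt(self) -> str:
        """Render the notes as a block that can be prepended to the next case prompt."""
        if not self.notes:
            return ""
        lines = "\n".join(f"- {note}" for note in self.notes)
        return f"Notes from earlier cases:\n{lines}"

notebook = Notebook(max_notes=2)
notebook.add("Ask about travel history early.")
notebook.add("Order imaging before committing to a diagnosis.")
notebook.add("Re-check the initial hypothesis against new test results.")
```

With `max_notes=2`, the first note is evicted when the third is added, which keeps the prompt size bounded as more cases are run.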

Planning

sequential decision making (ask/order/diagnose)

Tool Use

adaptive retrieval, notebook memory, Chain-of-Thought, web/textbook search

Frameworks

OSCE template, agent moderator for evaluation

Is Agentic

Yes

Architectures

multi-agent simulation, OSCE-style agent pipeline

Collaboration

multi-agent debate, MedAgents task delegation

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://agentclinic.github.io/

Data URLs

MIMIC-IV (restricted access)

Risks & Boundaries

Limitations

Simulated patients and measurement readers are LLM-based and may not match real human behavior.

Moderator uses an LLM judge; potential bias if judge favors model-generated text.

When Not To Use

As evidence of clinical safety for live patient care without human oversight.

For legal or regulatory certification of medical devices.

Failure Modes

Model fixation on initial hypothesis (anchoring) leading to missed diagnoses.

Hallucinated or incomplete measurement results when instrument agent details are missing.

Core Entities

Models

Claude-3.5-Sonnet, GPT-4, GPT-4o, GPT-3.5, Mixtral-8x7B, Llama-3-70B, Llama-2-70B-chat, MedLlama3-8B, OpenBioLLM-70B, PMC-Llama-7B, Meditron-70B, o1-preview

Metrics

Accuracy, patient confidence, patient compliance, consultation rating

Datasets

AgentClinic-MedQA, AgentClinic-MIMIC-IV, AgentClinic-NEJM, AgentClinic-Spec, AgentClinic-Lang, MedQA, MIMIC-IV, NEJM, MedMCQA, USMLE

Benchmarks

AgentClinic
