AgentClinic: interactive, multimodal simulations that stress-test LLMs on real-style clinical decision making

May 13, 2024 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.7

Cost Impact Score

0.4

Citation Count

18

Authors

Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, Michael Moor

Links

Abstract / PDF

Why It Matters For Business

Static medical QA overstates real-world performance. Interactive, multimodal tests reveal gaps in data gathering, tool use, and bias handling that directly affect safety and product trust.

Summary TLDR

AgentClinic is an open-source benchmark that turns static medical QA into interactive clinical simulations. It runs doctor, patient, measurement, and moderator agents to test multimodal dialogue, image understanding, tool use (retrieval, notebook memory, chain-of-thought), specialist cases and seven languages. Strong models (Claude-3.5) beat others on average, but performance drops versus static QA, tools help unevenly, and biases plus limited interaction time materially harm outcomes and patient trust. The platform is designed for evaluation and red‑teaming, not deployment.

Problem Statement

Standard medical benchmarks give all facts up front and test single-turn QA. Real clinical work is sequential: you ask questions, order tests, read images, and manage bias and trust. We need an evaluation that simulates those steps, supports multimodal inputs, tool use, multilingual and specialist cases, and measures patient-centered outcomes.

Main Contribution

AgentClinic: an open-source interactive benchmark that runs doctor, patient, measurement and moderator language agents to simulate clinical workflows.
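The four-agent workflow can be pictured as a turn loop. The sketch below is a minimal illustration, not the benchmark's actual code: the `Case` fields, the `REQUEST TEST:` / `DIAGNOSIS:` message convention, and the keyword-matching patient and exact-match moderator are all assumptions standing in for LLM-backed agents.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical message convention for this sketch only:
#   "REQUEST TEST: <name>" asks the measurement agent for a result
#   "DIAGNOSIS: <text>"    ends the consultation
Reply = Callable[[List[Tuple[str, str]]], str]  # transcript -> next doctor utterance


@dataclass
class Case:
    patient_facts: dict   # symptom keyword -> what the patient says about it
    test_results: dict    # test name -> result text
    ground_truth: str


def run_case(case: Case, doctor: Reply, max_turns: int = 20):
    """Run one consultation and return (correct, transcript)."""
    transcript: List[Tuple[str, str]] = []
    for _ in range(max_turns):
        msg = doctor(transcript)
        transcript.append(("doctor", msg))
        if msg.startswith("DIAGNOSIS:"):
            guess = msg.split(":", 1)[1].strip().lower()
            # Moderator stub: exact string match instead of an LLM judge.
            return guess == case.ground_truth.lower(), transcript
        if msg.startswith("REQUEST TEST:"):
            name = msg.split(":", 1)[1].strip().lower()
            # Measurement stub: look up the result, default to a normal reading.
            transcript.append(("measurement", case.test_results.get(name, "normal readings")))
        else:
            # Patient stub: reveal any fact whose keyword appears in the question.
            hits = [v for k, v in case.patient_facts.items() if k in msg.lower()]
            transcript.append(("patient", " ".join(hits) or "I'm not sure."))
    return False, transcript  # turn budget exhausted without a diagnosis


def scripted_doctor(transcript):
    """Toy doctor: ask one question, order one test, then diagnose."""
    script = [
        "Do you have any chest pain?",
        "REQUEST TEST: ecg",
        "DIAGNOSIS: Myocardial infarction",
    ]
    turns_taken = sum(1 for role, _ in transcript if role == "doctor")
    return script[min(turns_taken, len(script) - 1)]
```

In this sketch the `max_turns` argument plays the role of the interaction budget: a doctor that has not committed to a diagnosis before the budget runs out is scored as incorrect.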

Case suite: 120 multimodal NEJM cases, 215 MedQA-derived OSCE cases, 200 MIMIC-IV-based cases, 260 specialist cases, and 749 multilingual cases.

Tool and bias experiments: integrated tools (adaptive retrieval, notebook memory, CoT variants) and 23 bias scenarios to measure diagnostic accuracy and patient-perception metrics.

Key Findings

Interactive, sequential format is harder than static QA.

Numbers: Diagnostic accuracy can fall below 10% of static baseline (paper statement).

Claude-3.5 outperformed most other models on AgentClinic-MedQA.

Numbers: Claude-3.5 62.1% vs GPT-4 51.6% vs human physicians 54% (MedQA configuration using a GPT-4 patient agent).

Tool effects vary strongly across models; persistent notebook helps some models a lot.

Numbers: Notebook effect examples: Claude +2.9 pp, GPT-4 +3.2 pp, Llama-3 reported +19.7 pp (paper also reports up to 92% rel. in

Bias reduces accuracy and strongly affects patient perceptions.

Numbers: GPT-4: unbiased 52% → patient cognitive bias 48% (−4 pp). Mixtral: 37% → 31% (doctor bias normalized 83.7%). Patient-rep

Less interaction time markedly harms diagnosis.

Numbers: With N=20 interactions accuracy is 52%; at N=10 it falls to 25% (absolute drop 27 pp); at N=30 it also drops, to 43%.
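The figures above mix two ways of reporting a change: an absolute difference in percentage points (pp) and a normalized accuracy (the changed accuracy as a percentage of the baseline). A small sketch of both, using the rounded numbers quoted above; the function names are illustrative only.

```python
def change_pp(baseline_pct: float, changed_pct: float) -> float:
    """Absolute change in percentage points (negative means a drop)."""
    return changed_pct - baseline_pct


def normalized_pct(baseline_pct: float, changed_pct: float) -> float:
    """Changed accuracy expressed as a percentage of the baseline accuracy."""
    return 100.0 * changed_pct / baseline_pct


# Interaction budget: N=20 (52%) vs N=10 (25%)
print(change_pp(52, 25))                  # -27
print(round(normalized_pct(52, 25), 1))   # 48.1

# Mixtral under doctor bias: 37% -> 31%
print(change_pp(37, 31))                  # -6
print(round(normalized_pct(37, 31), 1))   # 83.8
```

The rounded inputs give about 83.8% for Mixtral, close to the quoted 83.7%; the small gap is presumably due to rounding of the accuracies shown here.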

Results

Accuracy

  • Claude-3.5: 62.1% ± 3.3
  • GPT-4: 51.6% ± 3.3
  • Human physicians: 54% ± 28.5
  • 42.9% ± 3.3
  • 37.2% ± 2.2
  • 80.6% ± 5.6
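The page does not say how the ± values above were obtained. One common way to attach an uncertainty to an accuracy measured over a fixed number of cases is the binomial standard error; the sketch below assumes that approach and should not be read as the paper's method. The case count of 215 is borrowed from the MedQA-derived suite size listed earlier and is also an assumption.

```python
import math


def binomial_se_pp(accuracy_pct: float, n_cases: int) -> float:
    """Standard error of an accuracy, in percentage points, for n independent cases."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_cases)


# Assumed inputs: 62.1% accuracy over 215 cases.
print(round(binomial_se_pp(62.1, 215), 1))  # 3.3
```

Under these assumptions the result is in line with the first reported value, which is suggestive but does not confirm how the original figures were computed.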

Who Should Care

What To Try In 7 Days

Run your model on a few AgentClinic cases to compare static vs interactive accuracy.

Measure model gains from notebook memory and adaptive retrieval per model.

Inject common cognitive/implicit biases and record changes in accuracy and simulated patient trust.
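For the bias experiment, one simple approach is to append a bias instruction to an agent's system prompt and compare accuracy with and without it. The sketch below assumes that approach; the bias texts and function names are hypothetical and are not the benchmark's actual prompts.

```python
from typing import Optional, Sequence

# Hypothetical bias snippets for illustration; not taken from AgentClinic.
BIAS_SNIPPETS = {
    "recency": "You recently saw a similar patient and keep thinking about that case.",
    "confirmation": "You formed an early impression and favor evidence that supports it.",
}


def build_system_prompt(base: str, bias: Optional[str] = None) -> str:
    """Return the base prompt, optionally followed by one bias instruction."""
    if bias is None:
        return base
    if bias not in BIAS_SNIPPETS:
        raise KeyError(f"unknown bias: {bias}")
    return f"{base}\n\n{BIAS_SNIPPETS[bias]}"


def accuracy_pct(outcomes: Sequence[bool]) -> float:
    """Share of correct diagnoses, as a percentage."""
    return 100.0 * sum(outcomes) / len(outcomes)


def bias_effect_pp(unbiased: Sequence[bool], biased: Sequence[bool]) -> float:
    """Accuracy change in percentage points when the bias is injected."""
    return accuracy_pct(biased) - accuracy_pct(unbiased)
```

Running the same cases once with `bias=None` and once per bias key gives a per-bias delta; simulated patient ratings could be logged alongside the outcomes in the same way.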

Agent Features

Memory

  • notebook (persistent experiential memory)

Planning

  • sequential decision making (ask/order/diagnose)

Tool Use

  • adaptive retrieval
  • notebook memory
  • Chain-of-Thought
  • web/textbook search
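A persistent notebook can be approximated as a bounded list of notes that survives across cases and is prepended to the doctor prompt. This is a minimal sketch under that assumption; the class name, note format, and size limit are illustrative and not taken from the benchmark.

```python
from collections import deque


class Notebook:
    """Keeps short free-text notes across cases; oldest notes drop out first."""

    def __init__(self, max_notes: int = 50):
        self._notes = deque(maxlen=max_notes)

    def add(self, note: str) -> None:
        note = note.strip()
        if note:  # ignore empty notes
            self._notes.append(note)

    def render(self) -> str:
        """Format the notes as a prompt section, or return '' if there are none."""
        if not self._notes:
            return ""
        lines = "\n".join(f"- {n}" for n in self._notes)
        return f"Notes from earlier cases:\n{lines}"


def doctor_prompt(base: str, notebook: Notebook) -> str:
    """Append the notebook section to the base prompt when notes exist."""
    notes = notebook.render()
    return f"{base}\n\n{notes}" if notes else base
```

Capping the notebook keeps the prompt length bounded; measuring accuracy with and without the notebook section then gives the per-model gain.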

Frameworks

  • OSCE template
  • agent moderator for evaluation

Is Agentic

true

Architectures

  • multi-agent simulation
  • OSCE-style agent pipeline

Collaboration

  • multi-agent debate
  • MedAgents task delegation

Reproducibility

Data Urls

  • MIMIC-IV (restricted access)

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Simulated patients and measurement readers are LLM-based and may not match real human behavior.
  • Moderator uses an LLM judge; potential bias if judge favors model-generated text.
  • Proprietary model training data unknown; possible data leakage from benchmark sources.
  • Environment omits roles like nurses, relatives, and operational constraints (beds, scheduling).

When Not To Use

  • As evidence of clinical safety for live patient care without human oversight.
  • For legal or regulatory certification of medical devices.
  • As a basis for assuming real patient trust or compliance; simulated agents are imperfect proxies.

Failure Modes

  • Model fixation on initial hypothesis (anchoring) leading to missed diagnoses.
  • Hallucinated or incomplete measurement results when instrument agent details are missing.
  • Tool misuse degrading performance (some tools reduced accuracy for certain models).
  • Cross-model communication mismatches when different LLMs act as patient and doctor.

Core Entities

Models

  • Claude-3.5-Sonnet
  • GPT-4
  • GPT-4o
  • GPT-3.5
  • Mixtral-8x7B
  • Llama-3-70B
  • Llama-2-70B-chat
  • MedLlama3-8B
  • OpenBioLLM-70B
  • PMC-Llama-7B
  • Meditron-70B
  • o1-preview

Metrics

  • Accuracy
  • patient confidence
  • patient compliance
  • consultation rating

Datasets

  • AgentClinic-MedQA
  • AgentClinic-MIMIC-IV
  • AgentClinic-NEJM
  • AgentClinic-Spec
  • AgentClinic-Lang
  • MedQA
  • MIMIC-IV
  • NEJM
  • MedMCQA
  • USMLE

Benchmarks

  • AgentClinic