MedAgentBench: a realistic FHIR-based EHR playground and 300-task benchmark for medical LLM agents

Overview

Decision Snapshot: Needs Validation

The benchmark is immediately useful for development and model comparison, but the models evaluated are not production-ready; use it for staging and pre-deployment tests.

Citations: 3

Evidence Strength: 0.85

Confidence: 0.88

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 30%

Novelty: 60%

Authors

Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y. Ng, Jonathan H. Chen

Why It Matters For Business

MedAgentBench provides a practical testbed to measure LLM agents on real EHR tasks so teams can benchmark readiness before risky EHR integration.

Who Should Care

CTO, Product Manager, ML Engineer, Founder

Summary TLDR

MedAgentBench is a released benchmark and simulated EHR environment that measures how well LLMs can act as medical agents. It includes 300 clinician-written tasks across 10 categories, 100 realistic patient profiles with ~785k records, and a FHIR-compliant API with a Docker image. The authors evaluated 12 LLMs (pass@1, at most 8 interaction rounds). The best model (Claude 3.5 Sonnet v2) reached a 69.67% overall success rate, and query tasks had higher success rates than action tasks. The suite is intended for development and research, not production deployment.

Problem Statement

There was no standard, interactive benchmark that tests LLMs as autonomous agents inside electronic health record (EHR) systems. That gap makes it hard to measure progress, compare models, and move agentic LLMs toward safe clinical use.

Main Contribution

MedAgentBench dataset: 300 clinician-written, verifiable tasks across 10 categories covering common EHR actions.

Interactive FHIR environment: 100 deidentified patient profiles (~785k records), HAPI FHIR server, Docker image, and standard API hooks to run agents.
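As a rough illustration of what talking to a FHIR server looks like, the sketch below builds a standard FHIR search URL. The base URL and the patient identifier are assumptions made for this example and are not taken from the paper; the actual address depends on how the Docker image is started.

```python
from urllib.parse import urlencode

# Assumption: address of a locally running FHIR server. The real base URL
# depends on how the environment's Docker image is launched.
FHIR_BASE = "http://localhost:8080/fhir"


def build_search_url(resource_type: str, **params: str) -> str:
    """Return a FHIR search URL of the form <base>/<ResourceType>?key=value."""
    base = f"{FHIR_BASE}/{resource_type}"
    # Sort parameters so the resulting URL is deterministic.
    query = urlencode(sorted(params.items()))
    return f"{base}?{query}" if query else base


# Hypothetical patient identifier, used only for illustration.
url = build_search_url("Observation", patient="example-patient-1", _count="5")
print(url)
```

An agent's GET call would be issued against a URL of this shape; the function only builds the string and sends nothing, so it is safe to run without a server.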

Key Findings

The top-performing model achieved substantial but imperfect task success.

Numbers: Claude 3.5 Sonnet v2 overall SR = 69.67%

Practical Use: Models can complete many EHR tasks, but roughly 30% of tasks still fail; do not deploy without human oversight.

Evidence Ref: Table 3

Information-retrieval tasks are easier than action-modifying tasks.

Numbers: Claude query SR = 85.33% vs action SR = 54.00%

Practical Use: Start by automating read-only workflows (reports, lookups) before write/modify actions in EHRs.

Evidence Ref: Table 3
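The query, action, and overall success rates can be cross-checked with simple arithmetic. The sketch below assumes, purely for illustration, that the 300 tasks split evenly into 150 query and 150 action tasks; that split is not stated in this summary.

```python
# Assumption for illustration only: an even 150/150 split of the 300 tasks
# between query and action tasks. This summary does not state the split.
N_QUERY = 150
N_ACTION = 150

query_sr = 85.33   # percent, as reported for Claude 3.5 Sonnet v2
action_sr = 54.00  # percent, as reported for Claude 3.5 Sonnet v2

# Convert the reported percentages back into whole-task counts.
query_passed = round(query_sr / 100 * N_QUERY)
action_passed = round(action_sr / 100 * N_ACTION)

# Overall success rate is total passed tasks over total tasks.
overall_sr = 100 * (query_passed + action_passed) / (N_QUERY + N_ACTION)
print(query_passed, action_passed, round(overall_sr, 2))
```

Under that assumption the two per-type rates reproduce the reported 69.67% overall figure, which suggests the numbers are internally consistent.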

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
models evaluated	12	—	—	MedAgentBench	Twelve state-of-the-art LLMs were benchmarked	Section 2.4.2 and Table 3
best overall success rate	69.67%	—	—	MedAgentBench overall	Claude 3.5 Sonnet v2 overall SR = 69.67%	Table 3

What To Try In 7 Days

Run MedAgentBench on your candidate LLMs to identify query vs action weaknesses.

Start automations with read-only tasks (reports/lookup) where models perform better.

Add strict output-schema checks and logging before enabling any write actions.
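One way to approach the schema-check-and-log step is a gate that inspects a proposed write before anything is sent. The sketch below is a minimal example under stated assumptions: the required-field list is hypothetical and would need to match the resources a given deployment accepts.

```python
import json
import logging
from typing import Tuple

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ehr_write_gate")

# Hypothetical minimum schema for a write request; adapt it to the FHIR
# resources your deployment actually accepts.
REQUIRED_FIELDS = {"resourceType": str, "subject": dict, "status": str}


def validate_write(payload_text: str) -> Tuple[bool, str]:
    """Check a proposed POST body and return (ok, reason) without sending it."""
    try:
        payload = json.loads(payload_text)
    except json.JSONDecodeError as exc:
        log.warning("rejected write: invalid JSON (%s)", exc)
        return False, "invalid JSON"
    if not isinstance(payload, dict):
        log.warning("rejected write: payload is not a JSON object")
        return False, "payload is not an object"
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(field), expected_type):
            log.warning("rejected write: bad field %s", field)
            return False, f"missing or mistyped field: {field}"
    log.info("accepted write for resourceType=%s", payload["resourceType"])
    return True, "ok"
```

Every decision is logged, so rejected writes leave an audit trail even though no request reaches the EHR.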

Agent Features

Memory

short-term conversation rounds (max 8)

Planning

plans by selecting FHIR function calls (GET/POST)

Tool Use

FHIR REST API calls, GET/POST function invocation

Frameworks

AgentBench-inspired orchestrator

Is Agentic

Yes

Architectures

single LLM orchestrator (baseline)

Collaboration

not multi-agent; single-agent baseline
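The features above describe a single agent that issues GET/POST calls for at most 8 rounds. A minimal sketch of such a loop follows; the action strings (`GET ...`, `POST ...`, `FINISH ...`) and the `policy` and `environment` callables are hypothetical stand-ins and not the benchmark's actual interface.

```python
from typing import Callable, List, Optional, Tuple

MAX_ROUNDS = 8  # interaction cap reported for the benchmark runs


def run_episode(
    task: str,
    policy: Callable[[List[str]], str],
    environment: Callable[[str], str],
) -> Tuple[Optional[str], int]:
    """Run one task; return (final action or None, rounds used)."""
    history: List[str] = [task]
    for round_number in range(1, MAX_ROUNDS + 1):
        action = policy(history)
        history.append(action)
        if action.startswith("FINISH"):
            return action, round_number
        # Hypothetical action format: only GET and POST reach the environment.
        if action.startswith(("GET ", "POST ")):
            observation = environment(action)
        else:
            observation = "error: unrecognized action"
        history.append(observation)
    return None, MAX_ROUNDS  # round budget exhausted without an answer


def scripted_policy(history: List[str]) -> str:
    """Stand-in for an LLM: one lookup, then finish."""
    return "GET Patient?name=example" if len(history) == 1 else "FINISH done"


result = run_episode("Find the patient", scripted_policy, lambda action: "{}")
print(result)
```

Returning `None` when the budget runs out lets an evaluator count a timeout as a failure rather than as a wrong answer.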

Optimization Features

Infra Optimization

Docker container for environment deployment

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/stanfordmlgroup/MedAgentBench

https://hub.docker.com/r/jyxsu6/medagentbench

Data URLs

https://github.com/stanfordmlgroup/MedAgentBench

https://hub.docker.com/r/jyxsu6/medagentbench

Risks & Boundaries

Limitations

The environment simulates an EHR via FHIR but lacks production security, logging, and enterprise integrations.

Patient profiles come from a single institution (STARR) and may not generalize.

When Not To Use

Do not use this as a drop-in test for production safety or compliance without extra security measures.

Not suitable for multimodal tasks (images, waveforms) or real-time clinical decision making without human review.

Failure Modes

Invalid API formatting or malformed GET/POST requests.

Wrong output format (free text vs expected structured value).
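The wrong-output-format failure can be caught early by refusing anything that does not parse as the expected structure. The sketch below assumes the expected answer is a JSON list; that convention is an assumption for this example and is not taken from the paper.

```python
import json


def parse_structured_answer(raw: str) -> list:
    """Return the answer as a list, or raise ValueError for free text."""
    try:
        value = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        raise ValueError(f"answer is not valid JSON: {raw!r}") from exc
    # Assumption: a well-formed answer is always a JSON list of values.
    if not isinstance(value, list):
        raise ValueError(f"expected a JSON list, got {type(value).__name__}")
    return value


print(parse_structured_answer("[4.2]"))
try:
    parse_structured_answer("The potassium level is 4.2")
except ValueError as err:
    print("rejected:", err)
```

A free-text reply that contains the right number is still rejected, which mirrors how a verifiable benchmark task would score it.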

Core Entities

Models

Claude 3.5 Sonnet v2, GPT-4o, GPT-4o mini, o3-mini, Gemini 2.0 Pro, Gemini 2.0 Flash, Gemini 1.5 Pro, DeepSeek-V3, Qwen2.5, Llama 3.3, Gemma2, Mistral v0.3

Metrics

task success rate (SR), query SR, action SR, difficulty-stratified SR

Datasets

MedAgentBench (this paper), STARR deidentified EHR extract (patient source)

Benchmarks

AgentBench, AgentClinic, AgentBoard
