MedAgentBench: a realistic FHIR-based EHR playground and 300-task benchmark for medical LLM agents

January 24, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark is immediately useful for developing and comparing agents, but the models evaluated are not production-ready; use it for staging and pre-deployment tests.

Citations: 3

Evidence Strength: 0.85

Confidence: 0.88

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 30%

Novelty: 60%

Authors

Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y. Ng, Jonathan H. Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

MedAgentBench provides a practical testbed to measure LLM agents on real EHR tasks so teams can benchmark readiness before risky EHR integration.

Summary TLDR

MedAgentBench is a released benchmark and simulated EHR environment that measures how well LLMs can act as medical agents. It includes 300 clinician-written tasks across 10 categories, 100 realistic patient profiles with ~785k records, and a FHIR-compliant API with a Docker image. The authors evaluated 12 LLMs using pass@1 with at most 8 interaction rounds per task. The best model (Claude 3.5 Sonnet v2) reached a 69.67% overall success rate, and query tasks were solved more often than action tasks (85.33% vs 54.00% for that model). The suite is intended for development and research, not production deployment.

Problem Statement

There was no standard, interactive benchmark that tests LLMs as autonomous agents inside electronic health record (EHR) systems. That gap makes it hard to measure progress, compare models, and move agentic LLMs toward safe clinical use.

Main Contribution

MedAgentBench dataset: 300 clinician-written, verifiable tasks across 10 categories covering common EHR actions.

Interactive FHIR environment: 100 deidentified patient profiles (~785k records), HAPI FHIR server, Docker image, and standard API hooks to run agents.
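As a rough illustration of what a read against such a FHIR server might look like, the sketch below builds a standard FHIR search URL of the form `[base]/[type]?name=value`. The base URL, patient ID, and lab code are placeholders assumed for illustration; they are not taken from the benchmark.

```python
from urllib.parse import urlencode

# Assumption: a locally running FHIR server. The actual host, port, and
# path exposed by the MedAgentBench Docker image may differ.
FHIR_BASE = "http://localhost:8080/fhir"


def build_search_url(resource_type: str, **params: str) -> str:
    """Build a FHIR search URL of the form [base]/[type]?name=value&..."""
    query = urlencode(sorted(params.items()))
    if not query:
        return f"{FHIR_BASE}/{resource_type}"
    return f"{FHIR_BASE}/{resource_type}?{query}"


# Hypothetical patient ID and lab code, used only to show the URL shape.
url = build_search_url("Observation", patient="example-patient-1", code="GLU")
print(url)
```

Sorting the parameters keeps the generated URL deterministic, which makes logged requests easier to compare across runs.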

Key Findings

The top-performing model achieved substantial but imperfect task success.

Numbers: Claude 3.5 Sonnet v2 overall SR = 69.67%

Practical Use: Models can complete many EHR tasks, but even the best model fails on more than 30% of them; do not deploy without human oversight.

Evidence Ref: Table 3

Information-retrieval tasks are easier than action-modifying tasks.

Numbers: Claude query SR = 85.33% vs action SR = 54.00%

Practical Use: Start by automating read-only workflows (reports, lookups) before write/modify actions in EHRs.

Evidence Ref: Table 3
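The query and action subscores are consistent with the 69.67% overall figure if the 300 tasks split evenly between the two kinds. That even split is an assumption made here for the arithmetic, not something this summary states.

```python
# Reported success rates for Claude 3.5 Sonnet v2 (Table 3).
QUERY_SR = 85.33
ACTION_SR = 54.00
OVERALL_SR = 69.67

TOTAL_TASKS = 300
# Assumption: tasks are split evenly between query and action tasks.
QUERY_TASKS = ACTION_TASKS = TOTAL_TASKS // 2

# Convert percentages back to whole-task counts.
query_passed = round(QUERY_SR / 100 * QUERY_TASKS)
action_passed = round(ACTION_SR / 100 * ACTION_TASKS)

# Recompute the overall success rate from the task counts.
recomputed = 100 * (query_passed + action_passed) / TOTAL_TASKS
print(f"{query_passed} + {action_passed} of {TOTAL_TASKS} -> {recomputed:.2f}%")
```

Under that assumption the counts come out to 128 and 81 passed tasks, that is 209 of 300, which rounds to the reported 69.67%.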

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
models evaluated | 12 | - | - | MedAgentBench | Twelve state-of-the-art LLMs were benchmarked | Section 2.4.2 and Table 3
best overall success rate | 69.67% | - | - | MedAgentBench overall | Claude 3.5 Sonnet v2 overall SR = 69.67% | Table 3

What To Try In 7 Days

Run MedAgentBench on your candidate LLMs to identify query vs action weaknesses.

Start automations with read-only tasks (reports/lookup) where models perform better.

Add strict output-schema checks and logging before enabling any write actions.
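One way to approach the last item is a small gate that validates a proposed write payload and logs the decision before anything is sent. The required fields below are hypothetical and chosen only for illustration; a real check would follow the FHIR profile of the resource being written.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ehr_write_gate")

# Hypothetical minimal schema: field name -> expected Python type.
REQUIRED_FIELDS = {"resourceType": str, "status": str, "subject": dict}


def schema_errors(payload: object) -> list[str]:
    """Return a list of problems; an empty list means the payload passes."""
    if not isinstance(payload, dict):
        return ["payload is not a JSON object"]
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            errors.append(f"wrong type for field: {field}")
    return errors


def allow_write(payload: object) -> bool:
    """Log the decision and return True only if the payload passes the check."""
    errors = schema_errors(payload)
    if errors:
        log.warning("write blocked: %s", "; ".join(errors))
        return False
    log.info("write allowed for resourceType=%s", payload["resourceType"])
    return True


# Illustrative payloads; the patient reference is a placeholder.
good = {"resourceType": "MedicationRequest", "status": "active",
        "subject": {"reference": "Patient/example-patient-1"}}
bad = {"resourceType": "MedicationRequest", "status": "active"}
print(allow_write(good), allow_write(bad))
```

Keeping the check separate from the logging decision makes it easy to run the same validation offline against recorded agent outputs.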

Agent Features

Memory
short-term conversation rounds (max 8)
Planning
plans via selecting FHIR function calls (GET/POST)
Tool Use
FHIR REST API calls; GET/POST function invocation
Frameworks
AgentBench-inspired orchestrator
Is Agentic

Yes

Architectures
single LLM orchestrator (baseline)
Collaboration
not multi-agent; single-agent baseline
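The single-agent setup above can be pictured as a bounded loop: the model sees the conversation so far, emits one action, and the environment replies, for at most 8 rounds. The sketch below uses a stub policy and a stub environment. The `FINISH` marker, the action string format, and all function names are illustrative assumptions, not the benchmark's actual interface.

```python
from typing import Callable, Optional

MAX_ROUNDS = 8  # interaction cap reported for the benchmark

History = list[tuple[str, str]]  # (role, text) pairs kept as short-term memory


def run_episode(policy: Callable[[History], str],
                environment: Callable[[str], str],
                task: str) -> Optional[str]:
    """Run one task; return the final answer, or None if the cap is hit."""
    history: History = [("user", task)]
    for _ in range(MAX_ROUNDS):
        action = policy(history)
        history.append(("agent", action))
        if action.startswith("FINISH "):  # hypothetical termination marker
            return action[len("FINISH "):]
        # Any other action goes to the environment; its reply is fed back.
        history.append(("user", environment(action)))
    return None  # ran out of rounds without a final answer


def stub_policy(history: History) -> str:
    """Stand-in for an LLM: look something up once, then give a canned answer."""
    if len(history) == 1:
        return "GET Observation?patient=example-patient-1"
    return "FINISH 42"


def stub_environment(action: str) -> str:
    """Stand-in for the FHIR server: return a canned reply."""
    return f"stub response to: {action}"


print(run_episode(stub_policy, stub_environment, "What is the latest value?"))
print(run_episode(lambda h: "GET Patient", stub_environment, "Never finishes"))
```

Returning `None` when the cap is reached lets a harness count the episode as a failure without guessing at a partial answer.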

Optimization Features

Infra Optimization
Docker container for environment deployment

Risks & Boundaries

Limitations

Environment simulates EHR via FHIR but lacks production security, logging, and enterprise integrations.

Patient profiles come from a single institution (STARR) and may not generalize.

When Not To Use

Do not use this as a drop-in test for production safety or compliance without extra security measures.

Not suitable for multimodal tasks (images, waveforms) or real-time clinical decision making without human review.

Failure Modes

Invalid API formatting or malformed GET/POST requests.

Wrong output format (free text vs expected structured value).
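Both failure modes can be caught with cheap checks before a reply reaches the server or the grader. The sketch below assumes a hypothetical reply format (`GET <url>` on one line, or `POST <url>` followed by a JSON body) and a task that expects a single numeric answer; neither detail comes from the benchmark itself.

```python
import json


def check_request(reply: str) -> str:
    """Classify an agent reply as 'ok' or name the formatting problem."""
    head, _, body = reply.strip().partition("\n")
    parts = head.split()
    if len(parts) != 2 or parts[0] not in ("GET", "POST"):
        return "malformed request line"
    if parts[0] == "GET":
        return "ok" if not body.strip() else "unexpected body on GET"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "POST body is not valid JSON"
    return "ok" if isinstance(payload, dict) else "POST body is not a JSON object"


def is_structured_number(answer: str) -> bool:
    """True if the answer is a bare number rather than free text."""
    try:
        float(answer.strip())
    except ValueError:
        return False
    return True


# Illustrative replies; resource names and values are placeholders.
print(check_request("GET Observation?patient=example-patient-1"))
print(check_request("POST MedicationRequest\n{not json}"))
print(is_structured_number("7.2"), is_structured_number("The value is 7.2 mg/dL"))
```

Returning a short reason string instead of a boolean makes it straightforward to tally which failure mode dominates for a given model.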

Core Entities

Models

Claude 3.5 Sonnet v2, GPT-4o, GPT-4o mini, o3-mini, Gemini 2.0 Pro, Gemini 2.0 Flash, Gemini 1.5 Pro, DeepSeek-V3, Qwen2.5, Llama 3.3, Gemma2, Mistral v0.3

Metrics

task success rate (SR), query SR, action SR, difficulty-stratified SR

Datasets

MedAgentBench (this paper), STARR deidentified EHR extract (patient source)

Benchmarks

AgentBench, AgentClinic, AgentBoard