MedAgentBench: a realistic FHIR-based EHR playground and 300-task benchmark for medical LLM agents

January 24, 2025, 7 min read

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

3

Authors

Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y. Ng, Jonathan H. Chen

Why It Matters For Business

MedAgentBench provides a practical testbed for measuring LLM agents on realistic EHR tasks, so teams can benchmark readiness before attempting a risky EHR integration.

Summary TLDR

MedAgentBench is a released benchmark and simulated EHR environment that measures how well LLMs can act as medical agents. It includes 300 clinician-written tasks across 10 categories, 100 realistic patient profiles with ~785k records, and a FHIR-compliant API packaged with a Docker image. The authors evaluated 12 LLMs using pass@1 with at most 8 interaction rounds per task. The best model (Claude 3.5 Sonnet v2) reached a 69.67% overall success rate, and query tasks were solved more often than action tasks. The suite is intended for development and research, not production deployment.

Problem Statement

There was no standard, interactive benchmark that tests LLMs as autonomous agents inside electronic health record (EHR) systems. That gap makes it hard to measure progress, compare models, and move agentic LLMs toward safe clinical use.

Main Contribution

MedAgentBench dataset: 300 clinician-written, verifiable tasks across 10 categories covering common EHR actions.

Interactive FHIR environment: 100 deidentified patient profiles (~785k records), HAPI FHIR server, Docker image, and standard API hooks to run agents.

Baseline evaluations: pass@1 evaluation of 12 state-of-the-art LLMs with public results and error analysis to show current capabilities and gaps.

Key Findings

Even the top-performing model achieved substantial but far from perfect task success.

Numbers: Claude 3.5 Sonnet v2 overall SR = 69.67%

Information-retrieval tasks are easier than action-modifying tasks.

Numbers: Claude query SR = 85.33% vs action SR = 54.00%

Task difficulty and step count strongly affect success.

Numbers: Claude easy SR = 100%; hard SR = 23.33%

Common failure modes are output-formatting errors and invalid API calls.

Numbers: Gemini 2.0 Flash invalid-action rate reported as 54% in examples

Results

models evaluated

Value: 12

best overall success rate

Value: 69.67%

best query vs action gap

Value: 85.33% vs 54.00%

records and patient profiles

Value: 785,207 total records; 100 patients

What To Try In 7 Days

Run MedAgentBench on your candidate LLMs to identify query vs action weaknesses.

Start automations with read-only tasks (reports/lookup) where models perform better.

Add strict output-schema checks and logging before enabling any write actions.
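
The last two suggestions can be combined into a small gate that sits between the agent and the EHR. The sketch below is illustrative only: the action string format (`<METHOD> <url>` with an optional JSON body on the following lines), the names `parse_action`, `guard`, and `ALLOWED_METHODS`, and the example URLs are all assumptions made for this example, not part of MedAgentBench. It rejects malformed output, logs every decision, and blocks write actions until POST is explicitly allowed.

```python
import json
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("agent_guard")

# Read-only by default; add "POST" only after write actions have been reviewed.
ALLOWED_METHODS = {"GET"}


@dataclass
class AgentAction:
    method: str
    url: str
    body: Optional[dict] = None


def parse_action(raw: str) -> AgentAction:
    """Parse a hypothetical action string: '<METHOD> <url>' plus an optional JSON body."""
    lines = raw.strip().splitlines()
    if not lines:
        raise ValueError("empty action")
    parts = lines[0].split(maxsplit=1)
    if len(parts) != 2:
        raise ValueError("expected '<METHOD> <url>' on the first line")
    method, url = parts[0].upper(), parts[1].strip()
    if method not in {"GET", "POST"}:
        raise ValueError(f"unsupported method: {method}")
    if not url.startswith(("http://", "https://")):
        raise ValueError("URL must be absolute")
    payload = "\n".join(lines[1:]).strip()
    body = None
    if method == "POST":
        if not payload:
            raise ValueError("POST requires a JSON body")
        body = json.loads(payload)  # JSONDecodeError is a ValueError subclass
        if not isinstance(body, dict):
            raise ValueError("POST body must be a JSON object")
    elif payload:
        raise ValueError("GET must not carry a body")
    return AgentAction(method, url, body)


def guard(raw: str) -> Optional[AgentAction]:
    """Return the parsed action if it is well-formed and permitted, else None."""
    try:
        action = parse_action(raw)
    except ValueError as exc:
        logger.warning("rejected malformed action: %s", exc)
        return None
    if action.method not in ALLOWED_METHODS:
        logger.warning("blocked %s %s (write actions disabled)", action.method, action.url)
        return None
    logger.info("allowed %s %s", action.method, action.url)
    return action
```

Keeping parsing (`parse_action`) separate from policy (`guard`) means the same schema check can later be reused when POST is enabled, with only `ALLOWED_METHODS` changing.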

Agent Features

Memory

  • short-term conversation rounds (max 8)
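
A bounded interaction loop of this kind might look like the sketch below. It is a minimal illustration, not the benchmark's orchestrator: the `model` and `environment` callables, the `FINISH` marker used to signal a final answer, and the message format are assumptions introduced here. The only detail taken from the summary is the cap of 8 rounds.

```python
from typing import Callable, List, Optional, Tuple

MAX_ROUNDS = 8  # interaction cap reported for the benchmark runs

Message = Tuple[str, str]  # (role, content)


def run_episode(
    task: str,
    model: Callable[[List[Message]], str],
    environment: Callable[[str], str],
    max_rounds: int = MAX_ROUNDS,
) -> Optional[str]:
    """Run one task; return the final reply, or None if the round budget runs out."""
    # The only "memory" is the running conversation history for this task.
    history: List[Message] = [("user", task)]
    for _ in range(max_rounds):
        reply = model(history)
        history.append(("agent", reply))
        if reply.startswith("FINISH"):  # hypothetical end-of-task marker
            return reply
        # Otherwise treat the reply as an action and feed back the result.
        observation = environment(reply)
        history.append(("user", observation))
    return None  # counts as a failure under a single-attempt (pass@1) evaluation
```

Because nothing persists between episodes, each task starts from a fresh history, which is one reason long multi-step tasks can exhaust the budget.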

Planning

  • plans by selecting FHIR function calls (GET/POST)

Tool Use

  • FHIR REST API calls
  • GET/POST function invocation
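
For readers unfamiliar with FHIR, a GET call is typically a search against a resource type. The sketch below builds such a request for recent observations of one patient. The base URL, patient identifier, and code value are placeholders assumed for illustration; `patient`, `code`, `_sort`, and `_count` are standard FHIR search parameters, and `application/fhir+json` is the standard FHIR JSON media type. The request is constructed but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed address of a locally running FHIR server; adjust to your deployment.
FHIR_BASE = "http://localhost:8080/fhir"


def observation_search_url(patient_id: str, code: str, count: int = 10,
                           base: str = FHIR_BASE) -> str:
    """Build a FHIR search URL for a patient's most recent observations."""
    params = {
        "patient": patient_id,  # restrict to one patient
        "code": code,           # which observation type to look up
        "_sort": "-date",       # newest first
        "_count": count,        # page size
    }
    return f"{base}/Observation?{urlencode(params)}"


def build_get_request(url: str) -> Request:
    """Wrap the URL in a GET request that asks for FHIR JSON."""
    return Request(url, headers={"Accept": "application/fhir+json"}, method="GET")


url = observation_search_url("S123", "GLU")  # placeholder patient id and code
request = build_get_request(url)
print(request.get_method(), request.full_url)
```

A POST would follow the same pattern but carry a JSON resource in the request body, which is where the formatting errors listed under Failure Modes tend to appear.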

Frameworks

  • AgentBench-inspired orchestrator

Is Agentic

true

Architectures

  • single LLM orchestrator (baseline)

Collaboration

  • not multi-agent; single-agent baseline

Optimization Features

Infra Optimization

  • Docker container for environment deployment

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Environment simulates EHR via FHIR but lacks production security, logging, and enterprise integrations.
  • Patient profiles come from a single institution (STARR) and may not generalize.
  • Evaluation used pass@1 and a simple orchestrator; other agent designs may behave differently.
  • Benchmark focuses on medical-record interactions; it omits surgical, multimodal, and team-coordination tasks.

When Not To Use

  • Do not use this as a drop-in test for production safety or compliance without extra security measures.
  • Not suitable for multimodal tasks (images, waveforms) or real-time clinical decision making without human review.
  • Avoid using benchmarked open-weight results as sole evidence for clinical deployment.

Failure Modes

  • Invalid API formatting or malformed GET/POST requests.
  • Wrong output format (free text vs expected structured value).
  • Incorrect modifications to records (logic or dosing errors).
  • High failure rate on multi-step/hard tasks (chaining errors).

Core Entities

Models

  • Claude 3.5 Sonnet v2
  • GPT-4o
  • GPT-4o mini
  • o3-mini
  • Gemini 2.0 Pro
  • Gemini 2.0 Flash
  • Gemini 1.5 Pro
  • DeepSeek-V3
  • Qwen2.5
  • Llama 3.3
  • Gemma2
  • Mistral v0.3

Metrics

  • task success rate (SR)
  • query SR
  • action SR
  • difficulty-stratified SR
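
These metrics are all the same calculation applied to different slices of the task set. The sketch below computes an overall and per-group success rate from a list of `(group, passed)` pairs. The function name and the toy results are assumptions for illustration, not benchmark data. With pass@1, each task contributes exactly one pass/fail outcome.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return success rate (in percent) per group, plus an 'overall' entry."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for group, ok in results:
        for key in (group, "overall"):
            totals[key] += 1
            if ok:
                passed[key] += 1
    return {key: 100.0 * passed[key] / totals[key] for key in totals}


# Toy outcomes, not benchmark results: 4 query tasks and 4 action tasks.
toy = ([("query", True)] * 3 + [("query", False)]
       + [("action", True)] + [("action", False)] * 3)
rates = success_rates(toy)
print(rates)
```

When groups are the same size, the overall rate is the plain average of the group rates. The reported figures fit that pattern, since (85.33 + 54.00) / 2 ≈ 69.67, which would be consistent with equally sized query and action groups, although the split is not stated here.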

Datasets

  • MedAgentBench (this paper)
  • STARR deidentified EHR extract (patient source)

Benchmarks

  • AgentBench
  • AgentClinic
  • AgentBoard