Overview
The benchmark is immediately useful for developing and comparing LLM agents, but the models it evaluates are not production-ready. Use it for staging and pre-deployment tests.
Citations: 3
Evidence Strength: 0.85
Confidence: 0.88
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 30%
Novelty: 60%
Why It Matters For Business
MedAgentBench provides a practical testbed to measure LLM agents on real EHR tasks so teams can benchmark readiness before risky EHR integration.
Summary TLDR
MedAgentBench is a released benchmark and simulated EHR environment that measures how well LLMs can act as medical agents. It includes 300 clinician-written tasks across 10 categories, 100 realistic patient profiles with ~785k records, and a FHIR-compliant API with a Docker image. The authors evaluated 12 LLMs using pass@1 with at most 8 interaction rounds per task. The best model (Claude 3.5 Sonnet v2) reached a 69.67% overall success rate, and models did better on query tasks than on action tasks. The suite is intended for development and research, not production deployment.
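The evaluation protocol (pass@1 with at most 8 interaction rounds) can be sketched as a single-attempt loop with a round budget. This is a minimal illustration, not the benchmark's actual harness: the `Agent` and `Env` callables, `run_once`, and `success_rate` are hypothetical names, and treating a task that exhausts its rounds as a failure is an assumption.

```python
from typing import Callable, List, Tuple

MAX_ROUNDS = 8  # interaction budget stated in the summary

# Hypothetical interfaces, introduced only for this sketch.
# Agent: receives the conversation so far, returns its next message.
Agent = Callable[[List[str]], str]
# Env: receives the agent message, returns (observation, done, success).
Env = Callable[[str], Tuple[str, bool, bool]]


def run_once(agent: Agent, env: Env, task_prompt: str,
             max_rounds: int = MAX_ROUNDS) -> bool:
    """Run one attempt (pass@1) of a task with a fixed round budget."""
    history = [task_prompt]
    for _ in range(max_rounds):
        message = agent(history)
        observation, done, success = env(message)
        history += [message, observation]
        if done:
            return success
    # Assumption: running out of rounds counts as a failed task.
    return False


def success_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks solved on the single allowed attempt."""
    return 100.0 * sum(outcomes) / len(outcomes)
```

Under this definition, and assuming every one of the 300 tasks was attempted once, the reported 69.67% is consistent with 209 solved tasks, since 209 / 300 ≈ 0.6967.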
Problem Statement
There was no standard, interactive benchmark that tests LLMs as autonomous agents inside electronic health record (EHR) systems. That gap makes it hard to measure progress, compare models, and move agentic LLMs toward safe clinical use.
Main Contribution
MedAgentBench dataset: 300 clinician-written, verifiable tasks across 10 categories covering common EHR actions.
Interactive FHIR environment: 100 deidentified patient profiles (~785k records), HAPI FHIR server, Docker image, and standard API hooks to run agents.
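Because the environment exposes a FHIR REST API, an agent's read actions reduce to standard FHIR search URLs of the form `[base]/[ResourceType]?param=value`. The sketch below only builds such a URL and sends nothing over the network. The base URL is an assumption (a locally running HAPI FHIR container), and the patient identifier and code value are placeholders rather than values from the benchmark.

```python
from typing import Dict
from urllib.parse import urlencode

# Assumption: the Docker image serves FHIR at this local address.
FHIR_BASE = "http://localhost:8080/fhir"


def fhir_search_url(resource_type: str, params: Dict[str, str],
                    base: str = FHIR_BASE) -> str:
    """Build a FHIR search URL such as [base]/Observation?patient=...&code=..."""
    return f"{base.rstrip('/')}/{resource_type}?{urlencode(params)}"


# Placeholder identifiers, for illustration only.
url = fhir_search_url("Observation", {"patient": "example-id", "code": "GLU"})
print(url)
```

Keeping URL construction in one helper makes it easier to log and audit exactly what an agent requested from the server.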
Key Findings
The top-performing model (Claude 3.5 Sonnet v2) reached a 69.67% overall success rate: substantial, but well short of perfect.
Information-retrieval (query) tasks are easier for models than tasks that modify records (action tasks).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| models evaluated | 12 | — | — | MedAgentBench | Twelve state-of-the-art LLMs were benchmarked | Section 2.4.2 and Table 3 |
| best overall success rate | 69.67% | — | — | MedAgentBench overall | Claude 3.5 Sonnet v2 overall SR = 69.67% | Table 3 |
What To Try In 7 Days
Run MedAgentBench on your candidate LLMs to identify query vs action weaknesses.
Start automations with read-only tasks (reports/lookup) where models perform better.
Add strict output-schema checks and logging before enabling any write actions.
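The last two steps can be combined into a small gate that sits between the agent and the EHR API: it validates each proposed action against a fixed schema, logs the decision, and rejects writes unless they are explicitly enabled. This is a minimal sketch under assumptions: the JSON action shape (`method`, `url`, optional `body`) and the function name `check_action` are hypothetical and are not part of MedAgentBench.

```python
import json
import logging
from typing import Any, Dict

logger = logging.getLogger("ehr_agent_gate")

REQUIRED_KEYS = {"method", "url"}  # hypothetical action schema


def check_action(action_json: str, allow_writes: bool = False) -> Dict[str, Any]:
    """Validate a proposed agent action; raise ValueError on any violation."""
    try:
        action = json.loads(action_json)
    except json.JSONDecodeError as exc:
        logger.warning("rejected: not valid JSON (%s)", exc)
        raise ValueError("action is not valid JSON") from exc

    if not isinstance(action, dict) or REQUIRED_KEYS - action.keys():
        logger.warning("rejected: missing required keys")
        raise ValueError("action must be an object with 'method' and 'url'")

    # Read-only by default; POST is permitted only when writes are enabled.
    allowed = {"GET", "POST"} if allow_writes else {"GET"}
    if action["method"] not in allowed:
        logger.warning("rejected: method %r not allowed", action["method"])
        raise ValueError(f"method {action['method']!r} is not allowed")

    if action["method"] == "POST" and not isinstance(action.get("body"), dict):
        logger.warning("rejected: POST without a JSON object body")
        raise ValueError("POST requires a JSON object body")

    logger.info("accepted %s %s", action["method"], action["url"])
    return action
```

Defaulting `allow_writes` to `False` means a deployment starts read-only and has to opt in to write actions deliberately.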
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Risks & Boundaries
Limitations
Environment simulates EHR via FHIR but lacks production security, logging, and enterprise integrations.
Patient profiles come from a single institution (STARR) and may not generalize.
When Not To Use
Do not use this as a drop-in test for production safety or compliance without extra security measures.
Not suitable for multimodal tasks (images, waveforms) or real-time clinical decision making without human review.
Failure Modes
Invalid API formatting or malformed GET/POST requests.
Wrong output format (free text vs expected structured value).
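Both failure modes can be detected automatically when reviewing agent transcripts. The sketch below assumes a hypothetical text protocol in which the agent replies with `GET <url>`, `POST <url>` followed by a JSON body, or `FINISH(<JSON list>)`. That protocol and the label names are illustrative assumptions, not a description of the benchmark's grader.

```python
import json
from urllib.parse import urlparse


def classify_response(text: str) -> str:
    """Label an agent reply as 'ok', 'invalid_request', or 'wrong_output_format'."""
    text = text.strip()

    # Final answers must be a structured JSON list, not free text.
    if text.startswith("FINISH(") and text.endswith(")"):
        payload = text[len("FINISH("):-1]
        try:
            answer = json.loads(payload)
        except json.JSONDecodeError:
            return "wrong_output_format"
        return "ok" if isinstance(answer, list) else "wrong_output_format"

    # Otherwise expect "<METHOD> <url>" on the first line.
    first_line, _, rest = text.partition("\n")
    parts = first_line.split(" ", 1)
    if len(parts) != 2 or parts[0] not in {"GET", "POST"}:
        return "invalid_request"

    parsed = urlparse(parts[1].strip())
    if parsed.scheme not in {"http", "https"} or not parsed.netloc:
        return "invalid_request"

    # POST requests must carry a JSON object body after the first line.
    if parts[0] == "POST":
        try:
            body = json.loads(rest)
        except json.JSONDecodeError:
            return "invalid_request"
        if not isinstance(body, dict):
            return "invalid_request"

    return "ok"
```

Counting these labels per model gives a quick view of whether failures come from request formatting or from answer formatting.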

