Overview
Production Readiness
0.3
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
3
Why It Matters For Business
MedAgentBench provides a practical testbed for measuring LLM agents on realistic EHR tasks in a simulated environment, so teams can benchmark readiness before attempting a risky EHR integration.
Summary TLDR
MedAgentBench is a released benchmark and simulated EHR environment that measures how well LLMs can act as medical agents. It includes 300 clinician-written tasks across 10 categories, 100 realistic patient profiles with ~785k records, and a FHIR-compliant API packaged with a Docker image. The authors evaluated 12 LLMs using pass@1 with a maximum of 8 interaction rounds. The best model (Claude 3.5 Sonnet v2) reached 69.67% overall success, and models scored higher on query tasks than on action tasks. The suite is intended for development and research, not production deployment.
Problem Statement
There was no standard, interactive benchmark that tests LLMs as autonomous agents inside electronic health record (EHR) systems. That gap makes it hard to measure progress, compare models, and move agentic LLMs toward safe clinical use.
Main Contribution
MedAgentBench dataset: 300 clinician-written, verifiable tasks across 10 categories covering common EHR actions.
Interactive FHIR environment: 100 deidentified patient profiles (~785k records), HAPI FHIR server, Docker image, and standard API hooks to run agents.
Baseline evaluations: pass@1 evaluation of 12 state-of-the-art LLMs with public results and error analysis to show current capabilities and gaps.
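As a rough illustration of how an agent might address the FHIR environment described above, the sketch below builds a standard FHIR search URL. The base URL, the helper name `search_url`, and the example parameter values are assumptions for illustration only; the benchmark's actual endpoint and identifiers are not given here.

```python
from urllib.parse import urlencode

# Assumed base URL for a locally running FHIR server; not taken from the paper.
BASE_URL = "http://localhost:8080/fhir"


def search_url(resource_type: str, **params: str) -> str:
    """Build a FHIR search URL of the form [base]/[type]?name=value&..."""
    # Sort parameters so the same query always yields the same URL.
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}/{resource_type}" + (f"?{query}" if query else "")
```

A call such as `search_url("Observation", patient="example-id", code="example-code")` would yield a URL an agent could issue as a GET request; the parameter values are placeholders.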
Key Findings
The top-performing model (Claude 3.5 Sonnet v2) reached 69.67% overall success: substantial, but well short of perfect.
Information-retrieval (query) tasks are easier than record-modifying (action) tasks.
Task difficulty and step count strongly affect success.
Common failure modes are formatting and invalid API calls.
Results
models evaluated
12
best overall success rate
69.67% (Claude 3.5 Sonnet v2)
best query vs action gap
records and patient profiles
~785k records, 100 patient profiles
Who Should Care
What To Try In 7 Days
Run MedAgentBench on your candidate LLMs to identify query vs action weaknesses.
Start automations with read-only tasks (reports/lookup) where models perform better.
Add strict output-schema checks and logging before enabling any write actions.
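The schema-check-and-log step in the list above could be sketched as a small gate that inspects a model's proposed write before anything is forwarded to the EHR. Everything in this sketch is hypothetical: the JSON shape (`method`, `payload`), the allow-list of resource types, and the function names are illustrative assumptions, not part of MedAgentBench.

```python
import json
import logging
from typing import Tuple

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("write_gate")

# Hypothetical allow-list; adjust to the resource types a pilot permits.
ALLOWED_RESOURCES = {"Observation", "MedicationRequest", "ServiceRequest"}


def validate_action(raw: str) -> Tuple[bool, str]:
    """Check a model-proposed write against a strict, assumed schema."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(action, dict):
        return False, "top level must be an object"
    if action.get("method") != "POST":
        return False, "only POST writes are gated here"
    payload = action.get("payload")
    if not isinstance(payload, dict):
        return False, "payload must be an object"
    if payload.get("resourceType") not in ALLOWED_RESOURCES:
        return False, "resourceType not in allow-list"
    return True, "ok"


def gate(raw: str) -> bool:
    """Validate a proposal and log the decision either way."""
    ok, reason = validate_action(raw)
    log.info("write proposal accepted=%s reason=%s", ok, reason)
    return ok
```

Rejecting free-text or malformed proposals at this gate targets the formatting failures the benchmark reports, and the log gives reviewers a record of every attempted write.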
Agent Features
Memory
- short-term conversation rounds (max 8)
Planning
- plans by selecting which FHIR function call (GET/POST) to issue next
Tool Use
- FHIR REST API calls
- GET/POST function invocation
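The memory, planning, and tool-use features above can be pictured as a single bounded loop. The sketch below is a minimal, assumed reconstruction: the `GET ...`, `POST ...`, and `FINISH ...` reply formats, the function `run_episode`, and its callback parameters are illustrative, not the authors' orchestrator. Only the cap of 8 rounds comes from the summary.

```python
import json
from typing import Callable, List, Optional

MAX_ROUNDS = 8  # interaction cap reported for the benchmark


def run_episode(
    model: Callable[[List[str]], str],
    fhir_get: Callable[[str], str],
    fhir_post: Callable[[str, dict], str],
    task: str,
) -> Optional[str]:
    """Drive one task; return the final answer, or None on failure."""
    history = [task]
    for _ in range(MAX_ROUNDS):
        reply = model(history).strip()
        history.append(reply)
        if reply.startswith("FINISH "):
            return reply[len("FINISH "):]
        if reply.startswith("GET "):
            history.append(fhir_get(reply[len("GET "):]))
        elif reply.startswith("POST "):
            # Assumed layout: URL on the first line, JSON body after it.
            url, _, body = reply[len("POST "):].partition("\n")
            try:
                payload = json.loads(body)
            except json.JSONDecodeError:
                return None  # malformed request ends the episode
            history.append(fhir_post(url.strip(), payload))
        else:
            return None  # unrecognized action format
    return None  # ran out of rounds without a final answer
```

Passing the model and the two FHIR callables as parameters keeps the loop testable with scripted stand-ins before it is pointed at a real server.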
Frameworks
- AgentBench-inspired orchestrator
Is Agentic
true
Architectures
- single LLM orchestrator (baseline)
Collaboration
- not multi-agent; single-agent baseline
Optimization Features
Infra Optimization
- Docker container for environment deployment
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Environment simulates EHR via FHIR but lacks production security, logging, and enterprise integrations.
- Patient profiles come from a single institution (STARR) and may not generalize.
- Evaluation used pass@1 and a simple orchestrator; other agent designs may behave differently.
- Benchmark focuses on medical-record interactions; it omits surgical, multimodal, and team-coordination tasks.
When Not To Use
- Do not use this as a drop-in test for production safety or compliance without extra security measures.
- Not suitable for multimodal tasks (images, waveforms) or real-time clinical decision making without human review.
- Avoid using benchmarked open-weight results as sole evidence for clinical deployment.
Failure Modes
- Invalid API formatting or malformed GET/POST requests.
- Wrong output format (free text vs expected structured value).
- Incorrect modifications to records (logic or dosing errors).
- High failure rate on multi-step/hard tasks (chaining errors).
Core Entities
Models
- Claude 3.5 Sonnet v2
- GPT-4o
- GPT-4o mini
- o3-mini
- Gemini 2.0 Pro
- Gemini 2.0 Flash
- Gemini 1.5 Pro
- DeepSeek-V3
- Qwen2.5
- Llama 3.3
- Gemma2
- Mistral v0.3
Metrics
- task success rate (SR)
- query SR
- action SR
- difficulty-stratified SR
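The metrics listed above are all success rates over different slices of the task set. A minimal sketch of that computation follows; the function name `success_rates` and the `(group, passed)` input shape are assumptions chosen for illustration, not the benchmark's own scoring code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute overall and per-group success rates from (group, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for group, passed in results:
        # Count each result once overall and once under its own group.
        for key in ("overall", group):
            totals[key] += 1
            passes[key] += int(passed)
    return {key: passes[key] / totals[key] for key in totals}
```

The group label can be a task type (`"query"`, `"action"`) or a difficulty level, which covers the stratified variants. As a sanity check on the headline number, 209 passes out of 300 tasks would round to 69.67%.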
Datasets
- MedAgentBench (this paper)
- STARR deidentified EHR extract (patient source)
Benchmarks
- AgentBench
- AgentClinic
- AgentBoard

