Overview
The benchmark is practical and reproducible with public code and traces, but evaluations are limited to a few agents and single-run aggregates, so conclusions should be validated per team and model.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 50%
Why It Matters For Business
Agents can automate well-scoped ML tasks (data prep, basic debugging) today but fail at open-ended model improvement; firms should use agents to speed routine work and keep human oversight for strategic model changes.
Who Should Care
Summary TLDR
ML-Dev-Bench is an open-source benchmark of 30 practical ML development tasks (dataset handling, training, debugging, model implementation, API integration, and performance tuning). The paper runs three agent families (ReAct, OpenHands, AIDE) with several LLM backends and reports per-task binary success. OpenHands with Claude Sonnet performs best (50% overall). All agents fail the open-ended model-improvement tasks (0% success). The repo and run traces are public for reproduction and extension.
Problem Statement
Existing LLM and agent benchmarks test isolated coding or Kaggle-style problems but fail to capture end-to-end ML development: working on existing codebases, long-running training, tool integrations, debugging across files, and iterative performance tuning. The paper fills that gap with a practical, runnable 30-task suite.
Main Contribution
A 30-task benchmark (ML-Dev-Bench) that mimics real ML development workflows and spans dataset handling, training, debugging, implementation, API integration, and performance tuning.
An evaluation framework (Calipers) that runs agents, captures artifacts, and applies binary validation checks.
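To make "binary validation checks" concrete, here is a minimal sketch of how a pass/fail check over an agent's output directory could look. It is not the Calipers API: the names `Check`, `file_exists`, `metric_at_least`, and `validate_task`, and the artifact names used with them, are hypothetical and chosen only for illustration.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical check type: receives the agent's working directory, returns pass/fail.
Check = Callable[[Path], bool]


def file_exists(name: str) -> Check:
    """Pass if the agent produced the named artifact."""
    return lambda workdir: (workdir / name).is_file()


def metric_at_least(name: str, key: str, threshold: float) -> Check:
    """Pass if a JSON metrics file reports key >= threshold."""
    def check(workdir: Path) -> bool:
        try:
            metrics = json.loads((workdir / name).read_text())
            return float(metrics[key]) >= threshold
        except (OSError, ValueError, KeyError, TypeError):
            # A missing or malformed artifact counts as a failure, not a crash.
            return False
    return check


def validate_task(workdir: Path, checks: list[Check]) -> bool:
    """A task succeeds only if every check passes (no partial credit)."""
    return all(check(workdir) for check in checks)
```

Because the result is a single boolean, a run that trains a model but misses the metric threshold scores the same as a run that produces nothing.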
Key Findings
OpenHands (Claude Sonnet) achieved the highest overall success rate: 50% (15/30 tasks).
ReAct (Claude Sonnet) was close behind at 47% (14/30 tasks).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Overall success rate (ReAct-Sonnet) | 47% (14/30) | — | — | 30-task ML-Dev-Bench | ReAct-Sonnet overall success | Table 2 |
| Overall success rate (OpenHands-Sonnet) | 50% (15/30) | — | — | 30-task ML-Dev-Bench | OpenHands-Sonnet overall success | Table 2 |
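
The two rates in the table differ by a single task out of 30. As a rough way to gauge how much weight that gap can bear, the sketch below computes a Wilson score interval for each rate. It assumes tasks can be treated as independent pass/fail trials, which is a simplification and not something the paper claims; the function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Counts taken from the results table.
openhands = wilson_interval(15, 30)
react = wilson_interval(14, 30)
print(f"OpenHands-Sonnet: {openhands[0]:.2f} to {openhands[1]:.2f}")
print(f"ReAct-Sonnet:     {react[0]:.2f} to {react[1]:.2f}")
```

With only 30 tasks, each interval spans tens of percentage points and the two overlap heavily, so the ranking between these two agents should be read as tentative.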
What To Try In 7 Days
Run ML-Dev-Bench's dataset and training tasks on your agent to measure real costs and gaps (the tasks and run traces are in the public repo).
Automate dataset download and preprocessing pipelines using an agent, then add unit tests to validate outputs.
Measure token cost per long-running job and enable prompt caching or lower-cost model backends for expensive tasks.
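For the token-cost item, a minimal sketch of per-job accounting is shown here. It assumes your agent logs input, cached-input, and output token counts for each model call and that cached input is billed at a lower rate; the `Usage` and `Pricing` types are hypothetical, and the prices in the usage example are placeholders, not any provider's real rates.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int          # total prompt tokens for one model call
    cached_input_tokens: int   # portion of input_tokens served from the prompt cache
    output_tokens: int


@dataclass
class Pricing:
    # Prices per million tokens; fill in your provider's actual rates.
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


def job_cost(calls: list[Usage], pricing: Pricing) -> float:
    """Total cost of one agent job, summed over all of its model calls."""
    total = 0.0
    for u in calls:
        fresh_input = u.input_tokens - u.cached_input_tokens
        total += fresh_input * pricing.input_per_mtok
        total += u.cached_input_tokens * pricing.cached_input_per_mtok
        total += u.output_tokens * pricing.output_per_mtok
    return total / 1_000_000


# Placeholder prices, for illustration only.
prices = Pricing(input_per_mtok=3.0, cached_input_per_mtok=0.3, output_per_mtok=15.0)
without_cache = job_cost([Usage(1_000_000, 0, 100_000)], prices)
with_cache = job_cost([Usage(1_000_000, 500_000, 100_000)], prices)
print(f"no caching: {without_cache:.2f}, half of input cached: {with_cache:.2f}")
```

Comparing the two totals for the same job gives a quick estimate of what prompt caching would save before you change any agent configuration.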
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
Risks & Boundaries
Limitations
Binary success metric hides partial progress and useful intermediate artifacts
Evaluations are single-run aggregates; variance across runs not reported
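One way to address the single-run limitation in your own evaluation is to repeat each task several times and report the spread. The sketch below assumes you have collected per-task pass/fail outcomes over the same number of runs; the function name `summarize_runs`, the result layout, and the task names in the example are assumptions, not part of the benchmark.

```python
from statistics import mean, stdev


def summarize_runs(results: dict[str, list[bool]]) -> dict:
    """Aggregate repeated runs: results maps task name -> outcome of each run."""
    run_counts = {len(outcomes) for outcomes in results.values()}
    if len(run_counts) != 1:
        raise ValueError("every task needs the same, non-zero number of runs")
    k = run_counts.pop()
    if k == 0:
        raise ValueError("every task needs the same, non-zero number of runs")

    # Overall success rate of each run, i.e. the number a single-run table reports.
    per_run = [
        sum(outcomes[i] for outcomes in results.values()) / len(results)
        for i in range(k)
    ]
    # Fraction of runs in which each task passed: a simple partial-credit view.
    task_pass_rate = {task: sum(o) / k for task, o in results.items()}
    flaky = sorted(t for t, r in task_pass_rate.items() if 0 < r < 1)

    return {
        "mean_success_rate": mean(per_run),
        "stdev_success_rate": stdev(per_run) if k > 1 else 0.0,
        "task_pass_rate": task_pass_rate,
        "flaky_tasks": flaky,
    }


# Hypothetical outcomes for four tasks over three runs.
example = {
    "dataset_download": [True, True, True],
    "training_resume": [True, False, True],
    "improve_baseline": [False, False, False],
    "shape_mismatch_debug": [False, True, True],
}
summary = summarize_runs(example)
print(summary["mean_success_rate"], summary["flaky_tasks"])
```

Tasks that pass in some runs but not others are the ones where a single-run binary score is least informative.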
When Not To Use
For evaluating low-level model research or algorithmic innovations
To judge incremental partial progress where intermediate outputs matter
Failure Modes
Excessive verification prompts causing extra tokens and stalls
Premature termination before long-running tasks finish

