Overview
Production Readiness
0.4
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
Agents can automate well-scoped ML tasks (data prep, basic debugging) today but fail at open-ended model improvement; firms should use agents to speed routine work and keep human oversight for strategic model changes.
Summary TLDR
ML-Dev-Bench is an open-source benchmark of 30 practical ML development tasks covering dataset handling, training, debugging, model implementation, API integration, and performance tuning. The paper runs three agent families (ReAct, OpenHands, AIDE) with several LLM backends and reports binary success per task. OpenHands with Claude Sonnet performs best, at 50% overall success. No agent succeeds at any open-ended model-improvement task (0% success). The repo and run traces are public for reproduction and extension.
Problem Statement
Existing LLM and agent benchmarks test isolated coding or Kaggle-style problems but fail to capture end-to-end ML development: working on existing codebases, long-running training, tool integrations, debugging across files, and iterative performance tuning. The paper fills that gap with a practical, runnable 30-task suite.
Main Contribution
A 30-task benchmark (ML-Dev-Bench) that mimics real ML development workflows and spans dataset handling, training, debugging, implementation, API integration, and performance tuning.
An evaluation framework (Calipers) that runs agents, captures artifacts, and applies binary validation checks.
A comparative evaluation of three agent setups (ReAct, OpenHands, AIDE) across multiple LLM backends, with token and cost measurements and published traces.
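The binary scoring described above can be sketched as follows. This is a minimal illustration, not the Calipers API: the `TaskResult` record, the `success_rates` helper, the task ids, and the category names are all hypothetical, and the demo outcomes are made up rather than taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One agent run on one task (hypothetical record, not the Calipers schema)."""
    task_id: str
    category: str
    success: bool  # outcome of the task's binary validation check


def success_rates(results):
    """Return (overall, per_category) success rates as fractions in [0, 1]."""
    if not results:
        raise ValueError("need at least one task result")
    per_category = defaultdict(list)
    for r in results:
        per_category[r.category].append(r.success)
    rates = {cat: sum(flags) / len(flags) for cat, flags in per_category.items()}
    overall = sum(r.success for r in results) / len(results)
    return overall, rates


# Made-up outcomes for illustration only; these are not results from the paper.
demo = [
    TaskResult("noisy_dataset_download", "dataset_handling", True),
    TaskResult("dataset_preprocess", "dataset_handling", True),
    TaskResult("fix_shape_error", "debugging", False),
    TaskResult("improve_baseline", "performance", False),
]
overall, by_category = success_rates(demo)
print(f"overall: {overall:.0%}")
for category, rate in sorted(by_category.items()):
    print(f"{category}: {rate:.0%}")
```

Because each task contributes a single pass/fail flag, a category score is just the fraction of its tasks that passed, which is why partial progress on a failed task does not show up in the totals.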
Key Findings
OpenHands (Claude Sonnet) achieved the highest overall success rate, at 50%.
ReAct (Claude Sonnet) performed close behind on common tasks.
No agent succeeded at any open-ended model-improvement task (0% success).
Agents reliably solve narrowly scoped tasks like dataset handling.
Token/cost patterns vary by agent and task type.
Results
Overall success rate (ReAct-Sonnet)
Overall success rate (OpenHands-Sonnet): 50%
Model Performance category success: 0% (all agents)
Dataset Handling success (ReAct & OpenHands)
Example token cost (ChannelViT)
Who Should Care
What To Try In 7 Days
Run ML-Dev-Bench's dataset and training tasks on your agent to measure real costs and gaps (link in repo).
Automate dataset download and preprocessing pipelines using an agent, then add unit tests to validate outputs.
Measure token cost per long-running job and enable prompt caching or lower-cost model backends for expensive tasks.
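The cost measurement suggested above can be sketched as below. The `Usage` and `Pricing` types and the `job_cost` function are hypothetical, and the prices are placeholders, not any provider's real rates. The sketch also assumes that cached prompt tokens are billed at a separate, lower rate, which depends on the provider.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int          # prompt tokens billed at the normal rate
    cached_input_tokens: int   # prompt tokens served from a prompt cache
    output_tokens: int


@dataclass
class Pricing:
    """USD per million tokens. Fill in from your provider's price list."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


def job_cost(usage, pricing):
    """Return the USD cost of one job under the given pricing."""
    total = (
        usage.input_tokens * pricing.input_per_m
        + usage.cached_input_tokens * pricing.cached_input_per_m
        + usage.output_tokens * pricing.output_per_m
    )
    return total / 1_000_000


# Placeholder prices and token counts, chosen only to make the arithmetic visible.
placeholder = Pricing(input_per_m=1.0, cached_input_per_m=0.1, output_per_m=5.0)
no_cache = Usage(input_tokens=2_000_000, cached_input_tokens=0, output_tokens=100_000)
with_cache = Usage(input_tokens=500_000, cached_input_tokens=1_500_000, output_tokens=100_000)

print(f"without caching: ${job_cost(no_cache, placeholder):.2f}")
print(f"with caching:    ${job_cost(with_cache, placeholder):.2f}")
```

Logging a `Usage` record per task makes it easy to see which long-running jobs dominate spend before deciding where caching or a cheaper backend is worth it.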
Agent Features
Memory
- Short-term step-limited execution (50-step cap)
- Prompt caching for token efficiency
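A step cap like the one listed above can be sketched as a bounded loop. This is an illustration only: `policy` is a hypothetical stand-in for an LLM call plus tool execution, and the string actions are placeholders rather than any framework's action format.

```python
def run_with_step_cap(policy, max_steps=50):
    """Call policy(history) until it returns "done" or the cap is reached.

    Returns (finished, history), where finished is False if the cap was hit.
    """
    history = []
    for _ in range(max_steps):
        action = policy(history)
        history.append(action)
        if action == "done":
            return True, history
    return False, history


def quick_policy(history):
    # Hypothetical agent that finishes after three units of work.
    return "done" if len(history) >= 3 else "work"


def stuck_policy(history):
    # Hypothetical agent that keeps re-verifying and never declares completion.
    return "verify"


finished, steps = run_with_step_cap(quick_policy)
capped, capped_steps = run_with_step_cap(stuck_policy)
print(finished, len(steps))
print(capped, len(capped_steps))
```

The second case shows how an agent that loops on verification exhausts its budget without finishing, which a binary check then records as a failure.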
Planning
- Action planning via tool calls
- Background process orchestration (spawn/sleep/monitor)
- Tree-search solution planning (AIDE)
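The spawn/sleep/monitor pattern listed above can be sketched with the Python standard library. This is not the tooling used by any of the evaluated agents; `run_in_background` is a hypothetical helper, and the child command is a stand-in for a real training job.

```python
import subprocess
import sys
import tempfile
import time


def run_in_background(cmd, poll_seconds=0.1, timeout_seconds=30.0):
    """Spawn cmd, poll until it exits or times out, return (exit_code, output)."""
    # Output goes to a temporary file so a chatty job cannot fill a pipe buffer.
    with tempfile.TemporaryFile(mode="w+") as log:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        deadline = time.monotonic() + timeout_seconds
        while proc.poll() is None:          # monitor: has the job exited yet?
            if time.monotonic() > deadline:
                proc.kill()
                proc.wait()
                raise TimeoutError(f"job exceeded {timeout_seconds} seconds")
            time.sleep(poll_seconds)        # sleep instead of busy-waiting
        log.seek(0)
        return proc.returncode, log.read()


# Stand-in for a long-running training command.
job = [sys.executable, "-c", "import time; time.sleep(0.2); print('training finished')"]
exit_code, output = run_in_background(job)
print(exit_code, output.strip())
```

Polling until the process actually exits, rather than returning after it is launched, is one way to avoid the premature-termination failure on long-running tasks.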
Tool Use
- Shell and spawn commands
- File create/edit/list tools
- WandB logging integration
Frameworks
- LangGraph
- Composio
- OpenHands
- AIDE
- Calipers
- litellm
Is Agentic
true
Architectures
- LLM-driven agents (Claude Sonnet, GPT-4o, Gemini)
Collaboration
- Not evaluated (single-agent runs)
Optimization Features
Token Efficiency
- Prompt caching reduces cost in OpenHands
Infra Optimization
- Use of background process tools to run long jobs
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Binary success metric hides partial progress and useful intermediate artifacts
- Evaluations are single-run aggregates; variance across runs not reported
- Only a few agents and LLM backends tested; results may not generalize to other models
- Open-ended model-improvement tasks are under-specified for automated judging
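The run-to-run variance noted above could be estimated by repeating the benchmark and summarizing the per-run success rates. The sketch below is a suggestion, not part of the paper's method; `summarize_runs` is a hypothetical helper and the outcomes are made up.

```python
import statistics


def summarize_runs(run_outcomes):
    """run_outcomes: list of runs, each a non-empty list of per-task booleans.

    Returns (mean_rate, sample_stdev) of the per-run success rates.
    """
    if not run_outcomes:
        raise ValueError("need at least one run")
    rates = [sum(run) / len(run) for run in run_outcomes]
    mean_rate = statistics.mean(rates)
    # Sample standard deviation needs at least two runs.
    spread = statistics.stdev(rates) if len(rates) > 1 else 0.0
    return mean_rate, spread


# Made-up outcomes for three repeated runs over the same four tasks.
runs = [
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, False],
]
mean_rate, spread = summarize_runs(runs)
print(f"mean success rate: {mean_rate:.2f}, stdev across runs: {spread:.2f}")
```

Reporting the spread next to the mean shows whether a gap between two agents is larger than the noise between repeated runs of the same agent.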
When Not To Use
- For evaluating low-level model research or algorithmic innovations
- To judge incremental partial progress where intermediate outputs matter
Failure Modes
- Excessive verification prompts causing extra tokens and stalls
- Premature termination before long-running tasks finish
- Failing to follow artifact-creation instructions
- Incorrect file edits, including modifying tests despite instructions not to
Core Entities
Models
- Claude 3.5 Sonnet (October 2024 release)
- OpenAI GPT-4o (also the backend used with AIDE)
- Gemini 2.0 Flash
- o1/o3 (mentioned as future work)
Metrics
- binary success rate
- token usage
- token cost ($)
- number of steps
Datasets
- Noisy Imagenette
- Imagenette
- CIFAR-10
- CIFAR-100
- TinyBERT Eval
- Segmentation datasets (unspecified)
Benchmarks
- SWE-Bench
- ML-Bench
- MLE-Bench
- MLAgentBench
- ML-Dev-Bench (this paper)

