A 30-task benchmark that tests agents on end-to-end ML development workflows

February 3, 2025 · 6 min

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and reproducible with public code and traces, but evaluations are limited to a few agents and single-run aggregates, so conclusions should be validated per team and model.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Harshith Padigela, Chintan Shah, Dinkar Juyal


Why It Matters For Business

Agents can automate well-scoped ML tasks (data prep, basic debugging) today but fail at open-ended model improvement; firms should use agents to speed routine work and keep human oversight for strategic model changes.


Summary TLDR

ML-Dev-Bench is an open-source benchmark of 30 practical ML development tasks (dataset handling, training, debugging, model implementation, API integration, and performance tuning). The paper runs three agent families (ReAct, OpenHands, AIDE) with several LLM backends and reports per-task binary success. OpenHands with Claude Sonnet performs best (50% overall). All agents fail open-ended model-improvement tasks (0% success). The repo and run traces are public for reproduction and extension.

Problem Statement

Existing LLM and agent benchmarks test isolated coding or Kaggle-style problems but fail to capture end-to-end ML development: working on existing codebases, long-running training, tool integrations, debugging across files, and iterative performance tuning. The paper fills that gap with a practical, runnable 30-task suite.

Main Contribution

A 30-task benchmark (ML-Dev-Bench) that mimics real ML development workflows and spans dataset handling, training, debugging, implementation, API integration, and performance tuning.

An evaluation framework (Calipers) that runs agents, captures artifacts, and applies binary validation checks.
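As a rough illustration of what a binary validation check could look like, the sketch below scores a task as pass or fail from the artifacts an agent leaves in its working directory. The `TaskResult` type, the `validate_artifacts` function, the file names, and the accuracy threshold are all hypothetical; they are not taken from Calipers, whose actual API may differ.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TaskResult:
    task_id: str
    success: bool  # binary outcome, no partial credit
    reason: str


def validate_artifacts(task_id: str, workdir: Path, required_files: list[str],
                       min_accuracy: float) -> TaskResult:
    """Hypothetical check: every artifact exists and reported accuracy clears a bar."""
    for name in required_files:
        if not (workdir / name).is_file():
            return TaskResult(task_id, False, f"missing artifact: {name}")

    metrics_path = workdir / "metrics.json"  # assumed file name
    if not metrics_path.is_file():
        return TaskResult(task_id, False, "missing artifact: metrics.json")

    accuracy = json.loads(metrics_path.read_text()).get("accuracy", 0.0)
    if accuracy < min_accuracy:
        return TaskResult(task_id, False,
                          f"accuracy {accuracy:.2f} below {min_accuracy:.2f}")
    return TaskResult(task_id, True, "all checks passed")


def demo() -> tuple[TaskResult, TaskResult]:
    """Run the check twice: once with a missing checkpoint, once with it present."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "metrics.json").write_text(json.dumps({"accuracy": 0.91}))
        missing = validate_artifacts("train_demo", workdir, ["model.pt"], 0.85)
        (workdir / "model.pt").write_bytes(b"stub")  # stand-in checkpoint
        passed = validate_artifacts("train_demo", workdir, ["model.pt"], 0.85)
        return missing, passed
```

The `reason` field is kept alongside the boolean so that a failed run still says which check it tripped, even though only `success` counts toward the score.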

Key Findings

OpenHands (Claude Sonnet) achieved the highest overall success rate.

Numbers: 50% (15/30 tasks)

Practical Use: If you want a generalist agent for routine ML dev tasks today, start with Openhands+Claude Sonnet and test it on your repo.

Evidence Ref: Table 2; Section 6

ReAct (Claude Sonnet) performed close behind on common tasks.

Numbers: 47% (14/30 tasks)

Practical Use: A simpler ReAct agent can handle many well-scoped tasks; tune its long-run orchestration and verification policy.

Evidence Ref: Table 2; Section 6.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Overall success rate (ReAct-Sonnet) | 47% (14/30) | - | - | 30-task ML-Dev-Bench | ReAct-Sonnet overall success | Table 2
Overall success rate (OpenHands-Sonnet) | 50% (15/30) | - | - | 30-task ML-Dev-Bench | OpenHands-Sonnet overall success | Table 2
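The two percentages follow directly from the task counts. The short sketch below reproduces the arithmetic; the `success_rate` helper and the dictionary layout are illustrative names, not part of the benchmark code.

```python
def success_rate(passed: int, total: int) -> float:
    """Return the binary success rate as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


# (tasks passed, tasks attempted) as reported in the results above
results = {"OpenHands-Sonnet": (15, 30), "ReAct-Sonnet": (14, 30)}

rates = {agent: success_rate(p, t) for agent, (p, t) in results.items()}
gap_points = rates["OpenHands-Sonnet"] - rates["ReAct-Sonnet"]

for agent, rate in rates.items():
    print(f"{agent}: {rate:.1f}%")
print(f"gap: {gap_points:.1f} percentage points")
```

With 30 tasks, each task is worth about 3.3 percentage points, so the gap between the two top configurations is a single task.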

What To Try In 7 Days

Run ML-Dev-Bench's dataset and training tasks against your agent to measure real costs and gaps (the tasks are in the public repo).

Automate dataset download and preprocessing pipelines using an agent, then add unit tests to validate outputs.

Measure token cost per long-running job and enable prompt caching or lower-cost model backends for expensive tasks.
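One way to approach the token-cost measurement is sketched below. The `Usage` type and `job_cost_usd` function are hypothetical, and the per-million-token prices are placeholder values chosen only for the arithmetic; substitute your provider's actual rates and its actual rules for billing cached input.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int          # all prompt tokens sent during the job
    cached_input_tokens: int   # subset of input_tokens served from the prompt cache
    output_tokens: int


def job_cost_usd(usage: Usage, input_price: float, cached_price: float,
                 output_price: float) -> float:
    """Cost of one job; prices are in USD per million tokens (placeholders)."""
    fresh = usage.input_tokens - usage.cached_input_tokens
    if fresh < 0:
        raise ValueError("cached_input_tokens cannot exceed input_tokens")
    total = (fresh * input_price
             + usage.cached_input_tokens * cached_price
             + usage.output_tokens * output_price)
    return total / 1_000_000


# Placeholder numbers for a single long-running job, not measured values.
usage = Usage(input_tokens=2_000_000, cached_input_tokens=1_500_000,
              output_tokens=100_000)
with_cache = job_cost_usd(usage, input_price=3.0, cached_price=0.3, output_price=15.0)

# Same job with caching disabled: every input token is billed at the full rate.
no_cache_usage = Usage(usage.input_tokens, 0, usage.output_tokens)
without_cache = job_cost_usd(no_cache_usage, input_price=3.0, cached_price=0.3,
                             output_price=15.0)
```

Comparing `with_cache` against `without_cache` per task shows where caching or a cheaper backend pays off most.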

Agent Features

Memory
Short-term step-limited execution (50-step cap), Prompt caching for token efficiency
Planning
Action planning via tool calls, Background process orchestration (spawn/sleep/monitor), Tree-search solution planning (AIDE)
Tool Use
Shell and spawn commands, File create/edit/list tools, WandB logging integration
Frameworks
LangGraph, Composio, OpenHands, AIDE, Calipers, litellm
Is Agentic

Yes

Architectures
LLM-driven agents (Claude Sonnet, GPT-4o, Gemini)
Collaboration
Not evaluated (single-agent runs)

Optimization Features

Token Efficiency
Prompt caching reduces cost in OpenHands
Infra Optimization
Use of background process tools to run long jobs


Risks & Boundaries

Limitations

Binary success metric hides partial progress and useful intermediate artifacts

Evaluations are single-run aggregates; variance across runs not reported
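To get a feel for how much uncertainty a 30-task score carries on its own, the sketch below computes a Wilson score interval for 15 successes out of 30. This is not an analysis from the paper, and it does not measure the run-to-run variance mentioned above, which needs repeated runs. It assumes the tasks behave like independent pass/fail trials, which is a simplification.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


low, high = wilson_interval(15, 30)
print(f"15/30 success: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans roughly 33% to 67%, which is far wider than the one-task gap between the top two agents.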

When Not To Use

For evaluating low-level model research or algorithmic innovations

To judge incremental partial progress where intermediate outputs matter

Failure Modes

Excessive verification prompts causing extra tokens and stalls

Premature termination before long-running tasks finish
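Premature termination can be guarded against by polling a spawned job until it actually exits instead of assuming it has finished. The sketch below shows one minimal way to do that with the standard library; `wait_for_job` and its parameters are hypothetical names, and the child process is a short sleep standing in for a training run. It is not how any of the evaluated agents implement their spawn/sleep/monitor tools.

```python
import subprocess
import sys
import time


def wait_for_job(cmd: list[str], poll_seconds: float, timeout_seconds: float) -> int:
    """Spawn cmd, poll until it exits, and return its exit code.

    If the deadline passes first, the process is killed and TimeoutError is raised.
    """
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + timeout_seconds
    while True:
        code = proc.poll()  # None while the process is still running
        if code is not None:
            return code
        if time.monotonic() >= deadline:
            proc.kill()
            proc.wait()  # reap the killed process
            raise TimeoutError(f"job exceeded {timeout_seconds} seconds")
        time.sleep(poll_seconds)


def demo() -> int:
    # Stand-in for a long training job: a child Python process that sleeps briefly.
    job = [sys.executable, "-c", "import time; time.sleep(0.3)"]
    return wait_for_job(job, poll_seconds=0.05, timeout_seconds=10.0)
```

An explicit timeout keeps the loop from waiting forever, and the returned exit code lets the caller distinguish a finished job from a crashed one before declaring the task done.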

Core Entities

Models

Claude Sonnet 3.5 (October 2024 release), OpenAI GPT-4o, Gemini 2.0 Flash, GPT-4o (used with AIDE), o1/o3 (mentioned as future work)

Metrics

binary success rate, token usage, token cost ($), number of steps

Datasets

Noisy Imagenette, Imagenette, CIFAR-10, CIFAR-100, TinyBERT Eval, Segmentation datasets (unspecified)

Benchmarks

SWE-Bench, ML-Bench, MLE-Bench, MLAgentBench, ML-Dev-Bench (this paper)