A 30-task benchmark that tests agents on end-to-end ML development workflows

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and reproducible with public code and traces, but evaluations are limited to a few agents and single-run aggregates, so conclusions should be validated per team and model.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Harshith Padigela, Chintan Shah, Dinkar Juyal

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Agents can automate well-scoped ML tasks (data prep, basic debugging) today but fail at open-ended model improvement; firms should use agents to speed routine work and keep human oversight for strategic model changes.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager

Summary TLDR

ML-Dev-Bench is an open-source benchmark of 30 practical ML development tasks (dataset handling, training, debugging, model implementation, API integration, and performance tuning). The paper runs three agent families (ReAct, OpenHands, AIDE) with several LLM backends and reports binary success per task. OpenHands with Claude Sonnet performs best (50% overall). All agents fail the open-ended model-improvement tasks (0% success). The repo and run traces are public for reproduction and extension.

Problem Statement

Existing LLM and agent benchmarks test isolated coding or Kaggle-style problems but fail to capture end-to-end ML development: working on existing codebases, long-running training, tool integrations, debugging across files, and iterative performance tuning. The paper fills that gap with a practical, runnable 30-task suite.

Main Contribution

A 30-task benchmark (ML-Dev-Bench) that mimics real ML development workflows and spans dataset handling, training, debugging, implementation, API integration, and performance tuning.

An evaluation framework (Calipers) that runs agents, captures artifacts, and applies binary validation checks.
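A binary validation check of the kind described above can be sketched as follows. This is an illustration only, not Calipers' actual API: `ValidationResult`, `validate_training_artifacts`, the `metrics.json` artifact name, and the `val_accuracy` key are all hypothetical. The point it shows is that a task either passes every check or fails, with no partial credit.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ValidationResult:
    success: bool  # binary outcome, no partial credit
    reason: str    # short explanation to store alongside the run trace


def validate_training_artifacts(workdir: Path, min_accuracy: float) -> ValidationResult:
    """Hypothetical check: pass only if a metrics file exists and meets a threshold."""
    metrics_path = workdir / "metrics.json"  # assumed artifact name
    if not metrics_path.is_file():
        return ValidationResult(False, "metrics.json missing")
    try:
        metrics = json.loads(metrics_path.read_text())
    except json.JSONDecodeError:
        return ValidationResult(False, "metrics.json is not valid JSON")
    if not isinstance(metrics, dict):
        return ValidationResult(False, "metrics.json is not a JSON object")
    accuracy = metrics.get("val_accuracy")  # assumed metric key
    if not isinstance(accuracy, (int, float)):
        return ValidationResult(False, "val_accuracy missing or non-numeric")
    if accuracy < min_accuracy:
        return ValidationResult(False, f"val_accuracy {accuracy:.3f} below {min_accuracy:.3f}")
    return ValidationResult(True, "all checks passed")
```

Returning a reason string next to the boolean keeps the score binary while still leaving a hint about why a run failed.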

Key Findings

OpenHands (Claude Sonnet) achieved the highest overall success rate.

Numbers: 50% (15/30 tasks)

Practical Use: If you want a generalist agent for routine ML dev tasks today, start with OpenHands + Claude Sonnet and test it on your own repo.

Evidence Ref: Table 2; Section 6

ReAct (Claude Sonnet) performed close behind on common tasks.

Numbers: 47% (14/30 tasks)

Practical Use: A simpler ReAct agent can handle many well-scoped tasks; tune its long-run orchestration and verification policy.

Evidence Ref: Table 2; Section 6.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall success rate (ReAct-Sonnet)	47% (14/30)	—	—	30-task ML-Dev-Bench	ReAct-Sonnet overall success	Table 2
Overall success rate (OpenHands-Sonnet)	50% (15/30)	—	—	30-task ML-Dev-Bench	OpenHands-Sonnet overall success	Table 2

What To Try In 7 Days

Run ML-Dev-Bench's dataset and training tasks on your agent to measure real costs and gaps (link in repo).

Automate dataset download and preprocessing pipelines using an agent, then add unit tests to validate outputs.

Measure token cost per long-running job and enable prompt caching or lower-cost model backends for expensive tasks.
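To make the token-cost measurement concrete, here is a minimal sketch of a per-job cost calculation that accounts for prompt caching. The `Pricing` class and `job_cost` function are hypothetical helpers, and every price must come from your own provider's rate card; none of this is taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD; fill in your provider's rates."""
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


def job_cost(input_tokens: int, cached_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Return the USD cost of one job.

    `cached_tokens` is the portion of `input_tokens` served from the prompt cache.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    uncached = input_tokens - cached_tokens
    total = (
        uncached * pricing.input_per_mtok
        + cached_tokens * pricing.cached_input_per_mtok
        + output_tokens * pricing.output_per_mtok
    )
    return total / 1_000_000
```

Calling `job_cost` twice on the same job, once with `cached_tokens=0` and once with the observed cache hits, gives a direct estimate of what caching saves.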

Agent Features

Memory

Short-term step-limited execution (50-step cap)

Prompt caching for token efficiency
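The step cap can be pictured as a bounded loop around the agent's act/observe cycle. The sketch below is an assumption about the general shape of such a loop, not code from any of the evaluated agents; `run_agent` and `step_fn` are hypothetical names.

```python
from typing import Callable, Optional, Tuple

MAX_STEPS = 50  # step cap listed above


def run_agent(
    step_fn: Callable[[int], Optional[str]],
    max_steps: int = MAX_STEPS,
) -> Tuple[Optional[str], int]:
    """Call `step_fn` until it returns a final answer or the step budget runs out.

    `step_fn` is a hypothetical stand-in for one think/act/observe cycle:
    it returns None to continue, or a string once the agent declares it is done.
    Returns the final answer (or None) and the number of steps used.
    """
    for step in range(1, max_steps + 1):
        result = step_fn(step)
        if result is not None:
            return result, step
    return None, max_steps  # budget exhausted without a final answer
```

Under a cap like this, a long training job has to finish, or be handed off to a background process, before the budget is spent.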

Planning

Action planning via tool calls

Background process orchestration (spawn/sleep/monitor)

Tree-search solution planning (AIDE)
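The spawn/sleep/monitor pattern can be sketched with the Python standard library. This is a generic illustration, not the tool implementation used by the benchmarked agents; `run_in_background` and its parameters are hypothetical.

```python
import subprocess
import time
from typing import List


def run_in_background(cmd: List[str], poll_seconds: float, timeout_seconds: float) -> int:
    """Spawn `cmd`, poll until it exits, and return its exit code.

    Kills the process and raises TimeoutError if it runs past `timeout_seconds`.
    """
    proc = subprocess.Popen(cmd)  # spawn: returns immediately, job keeps running
    deadline = time.monotonic() + timeout_seconds
    while True:
        code = proc.poll()  # monitor: None while the process is still running
        if code is not None:
            return code
        if time.monotonic() >= deadline:
            proc.kill()
            proc.wait()  # reap the killed process
            raise TimeoutError(f"{cmd!r} exceeded {timeout_seconds}s")
        time.sleep(poll_seconds)  # sleep: wait before checking again
```

An agent that polls at a coarse interval spends far fewer steps and tokens on a long training run than one that checks status after every action.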

Tool Use

Shell and spawn commands

File create/edit/list tools

WandB logging integration

Frameworks

LangGraph, Composio, OpenHands, AIDE, Calipers, litellm

Is Agentic

Yes

Architectures

LLM-driven agents (Claude Sonnet, GPT-4o, Gemini)

Collaboration

Not evaluated (single-agent runs)

Optimization Features

Token Efficiency

Prompt caching reduces cost in OpenHands

Infra Optimization

Use of background process tools to run long jobs

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/ml-dev-bench/ml-dev-bench

https://drive.google.com/drive/folders/1o1FCvx_n9XVKgvkSWL97LlsfOh_KfG-S?usp=sharing

Data URLs

https://github.com/ml-dev-bench/ml-dev-bench

Risks & Boundaries

Limitations

Binary success metric hides partial progress and useful intermediate artifacts

Evaluations are single-run aggregates; variance across runs not reported
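One way to see how little a single run over 30 tasks can resolve is a Wilson score interval on the reported success counts. The sketch below is an illustration added here, not an analysis from the paper. It treats the 30 tasks as independent trials, which is itself an assumption, and it captures sampling uncertainty over tasks only, so it does not replace repeated runs.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Success counts reported in Table 2 of the paper
openhands_low, openhands_high = wilson_interval(15, 30)
react_low, react_high = wilson_interval(14, 30)
```

For 15/30 the interval is roughly 33% to 67%, and for 14/30 roughly 30% to 64%. The two ranges overlap almost entirely, so the one-task gap between the top two agents should not be read as a reliable ranking.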

When Not To Use

For evaluating low-level model research or algorithmic innovations

To judge incremental partial progress where intermediate outputs matter

Failure Modes

Excessive verification prompts causing extra tokens and stalls

Premature termination before long-running tasks finish

Core Entities

Models

Claude Sonnet 3.5 (October 2024 release)

OpenAI GPT-4o

Gemini 2.0 Flash

GPT-4o (used with AIDE)

o1/o3 (mentioned as future work)

Metrics

Binary success rate

Token usage

Token cost ($)

Number of steps

Datasets

Noisy Imagenette

Imagenette

CIFAR-10

CIFAR-100

TinyBERT Eval

Segmentation datasets (unspecified)

Benchmarks

SWE-Bench

ML-Bench

MLE-Bench

MLAgentBench

ML-Dev-Bench (this paper)

