A 30-task benchmark that tests agents on end-to-end ML development workflows

February 3, 2025 · 6 min read

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

2

Authors

Harshith Padigela, Chintan Shah, Dinkar Juyal

Links

Abstract / PDF

Why It Matters For Business

Agents can automate well-scoped ML tasks (data prep, basic debugging) today but fail at open-ended model improvement; firms should use agents to speed routine work and keep human oversight for strategic model changes.

Summary TLDR

ML-Dev-Bench is an open-source benchmark of 30 practical ML development tasks (dataset handling, training, debugging, model implementation, API integration, and performance tuning). The paper runs three agent families (ReAct, OpenHands, AIDE) with several LLM backends and reports binary success per task. OpenHands with Claude Sonnet performs best (50% overall, 15/30 tasks). Every agent fails the open-ended model-improvement tasks (0% success). The repo and run traces are public for reproduction and extension.

Problem Statement

Existing LLM and agent benchmarks test isolated coding or Kaggle-style problems but fail to capture end-to-end ML development: working on existing codebases, long-running training, tool integrations, debugging across files, and iterative performance tuning. The paper fills that gap with a practical, runnable 30-task suite.

Main Contribution

A 30-task benchmark (ML-Dev-Bench) that mimics real ML development workflows and spans dataset handling, training, debugging, implementation, API integration, and performance tuning.

An evaluation framework (Calipers) that runs agents, captures artifacts, and applies binary validation checks.

A comparative evaluation of three agent setups (ReAct, OpenHands, AIDE) across multiple LLM backends with token and cost measurements and published traces.
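
The binary validation idea behind the evaluation framework can be illustrated with a minimal sketch. This is not the Calipers API: the names `TaskResult`, `validate_artifacts`, `has_model_file`, and `has_nonempty_metrics`, and the file names `model.pt` and `metrics.json`, are all hypothetical and chosen only for illustration. The point is that a task counts as a success only if every check on the produced artifacts passes, with no partial credit.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class TaskResult:
    task_id: str
    success: bool  # binary outcome: no partial credit


def validate_artifacts(
    task_id: str, workdir: Path, checks: list[Callable[[Path], bool]]
) -> TaskResult:
    """Mark the task successful only if every check passes."""
    return TaskResult(task_id, all(check(workdir) for check in checks))


# Hypothetical checks for an illustrative training task.
def has_model_file(workdir: Path) -> bool:
    return (workdir / "model.pt").is_file()


def has_nonempty_metrics(workdir: Path) -> bool:
    metrics = workdir / "metrics.json"
    return metrics.is_file() and metrics.stat().st_size > 0


def demo() -> tuple[TaskResult, TaskResult]:
    """Validate a workspace before and after the metrics file exists."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        checks = [has_model_file, has_nonempty_metrics]
        (workdir / "model.pt").write_bytes(b"\x00")
        before = validate_artifacts("train_demo", workdir, checks)
        (workdir / "metrics.json").write_text('{"accuracy": 0.9}')
        after = validate_artifacts("train_demo", workdir, checks)
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print(before.success, after.success)
```

In this sketch the first validation fails because one artifact is missing, even though the model file was produced. That is the same property the Limitations section notes: a binary metric hides partial progress.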

Key Findings

OpenHands (Claude Sonnet) achieved the highest overall success rate.

Numbers: 50% (15/30 tasks)

ReAct (Claude Sonnet) performed close behind on common tasks.

Numbers: 47% (14/30 tasks)

No agent succeeded at open-ended model-improvement tasks.

Numbers: 0% success on Model Performance tasks (0/6)

Agents reliably solve narrowly scoped tasks like dataset handling.

Numbers: Dataset Handling: ReAct and OpenHands both 100% (3/3)

Token/cost patterns vary by agent and task type.

Numbers: ChannelViT: ReAct $1.06 vs. OpenHands $0.215; many tasks fall between $0.02 and $3.16

Results

Overall success rate (ReAct-Sonnet)

Value: 47% (14/30)

Overall success rate (OpenHands-Sonnet)

Value: 50% (15/30)

Model Performance category success

Value: 0% (0/6)

Dataset Handling success (ReAct & OpenHands)

Value: 100% (3/3)

Example token cost (ChannelViT)

Value: ReAct $1.06 vs. OpenHands $0.215
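
As a quick arithmetic check, the percentages above follow directly from the reported task counts, and the ChannelViT costs imply that ReAct was roughly five times more expensive than OpenHands on that task. The short sketch below recomputes these figures. The helper names `success_rate` and `cost_ratio` are illustrative, and the inputs are only the numbers quoted in this summary.

```python
def success_rate(passed: int, total: int) -> float:
    """Fraction of tasks that passed their binary validation."""
    return passed / total


def cost_ratio(cost_a: float, cost_b: float) -> float:
    """How many times more expensive run A was than run B."""
    return cost_a / cost_b


# (passed, total) pairs as quoted in the results above.
reported = {
    "ReAct-Sonnet overall": (14, 30),
    "OpenHands-Sonnet overall": (15, 30),
    "Model Performance": (0, 6),
    "Dataset Handling": (3, 3),
}

if __name__ == "__main__":
    for name, (passed, total) in reported.items():
        print(f"{name}: {success_rate(passed, total):.0%} ({passed}/{total})")
    # ChannelViT token cost in USD: ReAct vs. OpenHands.
    print(f"ChannelViT cost ratio: {cost_ratio(1.06, 0.215):.1f}x")
```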

What To Try In 7 Days

Run ML-Dev-Bench's dataset and training tasks on your agent to measure real costs and gaps (the benchmark repo is public).

Automate dataset download and preprocessing pipelines using an agent, then add unit tests to validate outputs.

Measure token cost per long-running job and enable prompt caching or lower-cost model backends for expensive tasks.
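
To make the cost-measurement step concrete, here is a minimal sketch of a per-run cost estimate that separates cached from uncached input tokens. The `Pricing` class, the `run_cost` function, and every price in the example are hypothetical placeholders, not figures from the paper or from any provider's price list. Substitute your provider's actual rates and the token counts from your own logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. All values used below are placeholders."""
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


def run_cost(
    pricing: Pricing, input_tokens: int, cached_tokens: int, output_tokens: int
) -> float:
    """Estimate the cost of one agent run in USD.

    `cached_tokens` is the portion of `input_tokens` served from the
    prompt cache and billed at the (assumed lower) cached rate.
    """
    uncached = input_tokens - cached_tokens
    if uncached < 0:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    total = (
        uncached * pricing.input_per_mtok
        + cached_tokens * pricing.cached_input_per_mtok
        + output_tokens * pricing.output_per_mtok
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Placeholder rates, for illustration only.
    pricing = Pricing(
        input_per_mtok=3.0, cached_input_per_mtok=0.3, output_per_mtok=15.0
    )
    no_cache = run_cost(pricing, 1_000_000, 0, 100_000)
    with_cache = run_cost(pricing, 1_000_000, 800_000, 100_000)
    print(f"no cache: ${no_cache:.2f}, with cache: ${with_cache:.2f}")
```

Comparing the two calls shows why caching matters most for long-running jobs: agents resend a growing context on every step, so input tokens tend to dominate the bill.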

Agent Features

Memory

  • Short-term step-limited execution (50-step cap)
  • Prompt caching for token efficiency

Planning

  • Action planning via tool calls
  • Background process orchestration (spawn/sleep/monitor)
  • Tree-search solution planning (AIDE)
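
The spawn/sleep/monitor pattern for long-running jobs can be sketched with Python's standard `subprocess` module. This is an illustrative sketch, not the tool implementation used by any of the evaluated agents. The function names `spawn` and `wait_with_polling`, the log file name, and the stand-in "training" command are assumptions.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path
from typing import Optional


def spawn(cmd: list[str], log_path: Path) -> subprocess.Popen:
    """Start a job in the background, sending its output to a log file."""
    with open(log_path, "w") as log:
        # The child keeps its own handle to the log after we close ours.
        return subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)


def wait_with_polling(
    proc: subprocess.Popen, poll_seconds: float, timeout_seconds: float
) -> Optional[int]:
    """Sleep between status checks; kill the job if it exceeds the timeout.

    Returns the exit code, or None if the job was killed on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while proc.poll() is None:  # None means the job is still running
        if time.monotonic() >= deadline:
            proc.kill()
            proc.wait()
            return None
        time.sleep(poll_seconds)
    return proc.returncode


def demo() -> tuple[Optional[int], str]:
    """Run a short stand-in for a training job and return (exit code, log)."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = Path(tmp) / "train.log"
        job = [sys.executable, "-c",
               "import time; time.sleep(0.2); print('epoch 1 done')"]
        proc = spawn(job, log_path)
        code = wait_with_polling(proc, poll_seconds=0.05, timeout_seconds=10.0)
        return code, log_path.read_text()


if __name__ == "__main__":
    print(demo())
```

Polling with a deadline guards against a job that hangs indefinitely. The "premature termination" failure mode listed under Risks & Boundaries is the opposite problem, where the agent stops monitoring before the job has finished.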

Tool Use

  • Shell and spawn commands
  • File create/edit/list tools
  • WandB logging integration

Frameworks

  • LangGraph
  • Composio
  • OpenHands
  • AIDE
  • Calipers
  • litellm

Is Agentic

true

Architectures

  • LLM-driven agents (Claude Sonnet, GPT-4o, Gemini)

Collaboration

  • Not evaluated (single-agent runs)

Optimization Features

Token Efficiency

  • Prompt caching reduces cost in OpenHands

Infra Optimization

  • Use of background process tools to run long jobs

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Binary success metric hides partial progress and useful intermediate artifacts
  • Evaluations are single-run aggregates; variance across runs not reported
  • Only a few agents and LLM backends tested; results may not generalize to other models
  • Open-ended model-improvement tasks are under-specified for automated judging
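
One way to see why 30 binary outcomes from a single run deserve cautious reading is to put a confidence interval on the headline rate. The sketch below computes a standard Wilson score interval for 15 successes out of 30. This calculation is not from the paper, and it captures only sampling uncertainty over tasks under a simple binomial assumption. It does not capture run-to-run variance, which would require repeated runs.

```python
import math


def wilson_interval(
    successes: int, total: int, z: float = 1.96
) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    low, high = wilson_interval(15, 30)
    print(f"15/30 -> 95% interval: {low:.1%} to {high:.1%}")
```

For 15/30 the interval spans roughly 33% to 67%, so the three-point gap between the 50% and 47% headline rates is well within the uncertainty of a 30-task suite.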

When Not To Use

  • For evaluating low-level model research or algorithmic innovations
  • To judge incremental partial progress where intermediate outputs matter

Failure Modes

  • Excessive verification prompts causing extra tokens and stalls
  • Premature termination before long-running tasks finish
  • Failing to follow artifact-creation instructions
  • Incorrect file edits, or modifying tests despite instructions not to

Core Entities

Models

  • Claude 3.5 Sonnet (October 2024 version)
  • OpenAI GPT-4o
  • Gemini 2.0 Flash
  • GPT-4o (used with AIDE)
  • o1/o3 (mentioned as future work)

Metrics

  • binary success rate
  • token usage
  • token cost ($)
  • number of steps

Datasets

  • Noisy Imagenette
  • Imagenette
  • CIFAR-10
  • CIFAR-100
  • TinyBERT Eval
  • Segmentation datasets (unspecified)

Benchmarks

  • SWE-Bench
  • ML-Bench
  • MLE-Bench
  • MLAgentBench
  • ML-Dev-Bench (this paper)