MLRC-Bench: a competition-based benchmark that tests if LLM agents can propose and implement novel ML research

April 13, 2025 · 8 min

Overview

Production Readiness

0.2

Novelty Score

0.4

Cost Impact Score

0.35

Citation Count

0

Authors

Yunxiang Zhang, Muhammad Khalifa, Shitanshu Bhushan, Grant D Murphy, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang

Links

Abstract / PDF

Why It Matters For Business

If you plan to use LLM agents for real research work, expect them to assist but not replace expert researchers; measure ideas with objective tests and track cost, runtime, and code complexity.

Summary TLDR

MLRC-BENCH is a benchmark of seven real ML research competitions adapted into a repo-level, compute-constrained environment to test language-model-based research agents. It measures whether agents can propose novel methods and implement them end-to-end, using objective metrics (effectiveness, runtime, lines of code). Results show that current agents rarely close the gap to top human solutions (the best agent closed 9.3% of that gap on average) and that LLM-as-a-judge novelty scores often fail to match real performance.

Problem Statement

Can language-model-driven research agents both invent genuinely new ML methods and implement them well enough to meaningfully beat baselines and approach top human solutions? The paper builds a reproducible, repository-level benchmark from real conference competitions to measure idea novelty plus empirical effectiveness under realistic compute limits.

Main Contribution

A reproducible benchmark (MLRC-BENCH) that converts seven recent ML conference competitions into repo-level tasks with starter code, dev/test splits, and runtime/GPU constraints.

A protocol that scores agent solutions by objective metrics (Relative Improvement to Human, runtime, lines of code) to measure both novelty and effectiveness.

Large-scale evaluations of multiple LLMs and agent scaffoldings, showing agents struggle to generate and implement novel, effective methods and that LLM-judged novelty poorly correlates with objective gains.
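The Relative Improvement to Human score in the protocol above is described in this summary only by its anchors: 0 means the baseline and 100 means the top human solution. A minimal sketch of a score with those anchors, assuming a plain linear rescaling (the function name and the example numbers are hypothetical, not taken from the paper):

```python
def relative_improvement_to_human(agent: float, baseline: float, top_human: float) -> float:
    """Rescale a raw task score so that baseline -> 0 and top human -> 100.

    Assumption: the normalization is linear. Because the gap is signed,
    the same formula also covers metrics where lower is better.
    """
    gap = top_human - baseline
    if gap == 0:
        raise ValueError("top_human and baseline must differ")
    return 100.0 * (agent - baseline) / gap


if __name__ == "__main__":
    # Hypothetical accuracy-style scores, for illustration only.
    print(relative_improvement_to_human(agent=0.54, baseline=0.50, top_human=0.90))
```

A value below 0 means the agent's solution is worse than the starter baseline; a value above 100 would mean it beat the top human entry.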

Key Findings

The best tested agent (gemini-exp-1206 under MLAB) closed only a small fraction of the gap between the baseline and the top human solution.

Numbers: 9.3% average Relative Improvement to Human (Table 3)

LLM-based subjective novelty scores do not predict empirical success.

Numbers: Spearman corr ≈ -0.06 between innovativeness and effectiveness (Section 4.3)

Human-provided ideas speed up successful discovery compared to implementation-only runs.

Numbers: backdoor-trigger pass@1: 0.31 (Human Idea) vs 0.12 (MLAB) (Table 5)

Agents tend to increase runtime and code size with little performance gain during iterative refinement.

Numbers: Agents expand lines of code and runtime while performance plateaus (Figure 4)

Tool-usage and debugging remain key failure points.

Numbers: 11.5% of agent steps are action errors from incorrect tool arguments; only 17.2% of execution errors are fully fixed (F.

Results

Average Relative Improvement to Human (best agent across tasks)

Value: 9.3%

Baseline: 0 = baseline, 100 = top human

Pass@1 on backdoor-trigger-recovery

Value: 0.31 (Human Idea + MLAB) vs 0.12 (MLAB-only)

Baseline: MLAB-only

Correlation between judged innovativeness and empirical effectiveness

Value: -0.06 (near-zero)

Baseline: Spearman correlation
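To run the same kind of check on your own runs, you can compute a Spearman rank correlation between judge-assigned novelty scores and measured effectiveness. A dependency-free sketch follows; the function names and the demo scores are hypothetical, and this is a generic rank correlation, not the paper's evaluation code:

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical per-run scores, for illustration only.
    judged_novelty = [7.0, 5.5, 8.0, 6.0, 4.0]
    relative_improvement = [1.2, 9.0, -3.0, 4.5, 2.0]
    print(spearman(judged_novelty, relative_improvement))
```

A coefficient near zero, like the one reported here, means the judge's ranking of ideas tells you essentially nothing about which ones perform well.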

Who Should Care

What To Try In 7 Days

Run MLRC-BENCH on a representative internal repo-level task to benchmark your agent vs a human baseline.

Add a human ideation step before automated implementation and compare pass@k and cost.

Track objective metrics (Relative Improvement to Human, runtime, LLoC) instead of relying on LLM-judged novelty.

Agent Features

Memory

  • short-term internal memory across steps

Planning

  • iterative refinement (multi-step actions and executions)
  • LoRA

Tool Use

  • file operations (list, edit, execute)
  • python runtime and test execution
  • web retrieval for ideation (CoI-Agent)

Frameworks

  • MLAB
  • CoI-Agent (Chain-of-Ideas)
  • Human Idea + MLAB

Is Agentic

true

Architectures

  • ReAct-style reasoning-action loop

Collaboration

  • ideation + implementation split (CoI-Agent ideas consumed by MLAB)
  • supports human-in-the-loop idea injection

Optimization Features

Token Efficiency

  • cost-effectiveness trade-offs considered (Section 4.5)

Infra Optimization

  • explicit runtime and GPU memory limits per task to encourage efficient methods

Inference Optimization

  • inference-time scaling via repeated trials and sampling (Section 4.6)
  • pass@k-style scaling experiments
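For pass@k-style scaling, one common choice is the unbiased estimator used in code-generation evaluation: from n independent trials with c successes, estimate the chance that at least one of k sampled trials succeeds. The sketch below assumes that estimator. This summary does not state how the paper computes pass@k or what counts as a "success", so treat both as assumptions here:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n trials with c successes.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random subset of k trials contains no successful trial.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k trials include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical: 8 trials, 2 of which met the success criterion.
    for k in (1, 2, 4, 8):
        print(k, pass_at_k(n=8, c=2, k=k))
```

With k = 1 the estimator reduces to the plain success rate c / n, which is how a pass@1 figure can be read.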

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only seven curated tasks in the current release; coverage will expand over time (Section 3.2).
  • Majority of experiments use one primary agent scaffold (MLAB), so results may not generalize to all agent frameworks (Section 4.1).
  • Best-of-8 trials reported; no full statistical significance testing or large-run variance estimates (NeurIPS checklist, Section 7).
  • Some tasks rely on competition test APIs or reproduced splits; subtle differences from original competitions may exist.

When Not To Use

  • To evaluate single-file coding agents on trivial Kaggle-style tasks that don't require methodological novelty.
  • As a sole arbiter of idea quality; subjective LLM judgments are unreliable proxies for real performance.
  • For immediate production automation of research without human oversight; agents underperform experts.

Failure Modes

  • Tool-argument hallucinations leading to action errors (11.5% of steps) and failed executions (Section 4.4, F.1).
  • Over-refinement where code size and runtime grow without commensurate performance gains (Figure 4).
  • Poor self-debugging: only 17.2% of encountered execution errors are fully fixed (Section 4.4).
  • LLM-as-a-judge giving favorable subjective scores that do not match objective improvements (Section 4.3).

Core Entities

Models

  • gemini-exp-1206
  • claude-3-5-sonnet-v2
  • llama3-1-405b-instruct
  • o3-mini
  • gpt-4o
  • o1

Metrics

  • Relative Improvement to Human (normalized main metric)
  • Accuracy
  • Efficiency (runtime)
  • Simplicity (logical lines of code, LLoC)
  • Pass@k (success probability metric)
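The efficiency and simplicity metrics above can be approximated for Python solutions with the standard library. This is a rough sketch: it counts statement nodes in the parsed syntax tree as "logical lines" and uses wall-clock time as "runtime". The paper's exact counting tool and timing setup are not given in this summary, so the helper names and counting rule are assumptions:

```python
import ast
import time
from typing import Any, Callable


def logical_lines_of_code(source: str) -> int:
    """Count statement nodes in Python source.

    Comments and blank lines are ignored, and two statements on one
    physical line count twice. Docstrings count as one statement each.
    """
    tree = ast.parse(source)
    return sum(isinstance(node, ast.stmt) for node in ast.walk(tree))


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    snippet = "import os\n\n# a comment\ndef f(x):\n    y = x + 1\n    return y\n"
    print("LLoC:", logical_lines_of_code(snippet))
    _, seconds = timed(sorted, range(100_000, 0, -1))
    print("runtime (s):", seconds)
```

Logging both numbers after every agent iteration makes it easy to spot refinement that inflates code size and runtime without improving the task score.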

Datasets

  • LLM Merging (NeurIPS 2024 competition)
  • Backdoor Trigger Recovery (NeurIPS 2024)
  • Temporal Action Localization (ECCV 2024 Perception Test)
  • Rainfall Prediction / Weather4cast (NeurIPS 2022/2023)
  • Machine Unlearning (NeurIPS Unlearning 2023)
  • Next Product Recommendation (KDD Cup 2023)
  • Cross-Domain Meta-Learning (NeurIPS 2022)

Benchmarks

  • MLRC-BENCH (this paper)