MLRC-Bench: a competition-based benchmark that tests if LLM agents can propose and implement novel ML research

April 13, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The benchmark is well-documented and reproducible, but current agent performance is far below human solutions; use it for evaluation and research, not for production research automation.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 35%

Production readiness: 20%

Novelty: 40%

Authors

Yunxiang Zhang, Muhammad Khalifa, Shitanshu Bhushan, Grant D Murphy, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang

Links

Abstract / PDF / Code

Why It Matters For Business

If you plan to use LLM agents for real research work, expect them to assist but not replace expert researchers; measure ideas with objective tests and track cost, runtime, and code complexity.

Who Should Care

Summary TLDR

MLRC-BENCH is a benchmark of seven real ML research competitions adapted into a repo-level, compute-constrained environment to test language-model-based research agents. It measures whether agents can propose novel methods and implement them end-to-end, using objective metrics (effectiveness, runtime, lines of code). Results show that current agents close little of the gap between the baseline and top human solutions (the best tested agent closed 9.3% of it on average) and that LLM-as-a-judge novelty scores are poorly aligned with measured performance.

Problem Statement

Can language-model-driven research agents both invent genuinely new ML methods and implement them well enough to meaningfully beat baselines and approach top human solutions? The paper builds a reproducible, repository-level benchmark from real conference competitions to measure idea novelty plus empirical effectiveness under realistic compute limits.

Main Contribution

A reproducible benchmark (MLRC-BENCH) that converts seven recent ML conference competitions into repo-level tasks with starter code, dev/test splits, and runtime/GPU constraints.

A protocol that scores agent solutions by objective metrics (Relative Improvement to Human, runtime, lines of code) to measure both novelty and effectiveness.
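The Results table describes the main metric as normalized so that 0 corresponds to the baseline and 100 to the top human solution. A minimal sketch of such a score, assuming the rescaling is linear between those two anchor points; the function name and the example scores are hypothetical and not taken from the paper:

```python
def relative_improvement_to_human(agent: float, baseline: float, human: float) -> float:
    """Linearly rescale a raw score so that baseline -> 0 and top human -> 100.

    The same expression works for lower-is-better metrics, because both
    differences change sign together.
    """
    if human == baseline:
        raise ValueError("human and baseline scores must differ")
    return 100.0 * (agent - baseline) / (human - baseline)


# Hypothetical raw scores, for illustration only.
score = relative_improvement_to_human(agent=0.55, baseline=0.50, human=1.00)
print(round(score, 1))
```

Under this reading, a negative value means the agent's solution is worse than the provided baseline, and a value above 100 would mean it beats the top human entry.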

Key Findings

The best tested agent (gemini-exp-1206 under MLAB) closed only a small fraction of the gap to top human solutions.

Numbers: 9.3% average Relative Improvement to Human (Table 3)

Practical Use: Don't expect current LLM agents to replace expert researchers; use them for scaffolding or low-effort prototypes, not final research-grade solutions.

Evidence Ref: Table 3

LLM-based subjective novelty scores do not predict empirical success.

Numbers: Spearman corr ≈ -0.06 between innovativeness and effectiveness (Section 4.3)

Practical Use: Always validate agent-generated ideas with objective metrics and hidden test sets rather than LLM judgment alone.

Evidence Ref: Section 4.3, Figure 3
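The second finding rests on a Spearman rank correlation between judged innovativeness and measured effectiveness. A minimal pure-Python sketch of that statistic (Pearson correlation computed on average ranks) is shown here; the input lists are hypothetical and are not the paper's data:

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical judge scores and measured improvements, for illustration only.
innovativeness = [4.0, 3.5, 2.0, 4.5, 3.0]
effectiveness = [1.2, 9.3, 4.7, 0.0, 5.1]
print(round(spearman(innovativeness, effectiveness), 2))
```

A coefficient near zero, as reported in the finding, means that ranking ideas by judged innovativeness says almost nothing about how they rank on measured effectiveness.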

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Average Relative Improvement to Human (best agent across tasks) | 9.3% | 0 = baseline, 100 = top human | - | avg across 7 tasks (Table 3) | gemini-exp-1206 average Relative Improvement to Human = 9.3% (Table 3) | Table 3 |
| Pass@1 on backdoor-trigger-recovery | 0.31 (Human Idea + MLAB) vs 0.12 (MLAB-only) | MLAB-only | ×2.58 improvement | backdoor-trigger-recovery (Table 5) | Table 5 pass@1 values | Table 5 |

What To Try In 7 Days

Run MLRC-BENCH on a representative internal repo-level task to benchmark your agent vs a human baseline.

Add a human ideation step before automated implementation and compare pass@k and cost.

Track objective metrics (Relative Improvement to Human, runtime, and logical lines of code, LLoC) instead of relying on LLM-judged novelty.
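Two of the metrics in the last item, runtime and logical lines of code, are cheap to collect. The sketch below assumes that one Python AST statement counts as one logical line, which may differ from the paper's exact LLoC definition; the helper names are hypothetical:

```python
import ast
import time


def logical_lines_of_code(source: str) -> int:
    """Count statement nodes in Python source as a proxy for logical lines.

    Comments and blank lines are ignored; a docstring counts as one statement.
    """
    tree = ast.parse(source)
    return sum(isinstance(node, ast.stmt) for node in ast.walk(tree))


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


sample = '''
# comments and blank lines are not counted
def add(a, b):
    total = a + b

    return total
'''
print(logical_lines_of_code(sample))
_, seconds = timed(sum, range(1000))
print(f"{seconds:.6f}s")
```

Counting statements rather than physical lines keeps the measure stable under reformatting, so a solution cannot look simpler just by joining lines.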

Agent Features

Memory
short-term internal memory across steps
Planning
iterative refinement (multi-step actions and executions); LoRA
Tool Use
file operations (list, edit, execute); Python runtime and test execution; web retrieval for ideation (CoI-Agent)
Frameworks
MLAB; CoI-Agent (Chain-of-Ideas); Human Idea + MLAB
Is Agentic

Yes

Architectures
ReAct-style reasoning-action loop
Collaboration
ideation + implementation split (CoI-Agent ideas consumed by MLAB); supports human-in-the-loop idea injection
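The features listed above (a reasoning-action loop, short-term memory across steps, and file tools) can be illustrated with a toy loop. This is a sketch under stated assumptions, not MLAB's implementation: the tool names, the in-memory workspace, and the scripted policy that stands in for the LLM are all hypothetical. Tool arguments are validated before execution, so a malformed call becomes an error observation rather than a crash:

```python
import inspect

# In-memory stand-in for a repository; a real agent would touch the file system.
workspace = {"train.py": "print('baseline')"}


def list_files():
    return sorted(workspace)


def edit_file(path: str, content: str):
    workspace[path] = content
    return f"wrote {path}"


TOOLS = {"list_files": list_files, "edit_file": edit_file}


def call_tool(name: str, args: dict) -> str:
    """Execute a tool call, turning invalid names or arguments into observations."""
    tool = TOOLS.get(name)
    if tool is None:
        return f"error: unknown tool {name!r}"
    try:
        inspect.signature(tool).bind(**args)  # rejects missing or unexpected arguments
    except TypeError as exc:
        return f"error: bad arguments for {name}: {exc}"
    return str(tool(**args))


def run_agent(policy, max_steps: int = 5):
    """Reason-act loop: the policy sees the history and picks the next action."""
    history = []  # short-term memory of (thought, action, observation) triples
    for _ in range(max_steps):
        thought, action, args = policy(history)
        if action == "finish":
            break
        observation = call_tool(action, args)
        history.append((thought, action, observation))
    return history


def scripted_policy(history):
    """Hypothetical stand-in for the LLM; replays a fixed plan."""
    plan = [
        ("inspect the repo", "list_files", {}),
        ("try an edit with a wrong argument name", "edit_file", {"file": "train.py"}),
        ("retry with valid arguments", "edit_file",
         {"path": "train.py", "content": "print('v2')"}),
        ("done", "finish", {}),
    ]
    return plan[len(history)]


for step in run_agent(scripted_policy):
    print(step)
```

Feeding the validation error back as an observation lets the next step correct the call, at the cost of one wasted step.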

Optimization Features

Token Efficiency
cost-effectiveness trade-offs considered (Section 4.5)
Infra Optimization
explicit runtime and GPU memory limits per task to encourage efficient methods
Inference Optimization
inference-time scaling via repeated trials and sampling (Section 4.6); pass@k-style scaling experiments
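For the pass@k-style experiments, a common way to estimate pass@k from n independent trials with c successes is the combinatorial estimator used in code-generation evaluation, 1 - C(n-c, k) / C(n, k). This summary does not state whether the paper uses exactly this estimator, so treat the sketch as an assumption; the trial counts are hypothetical:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n trials of which c succeeded, is a success."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 8 trials, 1 success.
for k in (1, 2, 4, 8):
    print(k, round(pass_at_k(n=8, c=1, k=k), 3))
```

With k = 1 the estimator reduces to the plain success rate c / n, which is the quantity a pass@1 comparison such as the one in the Results table reports.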

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Only seven curated tasks in the current release; coverage will expand over time (Section 3.2).

Most experiments use one primary agent scaffold (MLAB), so results may not generalize to all agent frameworks (Section 4.1).

When Not To Use

To evaluate single-file coding agents on trivial Kaggle-style tasks that don't require methodological novelty.

As a sole arbiter of idea quality—subjective LLM judgments are unreliable proxies for real performance.

Failure Modes

Tool-argument hallucinations leading to action errors (11.5% of steps) and failed executions (Section 4.4, F.1).

Over-refinement where code size and runtime grow without commensurate performance gains (Figure 4).

Core Entities

Models

gemini-exp-1206; claude-3-5-sonnet-v2; llama3-1-405b-instruct; o3-mini; gpt-4o; o1

Metrics

Relative Improvement to Human (normalized main metric); Accuracy; Efficiency (runtime); Simplicity (logical lines of code, LLoC); Pass@k (success probability metric)

Datasets

LLM Merging (NeurIPS 2024 competition); Backdoor Trigger Recovery (NeurIPS 2024); Temporal Action Localization (ECCV 2024 Perception Test); Rainfall Prediction / Weather4cast (NeurIPS 2022/2023); Machine Unlearning (NeurIPS Unlearning 2023); Next Product Recommendation (KDD Cup 2023); Cross-Domain Meta-Learning (NeurIPS 2022)

Benchmarks

MLRC-BENCH (this paper)