MLRC-Bench: a competition-based benchmark that tests if LLM agents can propose and implement novel ML research

Overview

Decision Snapshot: Needs Validation

The benchmark is well-documented and reproducible, but current agent performance is far below human solutions; use it for evaluation and research, not for production research automation.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 35%

Production readiness: 20%

Novelty: 40%

Authors

Yunxiang Zhang, Muhammad Khalifa, Shitanshu Bhushan, Grant D Murphy, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang

Links

Abstract / PDF / Code

Why It Matters For Business

If you plan to use LLM agents for real research work, expect them to assist but not replace expert researchers; measure ideas with objective tests and track cost, runtime, and code complexity.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

MLRC-BENCH is a benchmark of seven real ML research competitions adapted into a repo-level, compute-constrained environment to test language-model-based research agents. It measures whether agents can propose novel methods and implement them end-to-end, scored with objective metrics (effectiveness, runtime, lines of code). Results show that current agents rarely close the gap to top human solutions (the best tested agent closed 9.3% of that gap on average) and that LLM-as-a-judge scores are often misaligned with measured performance.

Problem Statement

Can language-model-driven research agents both invent genuinely new ML methods and implement them well enough to meaningfully beat baselines and approach top human solutions? The paper builds a reproducible, repository-level benchmark from real conference competitions to measure idea novelty plus empirical effectiveness under realistic compute limits.

Main Contribution

A reproducible benchmark (MLRC-BENCH) that converts seven recent ML conference competitions into repo-level tasks with starter code, dev/test splits, and runtime/GPU constraints.

A protocol that scores agent solutions by objective metrics (Relative Improvement to Human, runtime, lines of code) to measure both novelty and effectiveness.

Key Findings

Best tested agent (gemini-exp-1206 under MLAB) closed only a small fraction of the gap to top human solutions.

Numbers: 9.3% average Relative Improvement to Human (Table 3)

Practical Use: Don't expect current LLM agents to replace expert researchers; use them for scaffolding or low-effort prototypes, not final research-grade solutions.

Evidence Ref: Table 3

LLM-based subjective novelty scores do not predict empirical success.

Numbers: Spearman corr ≈ -0.06 between innovativeness and effectiveness (Section 4.3)

Practical Use: Always validate agent-generated ideas with objective metrics and hidden test sets rather than LLM judgment alone.

Evidence Ref: Section 4.3, Figure 3
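
The correlation check behind this finding can be reproduced on your own runs with a rank correlation. The sketch below is a minimal pure-Python Spearman implementation, not the paper's evaluation code; the function names and the sample scores are hypothetical and only illustrate the calculation.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores (made up for illustration, not taken from the paper):
# LLM-judged innovativeness vs. measured effectiveness for five solutions.
innovativeness = [3.0, 4.5, 2.0, 4.0, 3.5]
effectiveness = [12.0, 1.5, 8.0, 9.0, 0.5]
rho = spearman(innovativeness, effectiveness)
```

A value of `rho` near zero, as reported in Section 4.3, would mean that ranking ideas by judged novelty tells you almost nothing about how they rank on the measured metric.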

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average Relative Improvement to Human (best agent across tasks)	9.3%	0 = baseline, 100 = top human	—	avg across 7 tasks (Table 3)	gemini-exp-1206 average Relative Improvement to Human = 9.3% (Table 3)	Table 3
Pass@1 on backdoor-trigger-recovery	0.31 (Human Idea + MLAB) vs 0.12 (MLAB-only)	MLAB-only	×2.58 improvement	backdoor-trigger-recovery (Table 5)	Table 5 pass@1 values	Table 5
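
The table describes Relative Improvement to Human as a scale where 0 is the baseline and 100 is the top human solution. A minimal sketch of such a normalization follows, assuming a linear interpolation between those two anchors; the function name and the sample scores are hypothetical, and the paper's exact formula should be checked against its definition. The last lines recompute the ×2.58 ratio from the two pass@1 values in the table.

```python
def relative_improvement_to_human(agent, baseline, human):
    """Map a raw score onto a scale where baseline -> 0 and top human -> 100.

    Assumes linear interpolation between the two anchors. Because the gap
    (human - baseline) carries the sign, the same expression also works for
    lower-is-better metrics such as error rates.
    """
    gap = human - baseline
    if gap == 0:
        raise ValueError("human and baseline scores are identical")
    return 100.0 * (agent - baseline) / gap


# Hypothetical raw scores, for illustration only.
score = relative_improvement_to_human(agent=0.6, baseline=0.5, human=1.5)

# Ratio of the two pass@1 values reported in the table (0.31 vs 0.12).
ratio = round(0.31 / 0.12, 2)
```

On this scale a negative value means the agent's solution is worse than the provided baseline, which is why per-task values should be inspected alongside the average.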

What To Try In 7 Days

Run MLRC-BENCH on a representative internal repo-level task to benchmark your agent vs a human baseline.

Add a human ideation step before automated implementation and compare pass@k and cost.

Track objective metrics (Relative Improvement to Human, runtime, LLoC) instead of relying on LLM-judged novelty.
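
To track runtime and code size for your own agent runs, something like the following could serve as a starting point. It is a sketch under stated assumptions: counting `ast` statement nodes is one simple proxy for logical lines of code in Python and may differ from the LLoC definition used in the paper, and both helper names are hypothetical.

```python
import ast
import time


def logical_lines_of_code(source: str) -> int:
    """Count statement nodes in Python source as a simple LLoC proxy.

    Comments and blank lines are ignored because they produce no AST nodes.
    Docstrings are expression statements, so they do count here.
    """
    tree = ast.parse(source)
    return sum(isinstance(node, ast.stmt) for node in ast.walk(tree))


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


sample = (
    "import os\n"
    "\n"
    "def f(x):\n"
    "    # comments are not counted\n"
    "    y = x + 1\n"
    "    return y\n"
)
lloc = logical_lines_of_code(sample)  # import, def, assignment, return
total, seconds = timed(sum, [1, 2, 3])
```

Logging these two numbers per iteration makes it easier to spot refinement steps that add code and runtime without improving the task metric.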

Agent Features

Memory

short-term internal memory across steps

Planning

iterative refinement (multi-step actions and executions)

LoRA

Tool Use

file operations (list, edit, execute)

python runtime and test execution

web retrieval for ideation (CoI-Agent)

Frameworks

MLAB

CoI-Agent (Chain-of-Ideas)

Human Idea + MLAB

Is Agentic

Yes

Architectures

ReAct-style reasoning-action loop

Collaboration

ideation + implementation split (CoI-Agent ideas consumed by MLAB)

supports human-in-the-loop idea injection

Optimization Features

Token Efficiency

cost-effectiveness trade-offs considered (Section 4.5)

Infra Optimization

explicit runtime and GPU memory limits per task to encourage efficient methods
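
A per-task runtime limit of this kind can be approximated locally by running each candidate solution in a subprocess with a timeout. This is a minimal sketch, not the benchmark's harness: the function name and the returned dictionary layout are hypothetical, and it enforces wall-clock time only, not GPU memory.

```python
import subprocess
import sys


def run_with_time_limit(cmd, limit_seconds):
    """Run a command and report whether it finished within the time limit."""
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=limit_seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "returncode": None, "stdout": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "returncode": proc.returncode, "stdout": proc.stdout}


# Usage: a quick script passes, a slow one is cut off.
fast = run_with_time_limit([sys.executable, "-c", "print('done')"], 30)
slow = run_with_time_limit(
    [sys.executable, "-c", "import time; time.sleep(5)"], 0.5
)
```

Treating a timeout as a failed submission, rather than retrying with a longer limit, keeps comparisons between agents consistent.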

Inference Optimization

inference-time scaling via repeated trials and sampling (Section 4.6)

pass@k-style scaling experiments
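
When scaling by repeated trials, pass@k can be estimated from n independent trials of which c succeeded. The sketch below uses the unbiased combinatorial estimator that is common in code-generation evaluation; whether the paper computes pass@k in exactly this way is an assumption, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled trials succeeds.

    n: total trials run, c: trials that succeeded, k: sampling budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer failures than draws: every k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 8 trials with 2 successes.
p1 = pass_at_k(8, 2, 1)  # 1 - 6/8 = 0.25
p4 = pass_at_k(8, 2, 4)
```

Comparing `p1` with `p4` against the extra cost of four runs gives a rough view of whether additional sampling is worth paying for.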

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/yunx-z/MLRC-Bench

https://huggingface.co/spaces/launch/MLRC_Bench

Risks & Boundaries

Limitations

Only seven curated tasks in the current release; coverage will expand over time (Section 3.2).

Majority of experiments use one primary agent scaffold (MLAB), so results may not generalize to all agent frameworks (Section 4.1).

When Not To Use

To evaluate single-file coding agents on trivial Kaggle-style tasks that don't require methodological novelty.

As a sole arbiter of idea quality, if that judgment rests on subjective LLM scores, which are unreliable proxies for real performance.

Failure Modes

Tool-argument hallucinations leading to action errors (11.5% of steps) and failed executions (Section 4.4, F.1).

Over-refinement where code size and runtime grow without commensurate performance gains (Figure 4).

Core Entities

Models

gemini-exp-1206

claude-3-5-sonnet-v2

llama3-1-405b-instruct

o3-mini

gpt-4o

o1

Metrics

Relative Improvement to Human (normalized main metric)

Accuracy

Efficiency (runtime)

Simplicity (logical lines of code, LLoC)

Pass@k (success probability metric)

Datasets

LLM Merging (NeurIPS 2024 competition)

Backdoor Trigger Recovery (NeurIPS 2024)

Temporal Action Localization (ECCV 2024 Perception Test)

Rainfall Prediction / Weather4cast (NeurIPS 2022/2023)

Machine Unlearning (NeurIPS Unlearning 2023)

Next Product Recommendation (KDD Cup 2023)

Cross-Domain Meta-Learning (NeurIPS 2022)

Benchmarks

MLRC-BENCH (this paper)

