Overview
The benchmark is well-documented and reproducible, but current agent performance is far below human solutions; use it for evaluation and research, not for production research automation.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 35%
Production readiness: 20%
Novelty: 40%
Why It Matters For Business
If you plan to use LLM agents for real research work, expect them to assist but not replace expert researchers; measure ideas with objective tests and track cost, runtime, and code complexity.
Who Should Care
Summary TLDR
MLRC-BENCH is a benchmark of seven real ML research competitions adapted into a repo-level, compute-constrained environment to test language-model-based research agents. It uses objective metrics (effectiveness, runtime, lines of code) to measure whether agents can propose novel methods and implement them end-to-end. Results show that current agents rarely close the gap to top human solutions (the best model closed 9.3% of that gap on average) and that LLM-as-a-judge scores often misalign with real performance.
Problem Statement
Can language-model-driven research agents both invent genuinely new ML methods and implement them well enough to meaningfully beat baselines and approach top human solutions? The paper builds a reproducible, repository-level benchmark from real conference competitions to measure idea novelty plus empirical effectiveness under realistic compute limits.
Main Contribution
A reproducible benchmark (MLRC-BENCH) that converts seven recent ML conference competitions into repo-level tasks with starter code, dev/test splits, and runtime/GPU constraints.
A protocol that scores agent solutions by objective metrics (Relative Improvement to Human, runtime, lines of code) to measure both novelty and effectiveness.
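The results below report Relative Improvement to Human on a scale where 0 is the baseline and 100 is the top human solution. A minimal sketch of one way to compute such a score, assuming a simple linear rescaling between the two anchors (the exact formula is an assumption here, and the function name and example scores are hypothetical):

```python
def relative_improvement_to_human(agent: float, baseline: float, human: float) -> float:
    """Rescale an agent score so that baseline maps to 0 and top human maps to 100.

    Assumption: a linear rescaling between the two anchors. Because the sign of
    the gap cancels, the same expression works for lower-is-better metrics.
    """
    gap = human - baseline
    if gap == 0:
        raise ValueError("human and baseline scores are identical; the gap is undefined")
    return 100.0 * (agent - baseline) / gap


if __name__ == "__main__":
    # Hypothetical scores, not taken from the benchmark.
    print(relative_improvement_to_human(agent=55.0, baseline=50.0, human=90.0))
```

Under this reading, a negative value means the agent's solution is worse than the starter baseline, and a value above 100 means it beats the top human entry.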
Key Findings
Best tested agent (gemini-exp-1206 under MLAB) closed only 9.3% of the gap between the baseline and top human solutions, averaged across the seven tasks.
LLM-based subjective novelty scores do not predict empirical success.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average Relative Improvement to Human (best agent across tasks) | 9.3% | 0 = baseline, 100 = top human | — | avg across 7 tasks (Table 3) | gemini-exp-1206 average Relative Improvement to Human = 9.3% (Table 3) | Table 3 |
| Pass@1 on backdoor-trigger-recovery | 0.31 (Human Idea + MLAB) | 0.12 (MLAB-only) | ×2.58 | backdoor-trigger-recovery (Table 5) | Table 5 pass@1 values | Table 5 |
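For readers who want to reproduce a pass@k figure from their own repeated runs, the sketch below uses the standard unbiased estimator from code-generation evaluation, 1 - C(n-c, k) / C(n, k), where n is the number of runs and c the number that succeeded. Whether the paper computes its pass@1 values this way is an assumption; the function name and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled runs succeeds.

    n: total runs attempted, c: runs that succeeded, k: sample size.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-sample contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 8 runs, 1 success.
    print(pass_at_k(n=8, c=1, k=1))
```

For k = 1 the estimator reduces to the plain success rate c / n.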
What To Try In 7 Days
Run MLRC-BENCH on a representative internal repo-level task to benchmark your agent vs a human baseline.
Add a human ideation step before automated implementation and compare pass@k and cost.
Track objective metrics (Relative Improvement to Human, runtime, LLoC) instead of relying on LLM-judged novelty.
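Tracking runtime and code size can start very small. The sketch below is one rough way to do it, assuming that "logical lines" means non-blank lines that are not pure `#` comments (the benchmark's own LLoC definition may differ) and that wall-clock time is an acceptable runtime proxy. Both helper names are hypothetical.

```python
import time
from typing import Any, Callable, Tuple


def count_logical_lines(source: str) -> int:
    """Count non-blank lines that are not pure '#' comments (a rough LLoC proxy)."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    snippet = "import os\n\n# a comment\nx = 1\n"
    print(count_logical_lines(snippet))
    result, seconds = timed(sum, range(1_000_000))
    print(result, f"{seconds:.4f}s")
```

Logging these two numbers next to the task score for every agent iteration gives a cheap record of cost versus benefit.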
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Only seven curated tasks in the current release; coverage will expand over time (Section 3.2).
Majority of experiments use one primary agent scaffold (MLAB), so results may not generalize to all agent frameworks (Section 4.1).
When Not To Use
To evaluate single-file coding agents on trivial Kaggle-style tasks that don't require methodological novelty.
As the sole arbiter of idea quality: subjective LLM judgments are unreliable proxies for real performance.
Failure Modes
Tool-argument hallucinations leading to action errors (11.5% of steps) and failed executions (Section 4.4, F.1).
Over-refinement where code size and runtime grow without commensurate performance gains (Figure 4).
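The over-refinement pattern can be monitored with a simple check over per-iteration snapshots. This is a sketch under stated assumptions, not the paper's analysis: the `Snapshot` record, the `flag_over_refinement` helper, the higher-is-better score, and the `min_gain` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    score: float      # task metric, assumed higher-is-better
    runtime_s: float  # wall-clock runtime of the solution
    lloc: int         # logical lines of code


def flag_over_refinement(history: List[Snapshot], min_gain: float = 0.0) -> List[int]:
    """Return indices of iterations where code size and runtime both grew
    while the score improved by no more than min_gain."""
    flagged = []
    for i in range(1, len(history)):
        prev, cur = history[i - 1], history[i]
        grew = cur.lloc > prev.lloc and cur.runtime_s > prev.runtime_s
        if grew and (cur.score - prev.score) <= min_gain:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Hypothetical iteration history.
    history = [
        Snapshot(0.50, 10.0, 100),
        Snapshot(0.55, 12.0, 120),
        Snapshot(0.55, 20.0, 180),
        Snapshot(0.54, 25.0, 200),
    ]
    print(flag_over_refinement(history))
```

Flagged iterations are candidates for rollback or for stopping the refinement loop early.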

