Overview
Production Readiness
0.2
Novelty Score
0.4
Cost Impact Score
0.35
Citation Count
0
Why It Matters For Business
If you plan to use LLM agents for real research work, expect them to assist but not replace expert researchers; measure ideas with objective tests and track cost, runtime, and code complexity.
Summary TLDR
MLRC-BENCH is a benchmark of seven real ML research competitions adapted into a repo-level, compute-constrained environment to test language-model-based research agents. It measures whether agents can propose novel methods and implement them end-to-end, scoring solutions on objective metrics (effectiveness, runtime, lines of code). Results show that current agents rarely close the gap to top human solutions (the best model closed 9.3% of that gap on average) and that LLM-as-a-judge scores often disagree with measured performance.
Problem Statement
Can language-model-driven research agents both invent genuinely new ML methods and implement them well enough to meaningfully beat baselines and approach top human solutions? The paper builds a reproducible, repository-level benchmark from real conference competitions to measure idea novelty plus empirical effectiveness under realistic compute limits.
Main Contribution
A reproducible benchmark (MLRC-BENCH) that converts seven recent ML conference competitions into repo-level tasks with starter code, dev/test splits, and runtime/GPU constraints.
A protocol that scores agent solutions by objective metrics (Relative Improvement to Human, runtime, lines of code) to measure both novelty and effectiveness.
Large-scale evaluations of multiple LLMs and agent scaffoldings, showing agents struggle to generate and implement novel, effective methods and that LLM-judged novelty poorly correlates with objective gains.
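The "Relative Improvement to Human" score can be read as the share of the baseline-to-human gap that an agent closes. The sketch below assumes the usual gap-normalized form, 100 * (agent - baseline) / (human - baseline); this summary does not spell out the formula, so treat both the formula and the function name as assumptions. The scores in the example are made up for illustration.

```python
def relative_improvement_to_human(agent_score: float,
                                  baseline_score: float,
                                  human_score: float) -> float:
    """Percentage of the baseline-to-human gap closed by the agent.

    Assumed form: 0 means "no better than the baseline",
    100 means "matches the top human solution".
    """
    gap = human_score - baseline_score
    if gap == 0:
        raise ValueError("human and baseline scores are identical; the gap is undefined")
    return 100.0 * (agent_score - baseline_score) / gap


# Hypothetical scores, not taken from the paper.
score = relative_improvement_to_human(agent_score=0.52,
                                      baseline_score=0.50,
                                      human_score=0.70)
print(f"{score:.1f}% of the human gap closed")
```

Because numerator and denominator flip sign together, the same expression also works for lower-is-better metrics such as error rates.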
Key Findings
The best tested agent (gemini-exp-1206 under MLAB) closed only 9.3% of the gap between the baseline and the top human solution on average.
LLM-based subjective novelty scores do not predict empirical success.
Human-provided ideas speed up successful discovery compared to implementation-only runs.
Agents tend to increase runtime and code size with little performance gain during iterative refinement.
Tool-usage and debugging remain key failure points.
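To check the second finding on your own runs, one option is a rank correlation between judged novelty scores and measured improvement. The sketch below implements Spearman's rank correlation in plain Python; the choice of Spearman is an assumption (this summary does not say which statistic the paper uses), and the example numbers are invented.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-solution data: judge score (1-10) and measured improvement (%).
judged_novelty = [8, 6, 9, 4, 7]
measured_gain = [1.2, 3.5, -0.4, 2.0, 0.1]
print(f"Spearman rho = {spearman(judged_novelty, measured_gain):.2f}")
```

A value near zero or below would be consistent with judge scores carrying little information about objective gains.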
Results
Average Relative Improvement to Human (best agent across tasks): 9.3%
Pass@1 on backdoor-trigger-recovery
Correlation between judged innovativeness and empirical effectiveness
Who Should Care
What To Try In 7 Days
Run MLRC-BENCH on a representative internal repo-level task to benchmark your agent vs a human baseline.
Add a human ideation step before automated implementation and compare pass@k and cost.
Track objective metrics (Relative Improvement to Human, runtime, LLoC) instead of relying on LLM-judged novelty.
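For the LLoC part of the tracking step, a rough proxy is to count statement nodes in the parsed syntax tree, which ignores blank lines and comments. This is a minimal sketch under that assumption; the paper's exact LLoC counting rule is not given in this summary, and `logical_lines_of_code` is a hypothetical helper name.

```python
import ast


def logical_lines_of_code(source: str) -> int:
    """Count statement nodes in Python source as a proxy for logical lines.

    Comments and blank lines are not statements, so they are not counted.
    Docstrings are expression statements, so they are counted.
    """
    tree = ast.parse(source)
    return sum(isinstance(node, ast.stmt) for node in ast.walk(tree))


example = """
import os

# This comment is not counted.
def add_one(x):
    y = x + 1
    return y
"""
print(logical_lines_of_code(example))
```

Logging this count next to runtime and the effectiveness score after each agent iteration makes it easy to spot code growth that buys no improvement.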
Agent Features
Memory
- short-term internal memory across steps
Planning
- iterative refinement (multi-step actions and executions)
- LoRA
Tool Use
- file operations (list, edit, execute)
- python runtime and test execution
- web retrieval for ideation (CoI-Agent)
Frameworks
- MLAB
- CoI-Agent (Chain-of-Ideas)
- Human Idea + MLAB
Is Agentic
true
Architectures
- ReAct-style reasoning-action loop
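A reasoning-action loop of this kind can be sketched as follows. This is an illustration only, not MLAB's implementation: `run_agent`, `scripted_policy`, and the tool names are hypothetical, and the scripted policy stands in for an LLM call.

```python
from typing import Callable, Dict, List, Tuple

# (thought, tool name, tool argument); all names here are hypothetical.
Step = Tuple[str, str, str]


def run_agent(policy: Callable[[List[str]], Step],
              tools: Dict[str, Callable[[str], str]],
              max_steps: int = 10) -> List[str]:
    """Alternate reasoning and tool calls until the policy finishes or steps run out."""
    history: List[str] = []
    for _ in range(max_steps):
        thought, tool, arg = policy(history)
        if tool == "finish":
            history.append(f"thought: {thought} | final answer: {arg}")
            break
        if tool not in tools:
            # A hallucinated tool name becomes an observation instead of a crash.
            observation = f"error: unknown tool {tool!r}"
        else:
            try:
                observation = tools[tool](arg)
            except Exception as exc:
                observation = f"error: {exc}"
        history.append(
            f"thought: {thought} | action: {tool}({arg!r}) | observation: {observation}"
        )
    return history


def scripted_policy(history: List[str]) -> Step:
    """Stand-in for an LLM call: replays a fixed plan indexed by step count."""
    plan = [
        ("Look at the starter files.", "list_files", "."),
        ("Try a tool that does not exist.", "train_model", "--fast"),
        ("Stop and report.", "finish", "done"),
    ]
    return plan[min(len(history), len(plan) - 1)]


tools = {"list_files": lambda path: "train.py, evaluate.py"}
trace = run_agent(scripted_policy, tools)
for line in trace:
    print(line)
```

Feeding errors back as observations, rather than raising them, gives the agent a chance to correct a bad action on the next step.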
Collaboration
- ideation + implementation split (CoI-Agent ideas consumed by MLAB)
- supports human-in-the-loop idea injection
Optimization Features
Token Efficiency
- cost-effectiveness trade-offs considered (Section 4.5)
Infra Optimization
- explicit runtime and GPU memory limits per task to encourage efficient methods
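A wall-clock limit like this can be enforced with a subprocess timeout. The sketch below covers only runtime, not GPU memory, and `run_with_time_limit` is a hypothetical helper rather than part of the benchmark's code.

```python
import subprocess
import sys
import time


def run_with_time_limit(cmd, limit_seconds):
    """Run a command and report whether it finished within the time limit."""
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=limit_seconds)
        status = "ok" if proc.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        status = "timeout"
    return {"status": status, "seconds": time.monotonic() - start}


# A trivially short job that stays within its limit.
result = run_with_time_limit([sys.executable, "-c", "print('training done')"],
                             limit_seconds=30)
print(result["status"], round(result["seconds"], 2))
```

Recording the elapsed seconds alongside the status gives the runtime metric for free, even for runs that succeed.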
Inference Optimization
- inference-time scaling via repeated trials and sampling (Section 4.6)
- pass@k-style scaling experiments
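Pass@k for repeated trials is commonly computed with the unbiased estimator popularized by the HumanEval benchmark: with n trials of which c succeed, pass@k = 1 - C(n-c, k) / C(n, k). Whether the paper uses exactly this estimator is an assumption; the sketch shows the standard form.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k trials drawn from n succeeds.

    n: total trials run, c: trials that succeeded, k: sample budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer failures than draws, so any k draws must include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical run: 8 trials, 2 of which beat the baseline.
for k in (1, 4, 8):
    print(k, pass_at_k(n=8, c=2, k=k))
```

Plotting pass@k against k shows how much of an agent's success comes from resampling rather than from any single reliable run.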
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only seven curated tasks in the current release; coverage will expand over time (Section 3.2).
- Majority of experiments use one primary agent scaffold (MLAB), so results may not generalize to all agent frameworks (Section 4.1).
- Best-of-8 trials reported; no full statistical significance testing or large-run variance estimates (NeurIPS checklist, Section 7).
- Some tasks rely on competition test APIs or reproduced splits; subtle differences from original competitions may exist.
When Not To Use
- To evaluate single-file coding agents on trivial Kaggle-style tasks that don't require methodological novelty.
- As the sole arbiter of idea quality; subjective LLM judgments are unreliable proxies for real performance.
- For immediate production automation of research without human oversight; agents underperform experts.
Failure Modes
- Tool-argument hallucinations leading to action errors (11.5% of steps) and failed executions (Section 4.4, F.1).
- Over-refinement where code size and runtime grow without commensurate performance gains (Figure 4).
- Poor self-debugging: only ~17.2% of the execution errors agents encountered were fully fixed (Section 4.4).
- LLM-as-a-judge giving favorable subjective scores that do not match objective improvements (Section 4.3).
Core Entities
Models
- gemini-exp-1206
- claude-3-5-sonnet-v2
- llama3-1-405b-instruct
- o3-mini
- gpt-4o
- o1
Metrics
- Relative Improvement to Human (normalized main metric)
- Accuracy
- Efficiency (runtime)
- Simplicity (logical lines of code, LLoC)
- Pass@k (success probability metric)
Datasets
- LLM Merging (NeurIPS 2024 competition)
- Backdoor Trigger Recovery (NeurIPS 2024)
- Temporal Action Localization (ECCV 2024 Perception Test)
- Rainfall Prediction / Weather4cast (NeurIPS 2022/2023)
- Machine Unlearning (NeurIPS Unlearning 2023)
- Next Product Recommendation (KDD Cup 2023)
- Cross-Domain Meta-Learning (NeurIPS 2022)
Benchmarks
- MLRC-BENCH (this paper)

