AI Scientists can generate ideas but routinely fail to implement and verify them

Overview

Decision Snapshot: Needs Validation

Evidence is consistent across diverse benchmarks and a simulated peer review, but small sample sizes and selection bias limit certainty. Treat the conclusions as strong for current systems and expect them to change as systems improve.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 100%

Novelty: 60%

Authors

Minjun Zhu, Qiujie Xie, Yixuan Weng, Jian Wu, Zhen Lin, Linyi Yang, Yue Zhang

Links

Abstract / PDF / Code

Why It Matters For Business

AI-generated research outputs are often non-executable or non-reproducible; businesses using "AI Scientists" must invest in execution pipelines, human verification, and testing to avoid wasted spend and flawed products.

Who Should Care

CTO, Product Manager, ML Engineer, Founder, Data Scientist, Engineering Lead

Summary TLDR

This position paper argues that the main limitation of current "AI Scientist" systems is implementation: models generate novel ideas but fail to execute, verify, and iterate experiments reliably. Evidence across multiple benchmarks shows execution rates often under 50% (e.g., PaperBench execution 1.8% for Claude 3.5 Sonnet, SciReplicate best 39%). A simulated peer review of 28 AI-generated papers found experimental weaknesses in 100% of samples. The authors call for focused work on planning, execution tools, verification protocols, and evaluation frameworks.

Problem Statement

AI Scientist systems can produce novel research ideas but lack the engineering and verification skills to turn ideas into correct, reproducible experiments. Benchmarks and an LLM-based peer review of 28 AI-generated papers show persistent failures in code execution, debugging, and experimental validation, limiting scientific value.

Main Contribution

Define an AI Scientist as an end-to-end system that must both generate ideas and execute verification procedures.

Aggregate benchmark evidence showing low execution/replication rates for modern LLMs on real research tasks.

Key Findings

Top LLMs fail basic experiment execution on real-paper replication tasks.

Numbers: PaperBench execution 1.8% (Claude 3.5 Sonnet)

Practical Use: Do not trust current AI Scientists to autonomously run and verify research code; add human-run execution checks and test harnesses.

Evidence Ref: Section 3.2 / Table 1 and text

Code-generation agents often produce non-functional code in research reproduction tasks.

Numbers: SciReplicate best execution accuracy 39%

Practical Use: Expect frequent runtime failures; integrate automated execution tests and continuous debugging loops when using generated code.

Evidence Ref: Section 3.2 / SciReplicate-Bench
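The "continuous debugging loop" recommended above can be sketched as a bounded run-and-repair cycle. This is a minimal illustration and not a method from the paper: `run_snippet`, `debug_loop`, and the `repair` callback are hypothetical names, and `repair` stands in for whatever model call or human step produces a revised script from the error output.

```python
import subprocess
import sys
from typing import Callable, Optional, Tuple


def run_snippet(source: str, timeout_s: float = 10.0) -> Tuple[bool, str]:
    """Execute source in a child interpreter; return (succeeded, stderr)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    return proc.returncode == 0, proc.stderr


def debug_loop(
    source: str,
    repair: Callable[[str, str], str],  # hypothetical: (source, stderr) -> new source
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Run, and on failure ask `repair` for a new version, up to max_rounds runs.

    Returns (working_source, runs_used), or (None, max_rounds) if every run failed.
    """
    for attempt in range(1, max_rounds + 1):
        ok, stderr = run_snippet(source)
        if ok:
            return source, attempt
        if attempt < max_rounds:
            source = repair(source, stderr)
    return None, max_rounds
```

A clean exit only shows that the script ran; it says nothing about whether the results are correct, so a loop like this complements human review rather than replacing it.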

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
PaperBench execution (replication 'Execution' rubric)	1.8%	—	—	PaperBench	Claude 3.5 Sonnet scored 1.8% on 'Execution' node	Section 3.2
Execution accuracy	39%	—	—	SciReplicate-Bench	Best agent achieved 39% execution accuracy on reproducing NLP algorithms	Section 3.2

What To Try In 7 Days

Run a reproducibility smoke test on recent AI outputs: execute the generated code in a sandbox.

Add unit and integration tests to any AI-generated code pipeline.

Label AI-generated reports and require human sign-off before release or publication.
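As a starting point for the smoke test in the first item, the sketch below runs each generated script in a fresh interpreter with a timeout and reports the share that exit cleanly. The function names `smoke_test` and `execution_rate` are illustrative assumptions. A child process in a temporary directory is only weak isolation, so untrusted code would still need a container or VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def smoke_test(source: str, timeout_s: float = 10.0) -> dict:
    """Run one generated script in a fresh interpreter and record the outcome."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I runs Python in isolated mode (ignores user site and env vars).
                [sys.executable, "-I", str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "reason": "timeout", "stderr": ""}
    ok = proc.returncode == 0
    return {
        "ok": ok,
        "reason": "ok" if ok else "nonzero exit",
        "stderr": proc.stderr[-500:],  # keep only the tail for the report
    }


def execution_rate(sources: list[str]) -> float:
    """Fraction of scripts that ran to completion with exit code 0."""
    if not sources:
        return 0.0
    results = [smoke_test(s) for s in sources]
    return sum(r["ok"] for r in results) / len(results)
```

Calling `execution_rate` on a batch of generated scripts gives a rough in-house counterpart to the execution figures quoted above, though it measures only whether code runs, not whether it reproduces a result.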

Agent Features

Memory

short-term memory limitations, long-context degradation

Planning

long-horizon planning (weak), hierarchical planning suggested

Tool Use

external API calls, code execution environments, lab/autonomous tools (discussed)

Frameworks

RAG, MCP, A2A, PASA

Is Agentic

Yes

Architectures

LLM-based agent pipeline, multi-agent planner-controller setups

Collaboration

multi-agent coordination, human-in-the-loop workflows

Optimization Features

Token Efficiency

use RAG to reduce context burden

Infra Optimization

estimate high wall-clock cost for end-to-end scientist loops

System Optimization

structured workflows and modular agents

Training Optimization

simulate environment to speed RL sampling

Inference Optimization

batch generation to improve sample throughput (discussed)

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/ResearAI/Awesome-AI-Scientist

Risks & Boundaries

Limitations

Weak long-range reasoning and context retention for multi-step experiments.

Poor strategic planning for large codebases and long-horizon research.

When Not To Use

In safety-sensitive or dual-use research without strict human oversight.

As a sole replacement for human experimental verification.

Failure Modes

Generated code compiles but fails at runtime or produces wrong results.

Inability to debug or iteratively improve experiments.
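The first failure mode is the harder one to catch, because a run that exits cleanly can still report a wrong number. One hedged way to separate the cases is to compare the reproduced metric against the reported one within a tolerance. The function name `classify_run`, the outcome labels, and the 5% default tolerance are all assumptions chosen for illustration.

```python
import math
from typing import Optional


def classify_run(
    exit_code: int,
    reproduced: Optional[float],
    reported: float,
    rel_tol: float = 0.05,  # assumed tolerance; tune per metric
) -> str:
    """Label a run as crashed, no_result, wrong_result, or verified."""
    if exit_code != 0:
        return "crashed"
    if reproduced is None:
        # The script ran but never emitted the metric we wanted to check.
        return "no_result"
    if math.isclose(reproduced, reported, rel_tol=rel_tol):
        return "verified"
    return "wrong_result"
```

For example, with made-up values, `classify_run(0, 0.80, 0.81)` falls within tolerance, while `classify_run(0, 0.60, 0.81)` is flagged for human review.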

Core Entities

Models

Claude 3.5 Sonnet, GPT-4o, o1-high, o3, o4-mini, DeepReviewer-14B, Deepseek-R1

Metrics

Accuracy, replication score, pass@1, success rate, peer-review rating

Benchmarks

PaperBench, SciReplicate-Bench, CORE-Bench, MLE-Bench, ML-Dev-Bench, LiveCodeBench, HumanEval, ScienceAgentBench

Context Entities

Models

AlphaFold (example of successful automated tool)
