AI Scientists can generate ideas but routinely fail to implement and verify them

June 2, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

Evidence is consistent across diverse benchmarks and a simulated peer review, but small sample sizes and selection bias limit certainty. Treat the conclusions as strong for current systems but subject to revision as future work appears.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 100%

Novelty: 60%

Authors

Minjun Zhu, Qiujie Xie, Yixuan Weng, Jian Wu, Zhen Lin, Linyi Yang, Yue Zhang

Links

Abstract / PDF / Code

Why It Matters For Business

AI-generated research outputs are often non-executable or non-reproducible; businesses using "AI Scientists" must invest in execution pipelines, human verification, and testing to avoid wasted spend and flawed products.

Summary TLDR

This position paper argues that the main limitation of current "AI Scientist" systems is implementation: models generate novel ideas but fail to execute, verify, and iterate on experiments reliably. Evidence across multiple benchmarks shows execution rates often under 50% (e.g., PaperBench execution of 1.8% for Claude 3.5 Sonnet; SciReplicate-Bench best execution accuracy of 39%). A simulated peer review of 28 AI-generated papers found experimental weaknesses in 100% of samples. The authors call for focused work on planning, execution tools, verification protocols, and evaluation frameworks.

Problem Statement

AI Scientist systems can produce novel research ideas but lack the engineering and verification skills to turn ideas into correct, reproducible experiments. Benchmarks and an LLM-based peer review of 28 AI-generated papers show persistent failures in code execution, debugging, and experimental validation, limiting scientific value.

Main Contribution

Define an AI Scientist as an end-to-end system that must both generate ideas and execute verification procedures.

Aggregate benchmark evidence showing low execution/replication rates for modern LLMs on real research tasks.

Key Findings

Top LLMs fail basic experiment execution on real-paper replication tasks.

Numbers: PaperBench execution 1.8% (Claude 3.5 Sonnet)

Practical Use: Do not trust current AI Scientists to autonomously run and verify research code; add human-run execution checks and test harnesses.

Evidence Ref: Section 3.2 / Table 1 and text

Code-generation agents often produce non-functional code in research reproduction tasks.

Numbers: SciReplicate best execution accuracy 39%

Practical Use: Expect frequent runtime failures; integrate automated execution tests and continuous debugging loops when using generated code.

Evidence Ref: Section 3.2 / SciReplicate-Bench
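
The "automated execution tests and continuous debugging loops" recommended above can be sketched roughly as follows. This is a minimal illustration, not the paper's method: `run_script`, `execute_with_repair`, and the `repair` callback are hypothetical names, and `repair` stands in for whatever produces a revised script (an LLM call or a human edit). A temporary directory plus a subprocess timeout is also only a weak stand-in for a real sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def run_script(code: str, timeout_s: float = 30.0) -> Tuple[bool, str]:
    """Run `code` in a fresh Python subprocess; return (succeeded, stderr)."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, f"timed out after {timeout_s} s"
        return proc.returncode == 0, proc.stderr


def execute_with_repair(
    code: str,
    repair: Callable[[str, str], str],  # hypothetical: (code, stderr) -> revised code
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Run the script, feeding each failure back to `repair` until it exits cleanly.

    Returns (working_code, attempts_used), or (None, max_attempts) on give-up.
    """
    for attempt in range(1, max_attempts + 1):
        succeeded, stderr = run_script(code)
        if succeeded:
            return code, attempt
        if attempt < max_attempts:
            code = repair(code, stderr)
    return None, max_attempts
```

A clean exit code only shows that the script ran; it says nothing about whether the results are correct, so a loop like this complements human verification rather than replacing it.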

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
PaperBench execution (replication 'Execution' rubric) | 1.8% | - | - | PaperBench | Claude 3.5 Sonnet scored 1.8% on 'Execution' node | Section 3.2
Accuracy | 39% | - | - | SciReplicate-Bench | Best agent achieved 39% execution accuracy on reproducing NLP algorithms | Section 3.2

What To Try In 7 Days

Run a reproducibility smoke test for recent AI outputs: execute generated code in a sandbox.

Add unit and integration tests to any AI-generated code pipeline.

Label AI-generated reports and require human sign-off before release or publication.
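
One way to make the three steps above enforceable is a small release gate that refuses to ship anything with an open blocker. The sketch below is illustrative only: the `Artifact` record, its field names, and `release_blockers` are assumptions introduced here, and the smoke-test and test-suite results are assumed to be recorded by separate tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Artifact:
    """Hypothetical record for one AI-assisted report or code deliverable."""
    name: str
    ai_generated: bool            # the explicit label from step three
    smoke_test_passed: bool       # result of the sandbox run from step one
    tests_passed: bool            # result of unit/integration tests from step two
    approved_by: Optional[str] = None  # human reviewer, if any


def release_blockers(artifact: Artifact) -> List[str]:
    """Return the reasons this artifact may not be released (empty means clear)."""
    blockers: List[str] = []
    if not artifact.smoke_test_passed:
        blockers.append("sandbox smoke test has not passed")
    if not artifact.tests_passed:
        blockers.append("unit/integration tests have not passed")
    if artifact.ai_generated and not artifact.approved_by:
        blockers.append("AI-generated artifact lacks human sign-off")
    return blockers
```

Returning the list of reasons, rather than a bare boolean, lets a reviewer see exactly which check is outstanding.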

Agent Features

Memory
short-term memory limitations, long-context degradation
Planning
long-horizon planning (weak), hierarchical planning suggested
Tool Use
external API calls, code execution environments, lab/autonomous tools (discussed)
Frameworks
RAG, MCP, A2A, PASA
Is Agentic

Yes

Architectures
LLM-based agent pipeline, multi-agent planner-controller setups
Collaboration
multi-agent coordination, human-in-the-loop workflows

Optimization Features

Token Efficiency
use RAG to reduce context burden
Infra Optimization
estimate high wall-clock cost for end-to-end scientist loops
System Optimization
structured workflows and modular agents
Training Optimization
simulate environment to speed RL sampling
Inference Optimization
batch generation to improve sample throughput (discussed)

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Weak long-range reasoning and context retention for multi-step experiments.

Poor strategic planning for large codebases and long-horizon research.

When Not To Use

In safety-sensitive or dual-use research without strict human oversight.

As a sole replacement for human experimental verification.

Failure Modes

Generated code compiles but fails at runtime or produces wrong results.

Inability to debug or iteratively improve experiments.
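
The first failure mode distinguishes three stages at which generated code can break: it may not compile, it may compile but raise at runtime, or it may run and return a wrong answer. The sketch below separates those cases for a single generated function. It is an assumption-laden illustration: `classify_failure` is a hypothetical helper, and it uses in-process `exec`, which is not a sandbox and should not be used on untrusted code outside an isolated environment.

```python
from typing import Any, Dict, Tuple


def classify_failure(source: str, func_name: str, args: Tuple[Any, ...], expected: Any) -> str:
    """Return 'syntax_error', 'runtime_error', 'wrong_result', or 'ok'."""
    # Stage 1: does the code compile at all?
    try:
        compiled = compile(source, "<generated>", "exec")
    except SyntaxError:
        return "syntax_error"

    # Stage 2: does it define the function and run without raising?
    # NOTE: exec runs in this process; illustration only, not isolation.
    namespace: Dict[str, Any] = {}
    try:
        exec(compiled, namespace)
        result = namespace[func_name](*args)
    except Exception:
        return "runtime_error"

    # Stage 3: does it produce the reference answer?
    return "ok" if result == expected else "wrong_result"
```

For example, a generated `mean` that divides by `len(xs) - 1` passes the first two stages and is caught only by the comparison against a known reference value, which is why execution checks alone are not sufficient.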

Core Entities

Models

Claude 3.5 Sonnet, GPT-4o, o1-high, o3, o4-mini, DeepReviewer-14B, Deepseek-R1

Metrics

Accuracy, replication score, pass@1, success rate, peer-review rating

Benchmarks

PaperBench, SciReplicate-Bench, CORE-Bench, MLE-Bench, ML-Dev-Bench, LiveCodeBench, HumanEval, ScienceAgentBench

Context Entities

Models

AlphaFold (example of successful automated tool)