AI Scientists can generate ideas but routinely fail to implement and verify them

June 2, 2025 · 7 min

Overview

Production Readiness

1

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Minjun Zhu, Qiujie Xie, Yixuan Weng, Jian Wu, Zhen Lin, Linyi Yang, Yue Zhang


Why It Matters For Business

AI-generated research outputs are often non-executable or non-reproducible; businesses using "AI Scientists" must invest in execution pipelines, human verification, and testing to avoid wasted spend and flawed products.

Summary TLDR

This position paper argues that the main limitation of current "AI Scientist" systems is implementation: models generate novel ideas but fail to execute, verify, and iterate experiments reliably. Evidence across multiple benchmarks shows execution rates often under 50% (e.g., PaperBench execution 1.8% for Claude 3.5 Sonnet, SciReplicate best 39%). A simulated peer review of 28 AI-generated papers found experimental weaknesses in 100% of samples. The authors call for focused work on planning, execution tools, verification protocols, and evaluation frameworks.

Problem Statement

AI Scientist systems can produce novel research ideas but lack the engineering and verification skills to turn ideas into correct, reproducible experiments. Benchmarks and an LLM-based peer review of 28 AI-generated papers show persistent failures in code execution, debugging, and experimental validation, limiting scientific value.

Main Contribution

Defines an AI Scientist as an end-to-end system that must both generate ideas and execute verification procedures.

Aggregates benchmark evidence showing low execution and replication rates for modern LLMs on real research tasks.

Runs a systematic simulated peer review of 28 AI-generated papers, revealing consistent experimental and methodological defects.

Discusses four root causes and practical directions for closing the implementation gap.

Key Findings

Top LLMs fail basic experiment execution on real-paper replication tasks.

Numbers: PaperBench execution 1.8% (Claude 3.5 Sonnet)

Code-generation agents often produce non-functional code in research reproduction tasks.

Numbers: SciReplicate best execution accuracy 39%

AI-generated research papers show pervasive experimental and methodological flaws.

Numbers: 28/28 papers (100%) had experimental weakness in DeepReviewer-14B assessment

Results

PaperBench execution (replication 'Execution' rubric)

Value: 1.8%

SciReplicate-Bench execution accuracy (best)

Value: 39%

CORE-Bench medium reproduction score (example)

Value: 55.56%

DeepReviewer-14B avg rating (best system sample)

Value: 4.63

Baseline: rating scale up to 10 (6 acceptable)

Who Should Care

What To Try In 7 Days

Run a reproducibility smoke test for recent AI outputs: execute the generated code in a sandbox.

Add unit and integration tests to any AI-generated code pipeline.

Label AI-generated reports and require human sign-off before release or publication.
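
The three steps above can be sketched as a small gate. This is a minimal illustration and not something the paper provides: the names `SmokeResult`, `smoke_test`, and `can_release` are hypothetical, and a temporary directory plus a child process is assumed to stand in for a real sandbox (use a container or VM for untrusted code).

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SmokeResult:
    ran: bool          # True if the script exited with code 0
    exit_code: int     # process exit code, or -1 on timeout
    stderr_tail: str   # last part of stderr, kept for triage


def smoke_test(source: str, timeout_s: float = 30.0) -> SmokeResult:
    """Run generated Python source in a throwaway directory and child process.

    NOTE: this gives convenience isolation only; it is not a security sandbox.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "generated.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return SmokeResult(ran=False, exit_code=-1, stderr_tail="timeout")
        return SmokeResult(proc.returncode == 0, proc.returncode, proc.stderr[-500:])


def can_release(result: SmokeResult, human_signed_off: bool) -> bool:
    """Release gate: the smoke test must pass AND a human must have signed off."""
    return result.ran and human_signed_off
```

A passing smoke test only shows that the code runs; the unit and integration tests from the second step are still needed to check that it computes the right thing.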

Agent Features

Memory

  • short-term memory limitations
  • long-context degradation

Planning

  • long-horizon planning (weak)
  • hierarchical planning suggested

Tool Use

  • external API calls
  • code execution environments
  • lab/autonomous tools (discussed)

Frameworks

  • RAG
  • MCP
  • A2A
  • PASA

Is Agentic

true

Architectures

  • LLM-based agent pipeline
  • multi-agent planner-controller setups

Collaboration

  • multi-agent coordination
  • human-in-the-loop workflows

Optimization Features

Token Efficiency

  • use RAG to reduce context burden

Infra Optimization

  • estimate high wall-clock cost for end-to-end scientist loops

System Optimization

  • structured workflows and modular agents

Training Optimization

  • simulate environment to speed RL sampling

Inference Optimization

  • batch generation to improve sample throughput (discussed)

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Weak long-range reasoning and context retention for multi-step experiments.
  • Poor strategic planning for large codebases and long-horizon research.
  • Fragile multi-agent/tool coordination and API handling.
  • Lack of holistic benchmarks that evaluate end-to-end idea-to-verified-result workflows.

When Not To Use

  • In safety-sensitive or dual-use research without strict human oversight.
  • As a sole replacement for human experimental verification.
  • When reproducible, production-grade code is required immediately.

Failure Modes

  • Generated code compiles but fails at runtime or produces wrong results.
  • Inability to debug or iteratively improve experiments.
  • Overstated novelty or unsupported theoretical claims.
  • Undetected ethical or safety risks in generated research.
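
The first failure mode separates three distinct outcomes: code that does not parse, code that parses but crashes, and code that runs but returns the wrong answer. A rough sketch of telling them apart follows. The function `classify`, the `entry` function name, and the `check` callback are illustrative assumptions, not part of any benchmark named here.

```python
from typing import Any, Callable


def classify(source: str, entry: str, check: Callable[[Any], bool]) -> str:
    """Return 'syntax_error', 'runtime_error', 'wrong_result', or 'ok'.

    `entry` is the (hypothetical) name of a zero-argument function the
    generated code is expected to define; `check` validates its return value.
    WARNING: exec() runs the code in-process; only do this inside a sandbox.
    """
    try:
        code = compile(source, "<generated>", "exec")  # parses, does not run
    except SyntaxError:
        return "syntax_error"
    namespace: dict = {}
    try:
        exec(code, namespace)          # defines the functions
        result = namespace[entry]()    # a missing entry point counts as a runtime error
    except Exception:
        return "runtime_error"
    return "ok" if check(result) else "wrong_result"
```

Only the last branch needs a reference value to compare against, which is why "runs without error" alone is a weak success criterion.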

Core Entities

Models

  • Claude 3.5 Sonnet
  • GPT-4o
  • o1-high
  • o3
  • o4-mini
  • DeepReviewer-14B
  • DeepSeek-R1

Metrics

  • Accuracy
  • replication score
  • pass@1
  • success rate
  • peer-review rating
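
For reference, pass@1 is usually reported with the unbiased pass@k estimator that is standard in code-generation evaluation. The source does not state how each listed benchmark computes it, so the sketch below is an assumption about the common definition, with `pass_at_k` as an illustrative name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n = samples generated per task, c = samples that passed, k = sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to c / n, the plain fraction of samples that pass.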

Benchmarks

  • PaperBench
  • SciReplicate-Bench
  • CORE-Bench
  • MLE-Bench
  • ML-Dev-Bench
  • LiveCodeBench
  • HumanEval
  • ScienceAgentBench

Context Entities

Models

  • AlphaFold (example of successful automated tool)