Overview
Evidence is consistent across diverse benchmarks and a simulated peer review, but small sample sizes and selection bias limit certainty; treat the conclusions as strong for current systems and subject to revision by future work.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 0/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 100%
Novelty: 60%
Why It Matters For Business
AI-generated research outputs are often non-executable or non-reproducible; businesses using "AI Scientists" must invest in execution pipelines, human verification, and testing to avoid wasted spend and flawed products.
Who Should Care
Summary TLDR
This position paper argues that the main limitation of current "AI Scientist" systems is implementation: models generate novel ideas but fail to execute, verify, and iterate on experiments reliably. Evidence across multiple benchmarks shows execution rates often under 50% (e.g., Claude 3.5 Sonnet scores 1.8% on the PaperBench 'Execution' rubric, and the best agent on SciReplicate-Bench reaches 39%). A simulated peer review of 28 AI-generated papers found experimental weaknesses in 100% of samples. The authors call for focused work on planning, execution tools, verification protocols, and evaluation frameworks.
Problem Statement
AI Scientist systems can produce novel research ideas but lack the engineering and verification skills to turn ideas into correct, reproducible experiments. Benchmarks and an LLM-based peer review of 28 AI-generated papers show persistent failures in code execution, debugging, and experimental validation, limiting scientific value.
Main Contribution
Define an AI Scientist as an end-to-end system that must both generate ideas and execute verification procedures.
Aggregate benchmark evidence showing low execution/replication rates for modern LLMs on real research tasks.
Key Findings
Top LLMs fail basic experiment execution on real-paper replication tasks.
Code-generation agents often produce non-functional code in research reproduction tasks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| PaperBench execution (replication 'Execution' rubric) | 1.8% | — | — | PaperBench | Claude 3.5 Sonnet scored 1.8% on 'Execution' node | Section 3.2 |
| Execution accuracy | 39% | — | — | SciReplicate-Bench | Best agent achieved 39% execution accuracy on reproducing NLP algorithms | Section 3.2 |
What To Try In 7 Days
Run a reproducibility smoke test for recent AI outputs: execute generated code in a sandbox.
Add unit and integration tests to any AI-generated code pipeline.
Label AI-generated reports and require human sign-off before release or publication.
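A minimal sketch of the smoke-test step, assuming the AI-generated output is a standalone Python script. The function name `smoke_test`, its parameters, and the returned keys are hypothetical and do not come from the paper. A subprocess with a timeout only isolates the working directory and bounds the run time; it is not a security sandbox, so untrusted code would still need a container or VM around it.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def smoke_test(code: str, timeout_s: float = 30.0) -> dict:
    """Run `code` as a script in a throwaway directory and report what happened.

    Hypothetical helper for illustration: it checks only that the script
    starts, finishes in time, and exits with status 0.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "generated.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,            # keep any files the script writes out of the repo
                capture_output=True,    # collect stdout/stderr for the human reviewer
                text=True,
                timeout=timeout_s,      # bound runaway experiments
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout", "returncode": None, "stdout": "", "stderr": ""}
    status = "ok" if proc.returncode == 0 else "failed"
    return {
        "status": status,
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

A result of `"ok"` means only that the script ran to completion; it says nothing about whether the results are correct, which is why the tests and human sign-off listed above are still needed.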
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Training Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Weak long-range reasoning and context retention for multi-step experiments.
Poor strategic planning for large codebases and long-horizon research.
When Not To Use
In safety-sensitive or dual-use research without strict human oversight.
As a sole replacement for human experimental verification.
Failure Modes
Generated code compiles but fails at runtime or produces wrong results.
Inability to debug or iteratively improve experiments.
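The first failure mode can be made concrete with a small triage check that separates "does not parse", "crashes at runtime", and "runs but returns the wrong answer". This is an illustrative sketch, not a method from the paper: `classify_failure` and its labels are hypothetical, and it assumes the generated code defines a function whose expected output is known for at least one input. Because `exec` runs the code in-process, it should be used only inside an already isolated environment.

```python
def classify_failure(code: str, func_name: str, args: tuple, expected) -> str:
    """Return 'syntax_error', 'runtime_error', 'wrong_result', or 'ok'."""
    # Stage 1: does the code even parse?
    try:
        compiled = compile(code, "<generated>", "exec")
    except SyntaxError:
        return "syntax_error"

    # Stage 2: does it load and run on a known input without raising?
    namespace: dict = {}
    try:
        exec(compiled, namespace)
        result = namespace[func_name](*args)
    except Exception:
        return "runtime_error"

    # Stage 3: does it produce the expected value?
    return "ok" if result == expected else "wrong_result"


if __name__ == "__main__":
    good = "def add(a, b):\n    return a + b\n"
    wrong = "def add(a, b):\n    return a - b\n"
    print(classify_failure(good, "add", (2, 3), 5))
    print(classify_failure(wrong, "add", (2, 3), 5))
```

Keeping the three stages separate matters because code that passes the first two can still fail the third, which is the case a simple "did it run" check misses.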

