Overview
The survey documents practical systems and benchmarks but also highlights substantial reliability gaps (hallucinations, weak validity checks, domain limits). Pilots and supervised deployments are practical now, but full automation is premature.
Citations: 5
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 1/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/0
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 75%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
LLMs can speed up idea generation, automate parts of experiment workflows, and draft or pre-check manuscripts. This shortens time-to-insight, but verification steps are still required to avoid costly mistakes.
Who Should Care
Summary TLDR
This is a focused survey that maps how large language models (LLMs) are used at four stages of science: hypothesis discovery, experiment planning & execution, paper writing, and peer review. The paper catalogs methods, datasets, benchmarks, and evaluation choices; highlights practical system patterns (agents, feedback loops, retrieval grounding); and lists key risks (hallucination, evaluation blind spots, reproducibility limits). The authors provide a resource repo for tools and datasets.
Problem Statement
Researchers lack a single, organized view of how LLMs are applied across the whole research workflow. This survey collects methods, benchmarks, evaluation practices, and gaps so practitioners can compare approaches and identify open problems in automated hypothesis generation, experiment planning, scientific writing, and peer review.
Main Contribution
A structured review of LLM applications across four research stages: discovery, experiment, writing, and review
A taxonomy of method components (e.g., inspiration retrieval, novelty/validity/clarity feedback, evolutionary search)
Key Findings
LLMs are being applied at four stages of research: hypothesis discovery, experiment planning/implementation, writing, and peer review.
Literature- and data-driven discovery benchmarks exist; DiscoveryBench contains 264 real discovery tasks plus 903 synthetic tasks.
What To Try In 7 Days
Pilot an LLM-assisted literature digest: connect a retriever + LLM to summarize recent papers in your area and flag novel claims.
Use an LLM to generate and rank a short list of experimental plans for one project, then add human validity checks.
Add an LLM-based reviewer assistant to your internal pre-submission checklist to catch missing citations and simple inconsistencies.
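The literature-digest pilot above can be prototyped in a few lines. The sketch below is illustrative only: it assumes a simple keyword retriever built on the Python standard library, and the LLM is represented by a caller-supplied summarize function. The names rank_abstracts, digest, and first_sentence are hypothetical, and first_sentence is a stand-in stub rather than a real model call.

```python
import math
import re
from collections import Counter
from typing import Callable


def tokenize(text: str) -> list[str]:
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_abstracts(query: str, abstracts: list[str], top_k: int = 2) -> list[int]:
    """Return indices of the top_k abstracts by IDF-weighted term overlap with the query."""
    docs = [Counter(tokenize(a)) for a in abstracts]
    n = len(docs)
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())
    query_terms = set(tokenize(query))

    def score(doc: Counter) -> float:
        # Rare terms get a higher weight than terms that appear in most abstracts.
        return sum(
            doc[t] * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
            for t in query_terms
            if t in doc
        )

    order = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    return order[:top_k]


def digest(
    query: str,
    abstracts: list[str],
    summarize: Callable[[str], str],
    top_k: int = 2,
) -> list[dict]:
    """Retrieve the most relevant abstracts and summarize each one.

    `summarize` is where an LLM call would go; every item is flagged for human review.
    """
    return [
        {"index": i, "summary": summarize(abstracts[i]), "needs_human_check": True}
        for i in rank_abstracts(query, abstracts, top_k)
    ]


def first_sentence(text: str) -> str:
    """Stub summarizer used in place of an LLM for local testing."""
    return text.split(". ")[0].rstrip(".") + "."
```

Passing the summarizer in as a function keeps the retrieval step testable without network access, and the needs_human_check flag reflects the verification step this page recommends.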
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Reproducibility
Code URLs
Risks & Boundaries
Limitations
Survey focuses on LLMs for research, not full ‘AI for Science’ breadth
Benchmarks often depend on small expert annotations and may not scale
When Not To Use
When you need experimentally validated hypotheses without follow-up verification
As a sole decision-maker for peer review or high-stakes editorial judgments
Failure Modes
Hallucinated or unsupported hypotheses presented as facts
Over-reliance on LLM internal knowledge causing benchmark/data leakage issues

