Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.75
Citation Count
5
Why It Matters For Business
LLMs can speed up idea generation, automate parts of experiment workflows, and draft or pre-check manuscripts. This shortens time-to-insight, but outputs need verification steps to avoid costly mistakes.
Summary TLDR
This is a focused survey that maps how large language models (LLMs) are used at four stages of science: hypothesis discovery, experiment planning & execution, paper writing, and peer review. The paper catalogs methods, datasets, benchmarks, and evaluation choices; highlights practical system patterns (agents, feedback loops, retrieval grounding); and lists key risks (hallucination, evaluation blind spots, reproducibility limits). The authors provide a resource repo for tools and datasets.
Problem Statement
Researchers lack a single, organized view of how LLMs are applied across the whole research workflow. This survey collects methods, benchmarks, evaluation practices, and gaps so practitioners can compare approaches and identify open problems in automated hypothesis generation, experiment planning, scientific writing, and peer review.
Main Contribution
A structured review of LLM applications across four research stages: discovery, experiment, writing, and review
A taxonomy of method components (e.g., inspiration retrieval, novelty/validity/clarity feedback, evolutionary search)
A consolidated list of benchmarks and evaluation trends, together with concrete gaps and research directions
Key Findings
LLMs are being applied at four stages of research: hypothesis discovery, experiment planning/implementation, writing, and peer review.
Literature- and data-driven discovery benchmarks exist; DiscoveryBench contains 264 real discovery tasks plus 903 synthetic tasks.
Human studies find LLMs can produce more novel but slightly less valid hypotheses than humans on evaluated tasks.
Retrieval-augmented generation (RAG) is a common mitigation for hallucination in writing and related-work tasks.
Peer-review automation splits into fully automated review generation and human-in-the-loop review assistants; existing benchmarks score reviews on semantic similarity, coherence, and diversity, alongside human evaluation.
Who Should Care
What To Try In 7 Days
Pilot an LLM-assisted literature digest: connect a retriever + LLM to summarize recent papers in your area and flag novel claims.
Use an LLM to generate and rank a short list of experimental plans for one project, then add human validity checks.
Add an LLM-based reviewer assistant to your internal pre-submission checklist to catch missing citations and simple inconsistencies.
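The first pilot above (retriever + LLM digest) can be sketched as follows. This is a minimal illustration rather than anything the survey prescribes: it assumes a bag-of-words cosine retriever in place of a real embedding index, the abstracts in `ABSTRACTS` are hypothetical placeholders, and the LLM call itself is left out because the choice of client is up to you.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, abstracts, k=2):
    """Return up to k abstracts that share terms with the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), title, text)
              for title, text in abstracts.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(title, text) for score, title, text in scored[:k] if score > 0]


def build_digest_prompt(query, hits):
    """Assemble a prompt that asks the model to rely only on retrieved text."""
    sources = "\n".join(f"[{i}] {title}: {text}"
                        for i, (title, text) in enumerate(hits, start=1))
    return (f"Topic: {query}\n"
            f"Sources:\n{sources}\n"
            "Summarize the sources, cite them by number, and flag any claim "
            "that looks novel. Do not state anything the sources do not support.")


# Hypothetical placeholder abstracts, not real papers.
ABSTRACTS = {
    "Paper A": "Retrieval grounding reduces hallucinated citations in related work generation.",
    "Paper B": "A multi-agent system plans and runs machine learning experiments.",
    "Paper C": "Protein language models predict structure from sequence.",
}

QUERY = "hallucinated citations in related work"
hits = retrieve(QUERY, ABSTRACTS, k=2)
prompt = build_digest_prompt(QUERY, hits)
print(prompt)  # pass this string to whichever LLM client you use
```

Numbering the sources in the prompt makes it easier for a human checker to trace each summarized claim back to the abstract it came from.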
Agent Features
Memory
- retrieval memory (external docs)
- short-term context windows
Planning
- task decomposition
- chain-of-thought prompts
- iterative reflection and self-refine
Tool Use
- retrievers (RAG)
- domain-specific tool suites (chemistry tools)
- lab automation integration
Frameworks
- AutoGen
- HuggingGPT
- AgentBench
- CycleResearcher
Is Agentic
true
Architectures
- single-model prompting
- multi-agent (specialized agents)
- modular agent-controller
Collaboration
- multi-agent coordination
- human-in-the-loop validation
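As a rough illustration of how the iterative reflection and human-in-the-loop items above fit together, the sketch below runs a critique-and-revise loop and then marks the result as awaiting human review. The function names and the toy critique rules are assumptions made for this example; in a real system `critique_fn` and `revise_fn` would wrap LLM calls.

```python
def refine(draft, critique_fn, revise_fn, max_rounds=3):
    """Critique and revise a draft until no issues remain or rounds run out."""
    history = [draft]
    for _ in range(max_rounds):
        issues = critique_fn(draft)
        if not issues:
            break
        draft = revise_fn(draft, issues)
        history.append(draft)
    return draft, history


def toy_critique(plan):
    """Hypothetical stand-in for an LLM critic: checks for two keywords."""
    issues = []
    if "control" not in plan:
        issues.append("add a control condition")
    if "metric" not in plan:
        issues.append("name the evaluation metric")
    return issues


def toy_revise(plan, issues):
    """Hypothetical stand-in for an LLM reviser: fixes the first issue only."""
    fix = issues[0]
    addition = "control group" if "control" in fix else "metric: accuracy"
    return f"{plan}; {addition}"


final_plan, history = refine("train model on dataset X", toy_critique, toy_revise)

# The loop never approves its own output: a human expert signs off last.
result = {"plan": final_plan, "rounds": len(history) - 1,
          "status": "pending_human_review"}
print(result)
```

Capping the loop with `max_rounds` bounds cost when the critic never reports a clean draft, and keeping `history` lets a reviewer see what changed between rounds.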
Reproducibility
Code Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Survey focuses on LLMs for research, not full ‘AI for Science’ breadth
- Benchmarks often depend on small expert annotations and may not scale
- Reliability of automatic validity checks is weak for lab-heavy disciplines (chemistry, biology)
- Ethical issues (plagiarism, authorship, homogenized reviews) are unresolved
When Not To Use
- When you need experimentally validated hypotheses without follow-up verification
- As a sole decision-maker for peer review or high-stakes editorial judgments
- For domain-critical lab protocols without human expert oversight
Failure Modes
- Hallucinated or unsupported hypotheses presented as facts
- Over-reliance on LLM internal knowledge causing benchmark/data leakage issues
- Poor prompt robustness causing inconsistent multi-stage plans
- Homogenization of reviews or ideas when many users rely on the same LLM outputs
Core Entities
Models
- GPT-4
- LLaMA
- ESM-1b
- ESM-2
Metrics
- novelty
- validity/feasibility
- clarity
- ROUGE
- BLEU
- BERTScore
- MAUVE
- human evaluation (expert ratings)
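To make the overlap-based metrics in this list concrete, here is a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is a sketch under stated assumptions (whitespace tokenization, no stemming); published ROUGE implementations offer more options and also report variants such as ROUGE-2 and ROUGE-L, so scores from this function are not comparable to benchmark numbers.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # matches, clipped by reference counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative sentences, not taken from any benchmark.
score = rouge1_f1("the model generates novel hypotheses",
                  "the model proposes novel hypotheses")
print(round(score, 2))  # 4 of 5 unigrams match on each side, so about 0.8
```

Because the score only counts shared words, a fluent but factually wrong sentence can still score well, which is one reason the list pairs these metrics with expert human evaluation.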
Datasets
- DiscoveryBench (264 real + 903 synthetic tasks)
- DiscoveryWorld
- SciMON
- SciGen
- SciXGen
- ALCE
- CiteBench
Benchmarks
- DiscoveryBench
- DiscoveryWorld
- TaskBench
- MLAgentBench
- AgentBench
- MLE-Bench
- SciGen
- SciXGen
- ALCE
- CiteBench
- PeerRead
- MOPRD
- NLPeer
Context Entities
Models
- ChatGPT
- Codex
- OpenFold
- AlphaFold
Metrics
- acceptance prediction
- coverage & specificity
- semantic similarity
- coherence & relevance
- diversity & specificity
Datasets
- S2ORC
- AAN
- SciSummNet
- CORWA
- ASQA
- ELI5
Benchmarks
- DiscoveryBench
- ScienceAgentBench
- MLAgentBench
- LAB-Bench
- DSBench
- CORE-Bench

