Survey: How LLMs are being used across the full scientific research cycle

January 8, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The survey documents practical systems and benchmarks but also highlights substantial reliability gaps (hallucinations, weak validity checks, domain limits). Pilots and supervised deployments are practical now, but full automation is premature.

Citations: 5

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 1/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 75%

Production readiness: 60%

Novelty: 70%

Authors

Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, Xinya Du

Why It Matters For Business

LLMs can speed idea generation, automate parts of experiment workflows, and draft or pre-check manuscripts. This reduces time-to-insight but requires verification steps to avoid costly mistakes.

Who Should Care

Summary TLDR

This is a focused survey that maps how large language models (LLMs) are used at four stages of science: hypothesis discovery, experiment planning & execution, paper writing, and peer review. The paper catalogs methods, datasets, benchmarks, and evaluation choices; highlights practical system patterns (agents, feedback loops, retrieval grounding); and lists key risks (hallucination, evaluation blind spots, reproducibility limits). The authors provide a resource repo for tools and datasets.

Problem Statement

Researchers lack a single, organized view of how LLMs are applied across the whole research workflow. This survey collects methods, benchmarks, evaluation practices, and gaps so practitioners can compare approaches and identify open problems in automated hypothesis generation, experiment planning, scientific writing, and peer review.

Main Contribution

A structured review of LLM applications across four research stages: discovery, experiment, writing, and review

A taxonomy of method components (e.g., inspiration retrieval, novelty/validity/clarity feedback, evolutionary search)

Key Findings

LLMs are being applied at four stages of the research cycle: hypothesis discovery, experiment planning/implementation, writing, and peer review.

Practical Use: Treat LLMs as workflow tools: pick the stage you want to accelerate (idea, lab planning, writing, or review) and adopt the task-specific methods and benchmarks discussed.

Evidence Ref: Introduction; Fig. 1; §2–§5

Literature- and data-driven discovery benchmarks exist; DiscoveryBench contains 264 real discovery tasks plus 903 synthetic tasks.

Numbers: DiscoveryBench: 264 real + 903 synthetic tasks

Practical Use: If you want to evaluate hypothesis-generation systems, start with DiscoveryBench to test data-driven discovery performance.

Evidence Ref: §2.4.2 (DiscoveryBench [108])

What To Try In 7 Days

Pilot an LLM-assisted literature digest: connect a retriever + LLM to summarize recent papers in your area and flag novel claims.

Use an LLM to generate and rank a short list of experimental plans for one project, then add human validity checks.

Add an LLM-based reviewer assistant to your internal pre-submission checklist to catch missing citations and simple inconsistencies.
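The first pilot above (retriever + LLM digest) can be sketched as follows. This is a minimal illustration, not a method from the survey: the retriever is a plain bag-of-words cosine ranking, and `echo_llm` is a hypothetical placeholder for whatever model API you actually use. The function names (`retrieve`, `digest`) and the prompt wording are assumptions made for this example.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, abstracts: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k abstracts most similar to the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(a))), a) for a in abstracts]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def digest(query: str, abstracts: List[str],
           summarize: Callable[[str], str], k: int = 2) -> str:
    """Build a grounded prompt from the top-k abstracts and pass it to the LLM."""
    hits = retrieve(query, abstracts, k)
    context = "\n".join(f"- {text}" for _, text in hits)
    prompt = (
        f"Summarize the papers below and flag novel claims about: {query}\n"
        f"{context}"
    )
    return summarize(prompt)


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns the prompt unchanged."""
    return prompt


if __name__ == "__main__":
    papers = [
        "Protein folding with language models",
        "Graph neural networks for traffic forecasting",
        "Language models generate hypotheses for chemistry",
    ]
    print(digest("language models hypotheses", papers, echo_llm))
```

Passing the model call in as a function keeps the retrieval step testable on its own, which is useful when the pilot's output still needs human validity checks.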

Agent Features

Memory
retrieval memory (external docs), short-term context windows
Planning
task decomposition, chain-of-thought prompts, iterative reflection and self-refine
Tool Use
retrievers (RAG), domain-specific tool suites (chemistry tools), lab automation integration
Frameworks
AutoGen, HuggingGPT, AgentBench, CycleResearcher
Is Agentic

Yes

Architectures
single-model prompting, multi-agent (specialized agents), modular agent-controller
Collaboration
multi-agent coordination, human-in-the-loop validation
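The "iterative reflection and self-refine" pattern listed under Planning, combined with novelty/validity/clarity feedback, might look roughly like the loop below. This is a sketch under stated assumptions: `critique` and `revise` are hypothetical callables standing in for LLM calls, the scores are assumed to lie in [0, 1], and the threshold and round limit are arbitrary illustration values. `critique` is assumed to return at least one score.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Feedback = Dict[str, float]  # e.g. {"novelty": 0.4, "validity": 0.8, "clarity": 0.9}


@dataclass
class RefineResult:
    hypothesis: str
    feedback: Feedback
    rounds: int
    converged: bool  # True if every score reached the threshold


def self_refine(
    draft: str,
    critique: Callable[[str], Feedback],
    revise: Callable[[str, Feedback], str],
    threshold: float = 0.7,
    max_rounds: int = 3,
) -> RefineResult:
    """Revise a hypothesis until all feedback scores pass or rounds run out."""
    hypothesis = draft
    feedback = critique(hypothesis)
    rounds = 0
    while min(feedback.values()) < threshold and rounds < max_rounds:
        hypothesis = revise(hypothesis, feedback)
        feedback = critique(hypothesis)
        rounds += 1
    converged = min(feedback.values()) >= threshold
    return RefineResult(hypothesis, feedback, rounds, converged)


def toy_critique(hypothesis: str) -> Feedback:
    """Toy scorer (assumption): longer statements score higher on every axis."""
    score = min(1.0, len(hypothesis.split()) / 10)
    return {"novelty": score, "validity": score, "clarity": score}


def toy_revise(hypothesis: str, feedback: Feedback) -> str:
    """Toy reviser (assumption): appends a fixed qualifying clause."""
    return hypothesis + " with a measurable control condition"


if __name__ == "__main__":
    result = self_refine("Compound X lowers enzyme activity", toy_critique, toy_revise)
    print(result)
```

Even when `converged` is true, the output is only a candidate: the human-in-the-loop validation listed under Collaboration would still sit after this loop.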

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Survey focuses on LLMs for research, not full ‘AI for Science’ breadth

Benchmarks often depend on small expert annotations and may not scale

When Not To Use

When you need experimentally validated hypotheses without follow-up verification

As a sole decision-maker for peer review or high-stakes editorial judgments

Failure Modes

Hallucinated or unsupported hypotheses presented as facts

Over-reliance on LLM internal knowledge causing benchmark/data leakage issues

Core Entities

Models

GPT-4, LLaMA, ESM-1b, ESM-2

Metrics

novelty, validity/feasibility, clarity, ROUGE, BLEU, BERTScore, MAUVE, human evaluation (expert ratings)
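To make the overlap-based metrics concrete, here is a simplified ROUGE-1 computation (unigram precision, recall, and F1 with clipped counts). It is an illustrative sketch only: it assumes whitespace tokenization and no stemming, so its scores will not necessarily match those of a reference ROUGE implementation. The function name `rouge1` is chosen for this example.

```python
from collections import Counter
from typing import Dict


def rouge1(candidate: str, reference: str) -> Dict[str, float]:
    """Simplified ROUGE-1: unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection clips each token's count to the smaller of the two.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = rouge1(
        "the model proposes a new hypothesis",
        "the model generates a new testable hypothesis",
    )
    # 5 shared unigrams: precision 5/6, recall 5/7, F1 10/13
    print(scores)
```

Surface-overlap scores like this say nothing about novelty or validity, which is one reason the list above pairs them with expert human ratings.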

Datasets

DiscoveryBench (264 real + 903 synthetic tasks), DiscoveryWorld, SciMON, SciGen, SciXGen, ALCE, CiteBench

Benchmarks

DiscoveryBench, DiscoveryWorld, TaskBench, MLAgentBench, AgentBench, MLE-Bench, SciGen, SciXGen, ALCE, CiteBench, PeerRead, MOPRD, NLPeer

Context Entities

Models

ChatGPT, Codex, OpenFold, AlphaFold

Metrics

acceptance prediction, coverage & specificity, semantic similarity, coherence & relevance, diversity & specificity

Datasets

S2ORC, AAN, SciSummNet, CORWA, ASQA, ELI5

Benchmarks

DiscoveryBench, ScienceAgentBench, MLAgentBench, LAB-Bench, DSBench, CORE-Bench