Survey: How LLMs are being used across the full scientific research cycle

January 8, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.75

Citation Count

5

Authors

Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, Xinya Du

Links

Abstract / PDF

Why It Matters For Business

LLMs can speed up idea generation, automate parts of experiment workflows, and draft or pre-check manuscripts. This reduces time-to-insight, but outputs need verification steps to avoid costly mistakes.

Summary TLDR

This is a focused survey that maps how large language models (LLMs) are used at four stages of science: hypothesis discovery, experiment planning & execution, paper writing, and peer review. The paper catalogs methods, datasets, benchmarks, and evaluation choices; highlights practical system patterns (agents, feedback loops, retrieval grounding); and lists key risks (hallucination, evaluation blind spots, reproducibility limits). The authors provide a resource repo for tools and datasets.

Problem Statement

Researchers lack a single, organized view of how LLMs are applied across the whole research workflow. This survey collects methods, benchmarks, evaluation practices, and gaps so practitioners can compare approaches and identify open problems in automated hypothesis generation, experiment planning, scientific writing, and peer review.

Main Contribution

A structured review of LLM applications across four research stages: discovery, experiment, writing, and review

A taxonomy of method components (e.g., inspiration retrieval, novelty/validity/clarity feedback, evolutionary search)

A consolidated list of benchmarks and evaluation trends, plus concrete gaps and research directions

Key Findings

LLMs are being applied at four stages of research: hypothesis discovery, experiment planning/implementation, writing, and peer review.

Literature- and data-driven discovery benchmarks exist; DiscoveryBench contains 264 real discovery tasks plus 903 synthetic tasks.

Numbers: DiscoveryBench, 264 real + 903 synthetic tasks

Human studies find LLMs can produce more novel but slightly less valid hypotheses than humans on evaluated tasks.

Retrieval-augmented generation (RAG) is a common mitigation for hallucination in writing and related-work tasks.

Peer-review automation splits into fully automated review generation and human-in-the-loop review assistants; existing benchmarks score reviews on semantic similarity, coherence, and diversity, alongside human evaluation.

Who Should Care

What To Try In 7 Days

Pilot an LLM-assisted literature digest: connect a retriever + LLM to summarize recent papers in your area and flag novel claims.

Use an LLM to generate and rank a short list of experimental plans for one project, then add human validity checks.

Add an LLM-based reviewer assistant to your internal pre-submission checklist to catch missing citations and simple inconsistencies.
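The literature-digest pilot above could start from something like the following minimal sketch. It assumes a small in-memory list of abstracts and uses a hand-rolled TF-IDF retriever from the standard library only; the function names (`retrieve`, `build_digest_prompt`) and the prompt wording are illustrative assumptions, not anything specified by the survey. The LLM call itself is deliberately left out: the sketch stops at the grounded prompt you would send to whichever model you use.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF, so terms that occur in every document keep a small weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=2):
    """Return indices of up to k documents with non-zero similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms unseen in the corpus get weight 0 and do not affect ranking.
    q_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    scored = sorted(((cosine(q_vec, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [i for score, i in scored[:k] if score > 0]


def build_digest_prompt(query, docs, k=2):
    """Assemble a retrieval-grounded prompt (hypothetical wording) for an LLM."""
    hits = retrieve(query, docs, k)
    context = "\n".join(f"[{i}] {docs[i]}" for i in hits)
    return (
        "Summarize the papers below and flag claims that look novel. "
        "Cite papers by their bracketed index.\n\n"
        f"Topic: {query}\n\nPapers:\n{context}"
    )
```

Asking the model to cite bracketed indices keeps each summary sentence traceable to a retrieved abstract, which makes the human verification step faster. In a real pilot, a dense or hybrid retriever would likely replace the TF-IDF scoring shown here.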

Agent Features

Memory

  • retrieval memory (external docs)
  • short-term context windows

Planning

  • task decomposition
  • chain-of-thought prompts
  • iterative reflection and self-refine
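The iterative reflection and self-refine pattern listed above can be sketched as a generate, critique, refine loop. This is a minimal illustration under stated assumptions: the three callables would normally wrap LLM calls, but here they are toy, deterministic stand-ins (`toy_generate`, `toy_critique`, `toy_refine`, all hypothetical) so the control flow can be run and inspected.

```python
from typing import Callable, List, Tuple


def self_refine(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], List[str]],
    refine: Callable[[str, str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Generate a draft, then alternate critique and refinement.

    Stops early when the critic reports no issues. Returns the final draft
    and the number of refinement rounds actually applied.
    """
    draft = generate(task)
    rounds = 0
    for _ in range(max_rounds):
        issues = critique(task, draft)
        if not issues:
            break
        draft = refine(task, draft, issues)
        rounds += 1
    return draft, rounds


# Toy stand-ins for LLM calls (assumptions for illustration only).
REQUIRED = ("control group", "sample size")


def toy_generate(task: str) -> str:
    return f"Plan for {task}: measure the outcome."


def toy_critique(task: str, draft: str) -> List[str]:
    # Report every required element the draft does not mention yet.
    return [f"missing {item}" for item in REQUIRED if item not in draft]


def toy_refine(task: str, draft: str, issues: List[str]) -> str:
    # Address only the first issue per round, to show multi-round behavior.
    item = issues[0][len("missing "):]
    return f"{draft} Specify the {item}."
```

Capping the loop with `max_rounds` matters in practice: a critic that never returns an empty list would otherwise keep the loop running and the cost growing.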

Tool Use

  • retrievers (RAG)
  • domain-specific tool suites (chemistry tools)
  • lab automation integration

Frameworks

  • AutoGen
  • HuggingGPT
  • AgentBench
  • CycleResearcher

Is Agentic

true

Architectures

  • single-model prompting
  • multi-agent (specialized agents)
  • modular agent-controller

Collaboration

  • multi-agent coordination
  • human-in-the-loop validation

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Survey focuses on LLMs for research, not full ‘AI for Science’ breadth
  • Benchmarks often depend on small expert annotations and may not scale
  • Reliability of automatic validity checks is weak for lab-heavy disciplines (chemistry, biology)
  • Ethical issues (plagiarism, authorship, homogenized reviews) are unresolved

When Not To Use

  • When you need experimentally validated hypotheses without follow-up verification
  • As a sole decision-maker for peer review or high-stakes editorial judgments
  • For domain-critical lab protocols without human expert oversight

Failure Modes

  • Hallucinated or unsupported hypotheses presented as facts
  • Over-reliance on LLM internal knowledge causing benchmark/data leakage issues
  • Poor prompt robustness causing inconsistent multi-stage plans
  • Homogenization of reviews or ideas when many users rely on the same LLM outputs

Core Entities

Models

  • GPT-4
  • LLaMA
  • ESM-1b
  • ESM-2

Metrics

  • novelty
  • validity/feasibility
  • clarity
  • ROUGE
  • BLEU
  • BERTScore
  • MAUVE
  • human evaluation (expert ratings)
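For readers unfamiliar with the overlap metrics listed above, here is a simplified sketch of ROUGE-N. It assumes plain lowercase tokenization with no stemming or stopword handling, so its scores will differ somewhat from standard ROUGE toolkits; the function name `rouge_n` is an illustrative choice.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap between candidate and reference."""

    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(tokenize(candidate))
    ref = ngrams(tokenize(reference))
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Because the score only counts surface n-gram overlap, a paraphrase that is factually correct can score low and a fluent but wrong sentence can score high. That gap is one reason overlap metrics appear alongside human evaluation in the list above.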

Datasets

  • DiscoveryBench (264 real + 903 synthetic tasks)
  • DiscoveryWorld
  • SciMON
  • SciGen
  • SciXGen
  • ALCE
  • CiteBench

Benchmarks

  • DiscoveryBench
  • DiscoveryWorld
  • TaskBench
  • MLAgentBench
  • AgentBench
  • MLE-Bench
  • SciGen
  • SciXGen
  • ALCE
  • CiteBench
  • PeerRead
  • MOPRD
  • NLPeer

Context Entities

Models

  • ChatGPT
  • Codex
  • OpenFold
  • AlphaFold

Metrics

  • acceptance prediction
  • coverage & specificity
  • semantic similarity
  • coherence & relevance
  • diversity & specificity

Datasets

  • S2ORC
  • AAN
  • SciSummNet
  • CORWA
  • ASQA
  • ELI5

Benchmarks

  • DiscoveryBench
  • ScienceAgentBench
  • MLAgentBench
  • LAB-Bench
  • DSBench
  • CORE-Bench