Survey: How LLMs are being used across the full scientific research cycle

Overview

Decision Snapshot: Needs Validation

The survey documents practical systems and benchmarks but also highlights substantial reliability gaps (hallucinations, weak validity checks, domain limits). Pilots and supervised deployments are practical now; full automation is premature.

Citations: 5

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 1/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 75%

Production readiness: 60%

Novelty: 70%

Authors

Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, Xinya Du

Links

Abstract / PDF / Code

Why It Matters For Business

LLMs can speed up idea generation, automate parts of experiment workflows, and draft or pre-check manuscripts. This reduces time-to-insight, but outputs need verification steps to avoid costly mistakes.

Who Should Care

Product Manager, ML Engineer, Founder, CTO, Data Scientist

Summary TLDR

This is a focused survey that maps how large language models (LLMs) are used at four stages of science: hypothesis discovery, experiment planning & execution, paper writing, and peer review. The paper catalogs methods, datasets, benchmarks, and evaluation choices; highlights practical system patterns (agents, feedback loops, retrieval grounding); and lists key risks (hallucination, evaluation blind spots, reproducibility limits). The authors provide a resource repo for tools and datasets.

Problem Statement

Researchers lack a single, organized view of how LLMs are applied across the whole research workflow. This survey collects methods, benchmarks, evaluation practices, and gaps so practitioners can compare approaches and identify open problems in automated hypothesis generation, experiment planning, scientific writing, and peer review.

Main Contribution

A structured review of LLM applications across four research stages: discovery, experiment, writing, and review

A taxonomy of method components (e.g., inspiration retrieval, novelty/validity/clarity feedback, evolutionary search)

Key Findings

LLMs are being applied at four reproducible stages of research: hypothesis discovery, experiment planning/implementation, writing, and peer review.

Practical Use: Treat LLMs as workflow tools: pick the stage you want to accelerate (idea, lab planning, writing, or review) and adopt the task-specific methods and benchmarks discussed.

Evidence Ref: Introduction; Fig. 1; §2–§5

Literature- and data-driven discovery benchmarks exist; DiscoveryBench contains 264 real discovery tasks plus 903 synthetic tasks.

Numbers: DiscoveryBench (264 real + 903 synthetic tasks)

Practical Use: If you want to evaluate hypothesis-generation systems, start with DiscoveryBench to test data-driven discovery performance.

Evidence Ref: §2.4.2 (DiscoveryBench [108])

What To Try In 7 Days

Pilot an LLM-assisted literature digest: connect a retriever + LLM to summarize recent papers in your area and flag novel claims.

Use an LLM to generate and rank a short list of experimental plans for one project, then add human validity checks.

Add an LLM-based reviewer assistant to your internal pre-submission checklist to catch missing citations and simple inconsistencies.
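The first pilot above (retriever + LLM digest) can be sketched as follows. This is a minimal sketch under stated assumptions: a tiny in-memory corpus, a bag-of-words cosine retriever standing in for a real retriever, and a hypothetical `summarize` callable standing in for an LLM call. None of these names come from the survey, and a human should still check whatever the digest flags.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, papers: Dict[str, str], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k paper ids whose abstracts best match the query."""
    q = Counter(tokenize(query))
    scored = [(pid, cosine(q, Counter(tokenize(text)))) for pid, text in papers.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]


def digest(
    query: str,
    papers: Dict[str, str],
    summarize: Callable[[str], str],
    k: int = 2,
) -> List[dict]:
    """Retrieve the top-k papers, then summarize each one."""
    hits = retrieve(query, papers, k)
    return [
        {"paper": pid, "score": round(score, 3), "summary": summarize(papers[pid])}
        for pid, score in hits
    ]


def stub_summarize(text: str) -> str:
    """Placeholder for an LLM call (assumption): returns the first sentence only."""
    return text.split(".")[0].strip() + "."


# Hypothetical abstracts, made up for illustration.
PAPERS = {
    "p1": "LLM agents plan chemistry experiments. They call lab tools.",
    "p2": "A benchmark for hypothesis generation from data. It has real tasks.",
    "p3": "Peer review assistants flag missing citations. Humans verify.",
}

result = digest("hypothesis generation benchmark", PAPERS, stub_summarize, k=1)
print(result)
```

Swapping `stub_summarize` for a real model call keeps the rest of the pipeline unchanged, which makes it easy to compare summarizers on the same retrieved set.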

Agent Features

Memory

retrieval memory (external docs), short-term context windows

Planning

task decomposition, chain-of-thought prompts, iterative reflection and self-refine
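Iterative reflection and self-refine can be read as a generate, critique, revise loop. The following is a minimal sketch, assuming hypothetical `generate`, `critique`, and `revise` callables in place of LLM calls. The toy stubs are stand-ins for illustration and do not reproduce any surveyed system.

```python
from typing import Callable, List, Tuple


def self_refine(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Draft, then alternate critique and revision until no issues remain.

    Returns the final draft and the number of revision rounds used.
    """
    draft = generate(task)
    for round_idx in range(max_rounds):
        issues = critique(draft)
        if not issues:
            return draft, round_idx
        draft = revise(draft, issues)
    return draft, max_rounds


# Toy stand-ins for LLM calls (assumptions for illustration only).
def toy_generate(task: str) -> str:
    return f"Hypothesis: {task}."


def toy_critique(draft: str) -> List[str]:
    issues = []
    if "control" not in draft:
        issues.append("add a control condition")
    if "metric" not in draft:
        issues.append("name an evaluation metric")
    return issues


def toy_revise(draft: str, issues: List[str]) -> str:
    # Address only the first issue per round so the loop visibly iterates.
    fix = issues[0]
    if "control" in fix:
        return draft + " Include a control group."
    return draft + " Report one metric."


final, rounds = self_refine(
    "retrieval improves novelty", toy_generate, toy_critique, toy_revise
)
print(rounds, final)
```

The `max_rounds` cap is a deliberate design choice: it bounds cost when the critic never stops finding issues.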

Tool Use

retrievers (RAG), domain-specific tool suites (chemistry tools), lab automation integration

Frameworks

AutoGen, HuggingGPT, AgentBench, CycleResearcher

Is Agentic

Yes

Architectures

single-model prompting, multi-agent (specialized agents), modular agent-controller

Collaboration

multi-agent coordination, human-in-the-loop validation

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/du-nlp-lab/LLM4SR

Risks & Boundaries

Limitations

Survey focuses on LLMs for research, not full ‘AI for Science’ breadth

Benchmarks often depend on small sets of expert annotations and may not scale

When Not To Use

When you need experimentally validated hypotheses without follow-up verification

As a sole decision-maker for peer review or high-stakes editorial judgments

Failure Modes

Hallucinated or unsupported hypotheses presented as facts

Over-reliance on LLM internal knowledge causing benchmark/data leakage issues

Core Entities

Models

GPT-4, LLaMA, ESM-1b, ESM-2

Metrics

novelty, validity/feasibility, clarity, ROUGE, BLEU, BERTScore, MAUVE, human evaluation (expert ratings)
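Of the metrics listed, the n-gram overlap scores are simple enough to work through by hand. The following is a minimal sketch of ROUGE-1 (unigram overlap) precision, recall, and F1, assuming whitespace tokenization and no stemming. Real evaluations would normally use an established package rather than this simplified version.

```python
from collections import Counter
from typing import Dict


def rouge1(candidate: str, reference: str) -> Dict[str, float]:
    """Unigram-overlap precision, recall, and F1 (simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up sentences for illustration: 4 of 6 unigrams match on each side.
scores = rouge1(
    "the model proposes a new hypothesis",
    "the model generates a novel hypothesis",
)
print(scores)
```

Here four of six tokens overlap in both directions, so precision, recall, and F1 all equal 2/3. The near-synonyms "new" and "novel" earn no credit, which illustrates why overlap metrics are paired with embedding-based scores and human evaluation.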

Datasets

DiscoveryBench, DiscoveryWorld, SciMON, DiscoveryBench (264+903), SciGen, SciXGen, ALCE, CiteBench

Benchmarks

DiscoveryBench, DiscoveryWorld, TaskBench, MLAgentBench, AgentBench, MLE-Bench, SciGen, SciXGen, ALEC/ALCE, CiteBench, PeerRead, MOPRD, NLPeer

Context Entities

Models

ChatGPT, Codex, OpenFold, AlphaFold

Metrics

acceptance prediction, coverage & specificity, semantic similarity, coherence & relevance, diversity & specificity

Datasets

S2ORC, AAN, SciSummNet, CORWA, ASQA, ELI5

Benchmarks

DiscoveryBench, ScienceAgentBench, MLAgentBench, LAB-Bench, DSBench, CORE-Bench
