AutoSurvey: use retrieval and parallel LLMs to auto-write long, citation-backed surveys

June 10, 2024 | 8 min

Overview

Decision Snapshot: Ready For Pilot

The pipeline combines existing building blocks (RAG, outline-driven generation, LLM ensembles) into a practical system with strong empirical gains in speed and near-human citation/content quality on tested topics.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 7/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yidong Wang, Qi Guo, Wenjin Yao, Hongbo Zhang, Xin Zhang, Zhen Wu, Meishan Zhang, Xinyu Dai, Min Zhang, Qingsong Wen, Wei Ye, Shikun Zhang, Yue Zhang

Why It Matters For Business

AutoSurvey turns long, costly survey writing into fast, repeatable drafts that are almost human-quality for coverage and relevance, letting teams scan and document literature rapidly and cheaply.

Who Should Care

Summary TLDR

AutoSurvey is a pipeline that combines retrieval, parallel LLM drafting, and multi-LLM evaluation to automatically generate long literature surveys (up to 64k tokens). It uses iterative Retrieval-Augmented Generation (RAG) to fetch up-to-date papers, generates an outline, drafts subsections in parallel, refines and checks citations, and ranks candidates with a Multi-LLM-as-Judge system calibrated by humans. On 20 LLM-related topics it matches or approaches human citation and content quality while running tens of surveys per hour (e.g., 73.6 surveys/hour for a 64k-token output) at a low token cost; main failure modes are citation overgeneralization and occasional misalignment. (See Table 2, 3

Problem Statement

Writing up-to-date, comprehensive surveys by hand is slow, and automating the task with LLMs is hard: models have limited output windows, may hallucinate references or miss the newest papers, and there is no scalable automatic evaluation that matches human judgment.

Main Contribution

A practical pipeline (AutoSurvey) that blends retrieval, outline-driven parallel drafting, refinement, and multi-LLM judging to produce long surveys.

A retrieval and real-time update strategy, so that generated surveys cite recent papers and contain fewer hallucinated references.
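The outline-driven, parallel drafting idea can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the AutoSurvey code: `retrieve` is a toy word-overlap retriever, `fake_llm` stands in for a real model call, the outline is passed in rather than generated, and the refinement and judging stages are omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def retrieve(query, corpus, k=2):
    """Hypothetical retriever: rank papers by word overlap with the query."""
    words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda p: -len(words & set(p["abstract"].lower().split())),
    )
    return ranked[:k]


def fake_llm(prompt):
    """Stand-in for a real LLM call; returns a deterministic placeholder."""
    return f"[draft for: {prompt}]"


def draft_subsection(title, corpus, llm):
    """Retrieve papers for one outline entry, then draft it with citations."""
    papers = retrieve(title, corpus)
    cited = [p["id"] for p in papers]
    text = llm(f"Write '{title}' citing {', '.join(cited)}")
    return {"title": title, "text": text, "citations": cited}


def write_survey(topic, outline, corpus, llm=fake_llm, workers=4):
    """Draft every outline entry in parallel; pool.map keeps outline order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sections = list(
            pool.map(lambda t: draft_subsection(t, corpus, llm), outline)
        )
    return {"topic": topic, "sections": sections}
```

Because each subsection only needs its own retrieved papers, the drafts are independent and can run concurrently, which is where the end-to-end speedup would come from under these assumptions.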

Key Findings

AutoSurvey is far faster than humans and naive RAG for long surveys.

Numbers: 64k-token speed (surveys/hour): AutoSurvey 73.59 vs human 0.07 and naive RAG 12.56

Practical Use: Use AutoSurvey to prototype and iterate long literature reviews quickly; it reduces time from many human-hours to minutes per topic.

Evidence Ref: Table 2
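To make the throughput figures more concrete, the short sketch below converts the reported surveys/hour rates into time per survey. The rates are taken from the finding above; the helper name `seconds_per_survey` is introduced only for this illustration.

```python
SECONDS_PER_HOUR = 3600

# Reported 64k-token throughput in surveys/hour (from the finding above).
rates = {"AutoSurvey": 73.59, "naive RAG": 12.56, "human": 0.07}


def seconds_per_survey(rate_per_hour):
    """Invert a surveys/hour rate into seconds needed for one survey."""
    return SECONDS_PER_HOUR / rate_per_hour


for name, rate in rates.items():
    minutes = seconds_per_survey(rate) / 60
    print(f"{name}: {minutes:.1f} minutes per survey")
```

Under these numbers, one AutoSurvey run takes under a minute, while the human rate corresponds to roughly 14 hours per survey.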

Citation quality approaches human levels and improves over naive RAG.

Numbers: 64k-token citation recall / precision (percent): AutoSurvey 82.25 / 77.41 vs naive RAG 68.79 / 61.97

Practical Use: RAG plus targeted citation checking meaningfully reduces irrelevant or unsupported citations versus naive retrieval.

Evidence Ref: Table 2
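A rough sketch of how citation recall and precision could be computed from per-citation support judgments is shown here. The definitions are assumptions for illustration: recall is taken as the share of claims supported by at least one of their citations, and precision as the share of citations that support the claim they are attached to. The `supports` callable is a hypothetical stand-in for an NLI-style support check.

```python
def citation_metrics(claims, supports):
    """Compute (recall, precision) from support judgments.

    claims:   list of (claim_text, [cited_paper_ids])
    supports: callable (claim_text, paper_id) -> bool, e.g. an NLI check
    """
    supported_claims = 0
    total_citations = 0
    supporting_citations = 0
    for text, cited in claims:
        verdicts = [supports(text, paper_id) for paper_id in cited]
        supported_claims += any(verdicts)      # claim backed by >= 1 citation
        total_citations += len(verdicts)
        supporting_citations += sum(verdicts)  # citations that back the claim
    recall = supported_claims / len(claims) if claims else 0.0
    precision = supporting_citations / total_citations if total_citations else 0.0
    return recall, precision


# Toy data: a lookup set plays the role of the support checker.
gold = {("A", "p1"), ("B", "p3")}
claims = [("A", ["p1", "p2"]), ("B", ["p3"]), ("C", ["p4"])]
recall, precision = citation_metrics(claims, lambda c, p: (c, p) in gold)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In the toy data, two of three claims are supported and two of four citations are supporting, so an unsupported claim lowers recall while a loosely related extra citation lowers precision.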

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Speed (64k-token surveys) | 73.59 surveys/hour | human 0.07; naive RAG 12.56 | AutoSurvey >> baselines | 20 LLM topics | Measured end-to-end API time per method | Table 2
Citation Recall (64k) | 82.25% | naive RAG 68.79%; human 86.33% | +13.46 vs naive RAG | 20 LLM topics | Citation recall computed via NLI-based support check | Table 2
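The deltas in the results above can be re-derived from the reported values. The sketch below is only a sanity check on the arithmetic; the dictionary layout and the helper names `delta` and `ratio` are introduced for illustration.

```python
# Reported 64k-token values: speed in surveys/hour, recall in percent.
results = {
    "speed": {"AutoSurvey": 73.59, "naive RAG": 12.56, "human": 0.07},
    "citation_recall": {"AutoSurvey": 82.25, "naive RAG": 68.79, "human": 86.33},
}


def delta(metric, baseline):
    """Absolute difference between AutoSurvey and a baseline."""
    row = results[metric]
    return round(row["AutoSurvey"] - row[baseline], 2)


def ratio(metric, baseline):
    """Multiplicative factor of AutoSurvey over a baseline."""
    row = results[metric]
    return row["AutoSurvey"] / row[baseline]


print("recall vs naive RAG:", delta("citation_recall", "naive RAG"))
print("recall vs human:", delta("citation_recall", "human"))
print("speedup vs naive RAG:", round(ratio("speed", "naive RAG"), 1))
print("speedup vs human:", round(ratio("speed", "human")))
```

The recall delta against naive RAG reproduces the +13.46 points listed above, the remaining gap to human-written surveys is about 4 points, and the speed advantage is roughly 6x over naive RAG and about three orders of magnitude over humans.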

What To Try In 7 Days

Run the AutoSurvey repo on one target topic to generate an 8k–32k-token draft and inspect its citations.

Integrate an embedding-based RAG step (nomic-embed-text-v1.5) for your internal paper database to keep reviews current.

Use a small LLM ensemble as an automated quality filter for citation recall/precision before human review.
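As a starting point for the embedding-based retrieval step, here is a minimal, dependency-free sketch of a paper index with cosine-similarity search. The `toy_embed` function is a hashed bag-of-words stand-in and an assumption of this example; in practice it would be replaced by a real embedding model such as the nomic-embed-text-v1.5 mentioned above. The `PaperIndex` class and its methods are hypothetical names.

```python
import math
import zlib

DIM = 256  # toy embedding size (assumption)


def toy_embed(text):
    """Hashed bag-of-words vector, L2-normalized. Stand-in for a real model."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


class PaperIndex:
    """In-memory index: stores one embedding per paper abstract."""

    def __init__(self, embed=toy_embed):
        self.embed = embed
        self.items = []

    def add(self, paper_id, abstract):
        self.items.append((paper_id, self.embed(abstract)))

    def search(self, query, k=3):
        """Return the ids of the k papers most similar to the query."""
        q = self.embed(query)
        scored = sorted(
            ((cosine(q, vec), pid) for pid, vec in self.items), reverse=True
        )
        return [pid for _, pid in scored[:k]]
```

Adding newly published abstracts to the index is what would keep generated reviews current; anything never added to the index cannot be cited.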

Agent Features

Memory
retrieval memory (paper embeddings)
Planning
initial retrieval and outline generation; subsection drafting; integration and refinement; evaluation and iteration
Tool Use
embedding retrieval; RAG; multi-LLM evaluation
Frameworks
AutoSurvey
Architectures
outline-driven pipeline; parallel LLM workers for subsections
Collaboration
parallel multi-LLM drafting; LLM ensemble voting for evaluation
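The ensemble-evaluation idea can be sketched as follows. This is an illustrative assumption rather than the AutoSurvey judging procedure: each judge is a hypothetical callable returning a score on a 5-point scale, candidates are ranked by mean score, and a simple majority vote against a threshold acts as a quality filter.

```python
from statistics import mean


def ensemble_score(candidate, judges):
    """Average the scores that all judges assign to one candidate."""
    return mean(judge(candidate) for judge in judges)


def rank_candidates(candidates, judges):
    """Order candidates from best to worst by mean judge score."""
    return sorted(
        candidates, key=lambda c: ensemble_score(c, judges), reverse=True
    )


def passes_filter(candidate, judges, threshold=3.5):
    """Majority vote: more than half the judges must score >= threshold."""
    votes = [judge(candidate) >= threshold for judge in judges]
    return sum(votes) > len(votes) / 2


# Toy judges that read precomputed scores; real judges would call LLMs.
judge_names = ["judge_1", "judge_2", "judge_3"]
judges = [lambda c, name=name: c["scores"][name] for name in judge_names]

drafts = [
    {"name": "draft_b", "scores": {"judge_1": 2, "judge_2": 3, "judge_3": 4}},
    {"name": "draft_a", "scores": {"judge_1": 4, "judge_2": 5, "judge_3": 3}},
]
best = rank_candidates(drafts, judges)[0]
print(best["name"], passes_filter(best, judges))
```

Using several judges is meant to dampen the bias of any single model; the threshold of 3.5 is arbitrary and would need tuning against human ratings.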

Optimization Features

Token Efficiency
outline-guided chunking reduces redundant context
System Optimization
parallel subsection generation to speed up end-to-end time

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Citation errors remain: overgeneralization is the largest issue (51% of sampled failures).

System depends on the retrieval database; paywalled or non-indexed papers are missed.

When Not To Use

When you need a flawless, peer-reviewed survey without any human check.

For topics where key sources are behind paywalls or absent from the retrieval corpus.

Failure Modes

Overgeneralization: claims extend beyond what cited sources support (majority of errors).

Misalignment: citations that are only loosely related to a claim and do not actually support it.

Core Entities

Models

Claude-3-Haiku, GPT-4, Gemini-1.5-Pro

Metrics

Citation Recall, Citation Precision, Coverage (5-pt), Structure (5-pt), Relevance (5-pt), Spearman's rho

Datasets

arXiv corpus (530k computer-science papers)

Benchmarks

AutoSurvey evaluation (citation & content quality metrics)

Context Entities

Models

Claude-haiku (writer baseline), GPT-4 (evaluator/writer), Gemini-1.5-pro (evaluator)

Metrics

Speed (surveys/hour), Accuracy

Datasets

Selected 20 human-written surveys (for comparisons and meta-eval)

Benchmarks

Human expert rankings (meta-evaluation)
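Spearman's rho is listed among the metrics, and human expert rankings serve as the meta-evaluation reference. A plausible reading, stated here as an assumption, is that rho measures how closely the automatic judges' ranking of surveys agrees with the human ranking. The sketch below computes rho as the Pearson correlation of average ranks; all function names are introduced for this illustration.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for five surveys from human experts and an LLM judge.
human_scores = [1, 2, 3, 4, 5]
judge_scores = [2, 1, 4, 3, 5]
print(round(spearman_rho(human_scores, judge_scores), 2))
```

A value near 1 would indicate that the automatic ranking orders surveys almost the same way the experts do; the toy scores above, with two adjacent swaps, give 0.8.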