AutoSurvey: use retrieval and parallel LLMs to auto-write long, citation-backed surveys

Overview

Decision Snapshot: Ready For Pilot

The pipeline combines existing building blocks (RAG, outline-driven generation, LLM ensembles) into a practical system with strong empirical gains in speed and near-human citation/content quality on tested topics.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 7/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Yidong Wang, Qi Guo, Wenjin Yao, Hongbo Zhang, Xin Zhang, Zhen Wu, Meishan Zhang, Xinyu Dai, Min Zhang, Qingsong Wen, Wei Ye, Shikun Zhang, Yue Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AutoSurvey turns long, costly survey writing into fast, repeatable drafts that are almost human-quality for coverage and relevance, letting teams scan and document literature rapidly and cheaply.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

AutoSurvey is a pipeline that combines retrieval, parallel LLM drafting, and multi-LLM evaluation to automatically generate long literature surveys (up to 64k tokens). It uses iterative Retrieval-Augmented Generation (RAG) to fetch up-to-date papers, generates an outline, drafts subsections in parallel, refines and checks citations, and ranks candidates with a Multi-LLM-as-Judge system calibrated by humans. On 20 LLM-related topics it matches or approaches human citation and content quality while running tens of surveys per hour (e.g., 73.6 surveys/hour for a 64k-token output) at a low token cost; main failure modes are citation overgeneralization and occasional misalignment. (See Table 2, 3

Problem Statement

Writing up-to-date, comprehensive surveys is slow and hard because LLMs have limited output windows, may hallucinate or lack the newest papers, and there is no scalable automatic evaluation that matches human judgment.

Main Contribution

A practical pipeline (AutoSurvey) that blends retrieval, outline-driven parallel drafting, refinement, and multi-LLM judging to produce long surveys.

A retrieval/real-time update strategy so generated surveys cite recent papers and reduce hallucinated references.
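
The stages named above (retrieve, outline, draft subsections in parallel, merge) can be sketched as follows. This is a minimal illustration, not the AutoSurvey code: `retrieve`, `write_outline`, `draft_subsection`, and `auto_survey` are hypothetical names, the LLM calls are replaced by string-building stubs, and retrieval is a toy word-overlap ranking.

```python
from concurrent.futures import ThreadPoolExecutor


def retrieve(query, corpus, k=3):
    """Toy retrieval (assumption): rank papers by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda p: len(q & set(p["abstract"].lower().split())),
        reverse=True,
    )
    return ranked[:k]


def write_outline(topic, papers):
    """Stub for the LLM outline call: one subsection heading per retrieved paper."""
    return [f"{topic}: {p['title']}" for p in papers]


def draft_subsection(heading, corpus):
    """Stub for one LLM drafting call; cites the ids of papers retrieved for it."""
    cited = [p["id"] for p in retrieve(heading, corpus, k=2)]
    return f"{heading}\n(draft text) [{', '.join(cited)}]"


def auto_survey(topic, corpus, max_workers=4):
    papers = retrieve(topic, corpus)
    outline = write_outline(topic, papers)
    # Subsections are independent once the outline is fixed, so they can be
    # drafted concurrently; pool.map returns results in outline order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        drafts = list(pool.map(lambda h: draft_subsection(h, corpus), outline))
    return "\n\n".join(drafts)
```

In a real deployment the stubs would wrap API calls, and the refinement and judging stages would follow the merge step.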

Key Findings

AutoSurvey is far faster than humans and naive RAG for long surveys.

Numbers: 64k-token speed: AutoSurvey 73.59 vs human 0.07 and naive RAG 12.56 (surveys/hour)

Practical Use: Use AutoSurvey to prototype and iterate long literature reviews quickly; it reduces time from many human-hours to minutes per topic.

Evidence Ref: Table 2
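
The speed gap can be made concrete with a little arithmetic on the reported throughput figures. The sketch below assumes that surveys/hour can be inverted into time per survey; the helper names are illustrative only.

```python
# Reported 64k-token throughput in surveys/hour.
speeds = {"AutoSurvey": 73.59, "naive RAG": 12.56, "human": 0.07}


def speedup(method, baseline):
    """How many times faster `method` is than `baseline`."""
    return speeds[method] / speeds[baseline]


def minutes_per_survey(method):
    """Invert throughput into wall-clock minutes for one survey."""
    return 60.0 / speeds[method]


for name in speeds:
    print(f"{name}: {minutes_per_survey(name):.1f} min per survey")
print(f"vs naive RAG: {speedup('AutoSurvey', 'naive RAG'):.1f}x")
print(f"vs human: {speedup('AutoSurvey', 'human'):.0f}x")
```

Under that assumption, AutoSurvey is roughly 5.9x faster than naive RAG and about 1,050x faster than a human writer, or under a minute per survey versus roughly 14 hours.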

Citation quality approaches human levels and improves over naive RAG.

Numbers: 64k-token citation recall/precision: AutoSurvey 82.25 / 77.41 vs naive RAG 68.79 / 61.97 (percent)

Practical Use: RAG plus targeted citation checking meaningfully reduces irrelevant or unsupported citations versus naive retrieval.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Speed (64k-token surveys)	73.59 surveys/hour	human 0.07; naive RAG 12.56	AutoSurvey >> baselines	20 LLM topics	Measured end-to-end API time per method	Table 2
Citation Recall (64k)	82.25%	naive RAG 68.79%, human 86.33%	+13.46 vs naive RAG	20 LLM topics	Citation recall computed via NLI-based support check	Table 2
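
To show how a support check turns into recall and precision numbers, here is a simplified sketch. It is an assumption-laden illustration, not the paper's exact metric: `supports` is a toy keyword test standing in for an NLI model, recall is taken as the share of claims backed by at least one cited source, and precision as the share of individual citations that back their claim.

```python
def supports(source_text, claim):
    """Toy stand-in for NLI entailment: every claim word appears in the source."""
    return set(claim.lower().split()) <= set(source_text.lower().split())


def citation_scores(claims, sources):
    """claims: list of (claim_text, [cited ids]); sources: dict of id -> text."""
    supported_claims = 0
    supporting_citations = 0
    total_citations = 0
    for text, cited in claims:
        hits = [supports(sources[c], text) for c in cited]
        supported_claims += any(hits)      # claim counts once if any source backs it
        supporting_citations += sum(hits)  # each citation is judged on its own
        total_citations += len(cited)
    recall = supported_claims / len(claims) if claims else 0.0
    precision = supporting_citations / total_citations if total_citations else 0.0
    return recall, precision
```

A claim with one good and one irrelevant citation keeps recall high but lowers precision, which is the misalignment pattern a citation check is meant to catch.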

What To Try In 7 Days

Run AutoSurvey repo on one target topic to generate an 8k–32k draft and inspect citations.

Integrate an embedding-based RAG step (nomic-embed-text-v1.5) for your internal paper database to keep reviews current.

Use a small LLM ensemble as an automated quality filter for citation recall/precision before human review.
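
For the embedding-based retrieval step, a minimal sketch of the index-and-rank logic might look like this. The `embed` function is a hypothetical bag-of-words stand-in so the example runs without any model; in practice it would be replaced by a call to an embedding model such as nomic-embed-text-v1.5, whose API is not shown here.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in for an embedding model: a word-count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_index(papers):
    """Embed each abstract once so queries only need one embedding call."""
    return [(p["id"], embed(p["abstract"])) for p in papers]


def top_k(query, index, k=2):
    """Return the ids of the k papers most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [paper_id for paper_id, _ in ranked[:k]]
```

Re-running `build_index` as new papers arrive is what keeps the generated reviews current; the ranking code itself does not change.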

Agent Features

Memory

retrieval memory (paper embeddings)

Planning

initial retrieval and outline generation; subsection drafting; integration and refinement; evaluation and iteration

Tool Use

embedding retrieval, RAG, multi-LLM evaluation

Frameworks

AutoSurvey

Architectures

outline-driven pipeline, parallel LLM workers for subsections

Collaboration

parallel multi-LLM drafting, LLM ensemble voting for evaluation
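
One simple way to combine several judges is to average their scores and rank candidates by the mean. The sketch below assumes each judge is a callable returning a 1 to 5 score; the aggregation rule and the name `ensemble_rank` are illustrative assumptions, not the paper's exact voting scheme.

```python
from statistics import mean


def ensemble_rank(candidates, judges):
    """Score each candidate with every judge and sort by mean score, best first."""
    scored = [(mean(judge(c) for judge in judges), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def best_candidate(candidates, judges):
    """Return the text of the top-ranked candidate."""
    return ensemble_rank(candidates, judges)[0][1]
```

Averaging dampens the quirks of any single judge; a real setup would wrap one LLM API call per judge and calibrate the scores against human ratings.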

Optimization Features

Token Efficiency

outline-guided chunking reduces redundant context

System Optimization

parallel subsection generation to speed up end-to-end time

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/AutoSurveys/AutoSurvey

Data URLs

https://arxiv.org

Risks & Boundaries

Limitations

Citation errors remain: overgeneralization is the largest issue (51% of sampled failures).

System depends on the retrieval database; paywalled or non-indexed papers are missed.

When Not To Use

When you need a flawless, peer-reviewed survey without any human check.

For topics where key sources are behind paywalls or absent from the retrieval corpus.

Failure Modes

Overgeneralization: claims extend beyond what cited sources support (majority of errors).

Misalignment: irrelevant citations that are loosely related but do not support claims.

Core Entities

Models

Claude-3-Haiku, GPT-4, Gemini-1.5-Pro

Metrics

Citation Recall, Citation Precision, Coverage (5-pt), Structure (5-pt), Relevance (5-pt), Spearman's rho

Datasets

arXiv corpus (530k computer-science papers)

Benchmarks

AutoSurvey evaluation (citation & content quality metrics)

Context Entities

Models

Claude-haiku (writer baseline), GPT-4 (evaluator/writer), Gemini-1.5-pro (evaluator)

Metrics

Speed (surveys/hour), Accuracy

Datasets

Selected 20 human-written surveys (for comparisons and meta-eval)

Benchmarks

Human expert rankings (meta-evaluation)
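
Agreement between automated judges and human expert rankings is typically summarized with Spearman's rho. A minimal sketch, assuming no tied scores (ties would need average ranks and the general Pearson-on-ranks form), is:

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

A value near 1 means the judges order the surveys the same way the human experts do; a value near 0 means the automated ranking carries little information about human preference.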
