AutoSurvey: use retrieval and parallel LLMs to auto-write long, citation-backed surveys

June 10, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

4

Authors

Yidong Wang, Qi Guo, Wenjin Yao, Hongbo Zhang, Xin Zhang, Zhen Wu, Meishan Zhang, Xinyu Dai, Min Zhang, Qingsong Wen, Wei Ye, Shikun Zhang, Yue Zhang

Links

Abstract / PDF

Why It Matters For Business

AutoSurvey turns long, costly survey writing into fast, repeatable drafts that are almost human-quality for coverage and relevance, letting teams scan and document literature rapidly and cheaply.

Summary TLDR

AutoSurvey is a pipeline that combines retrieval, parallel LLM drafting, and multi-LLM evaluation to automatically generate long literature surveys (up to 64k tokens). It uses iterative Retrieval-Augmented Generation (RAG) to fetch up-to-date papers, generates an outline, drafts subsections in parallel, refines and checks citations, and ranks candidates with a Multi-LLM-as-Judge system calibrated by humans. On 20 LLM-related topics it matches or approaches human citation and content quality while running tens of surveys per hour (e.g., 73.6 surveys/hour for a 64k-token output) at a low token cost; main failure modes are citation overgeneralization and occasional misalignment. (See Table 2, 3

Problem Statement

Writing up-to-date, comprehensive surveys by hand is slow, and automating the task is hard: LLMs have limited output windows, may hallucinate references or miss the newest papers, and no scalable automatic evaluation matches human judgment.

Main Contribution

A practical pipeline (AutoSurvey) that blends retrieval, outline-driven parallel drafting, refinement, and multi-LLM judging to produce long surveys.

A retrieval strategy with real-time updates, so that generated surveys cite recent papers and contain fewer hallucinated references.

A Multi-LLM-as-Judge evaluation calibrated by humans to score citation quality and content quality automatically.

Thorough experiments showing speed and quality trade-offs versus naive RAG and human-authored surveys, plus open-source code and prompts.

Key Findings

AutoSurvey is far faster than humans and naive RAG for long surveys.

Numbers: 64k-token speed (surveys/hour): AutoSurvey 73.59 vs human 0.07 and naive RAG 12.56

Citation quality approaches human levels and improves over naive RAG.

Numbers: 64k-token citation recall / precision (percent): AutoSurvey 82.25 / 77.41 vs naive RAG 68.79 / 61.97

Content quality (coverage, structure, relevance) is close to human-written surveys.

Numbers: 64k-token content scores (5-point scale): AutoSurvey coverage 4.73, structure 4.33, relevance 4.86; human 5.00, 4.66, 5.00

Removing retrieval greatly harms citation accuracy.

Numbers: Ablation without retrieval: recall 60.11, precision 51.65 (percent)

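Generated surveys supply topic knowledge: a model answering multiple-choice questions is more accurate with an AutoSurvey text as context than with no context or with naive RAG output.

Numbers: Multiple-choice accuracy: AutoSurvey 67.60% vs direct 58.40% and naive RAG 65.20%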

Automated judging correlates moderately with human rankings.

Numbers: Spearman's rho: mixture of LLMs 0.5429
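
Spearman's rho, the statistic in the last finding, measures how closely two rankings of the same items agree. Here is a minimal sketch of the calculation for rankings without ties. The ranks are hypothetical and are not the paper's data, and `spearman_rho` is an illustrative helper rather than a function from the AutoSurvey code.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must have the same length")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items")
    # Sum of squared rank differences, then the closed-form tie-free formula.
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical ranks for five surveys (1 = best); illustrative only.
human_ranks = [1, 2, 3, 4, 5]
judge_ranks = [2, 1, 3, 5, 4]

rho = spearman_rho(human_ranks, judge_ranks)
print(round(rho, 2))  # 0.8
```

A value of 1 means identical rankings, -1 means exactly reversed rankings, and 0 means no monotonic relationship.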

Results

Speed (64k-token surveys)

Value: 73.59 surveys/hour

Baseline: human 0.07; naive RAG 12.56

Citation Recall (64k)

Value: 82.25%

Baseline: naive RAG 68.79%, human 86.33%

Citation Precision (64k)

Value: 77.41%

Baseline: naive RAG 61.97%, human 77.78%

Content Quality (64k avg scores)

Value: Coverage 4.73, Structure 4.33, Relevance 4.86 (5-pt)

Baseline: human 5.00, 4.66, 5.00

Ablation: no retrieval

Value: Recall 60.11%, Precision 51.65%

Baseline: AutoSurvey with retrieval: Recall 83.48%, Precision 77.15%

Multiple-choice accuracy (survey as context)

Value: 67.60%

Baseline: Direct 58.40%, naive RAG 65.20%, upper-bound retrieval 73.60%

Meta-eval correlation

Value: Spearman rho 0.5429 (mixture)

Baseline: rho > 0.5 considered strong
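
The ratios and gaps implied by the 64k-token results above can be checked with a few lines of arithmetic. The inputs are the reported values; the derived ratios are computed here and are not stated in the source. Because the human speed of 0.07 is a rounded figure, the human speedup is only approximate.

```python
# Reported 64k-token values from the results above.
speed = {"autosurvey": 73.59, "naive_rag": 12.56, "human": 0.07}       # surveys/hour
recall = {"autosurvey": 82.25, "naive_rag": 68.79, "human": 86.33}     # percent
precision = {"autosurvey": 77.41, "naive_rag": 61.97, "human": 77.78}  # percent

# Speed ratios (derived, not reported in the source).
speedup_vs_human = speed["autosurvey"] / speed["human"]
speedup_vs_rag = speed["autosurvey"] / speed["naive_rag"]

# Citation-quality differences in percentage points.
recall_gain_vs_rag = recall["autosurvey"] - recall["naive_rag"]
precision_gain_vs_rag = precision["autosurvey"] - precision["naive_rag"]
recall_gap_to_human = recall["human"] - recall["autosurvey"]
precision_gap_to_human = precision["human"] - precision["autosurvey"]

print(f"speedup vs human: {speedup_vs_human:.0f}x")
print(f"speedup vs naive RAG: {speedup_vs_rag:.1f}x")
print(f"recall gain vs naive RAG: {recall_gain_vs_rag:.2f} points")
print(f"precision gain vs naive RAG: {precision_gain_vs_rag:.2f} points")
print(f"recall gap to human: {recall_gap_to_human:.2f} points")
print(f"precision gap to human: {precision_gap_to_human:.2f} points")
```

On these numbers AutoSurvey is roughly 1,050 times faster than a human author and about 5.9 times faster than naive RAG, while trailing human citation recall by about 4 points and human precision by under half a point.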

Who Should Care

What To Try In 7 Days

Run the AutoSurvey repo on one target topic to generate an 8k–32k-token draft and inspect its citations.

Integrate an embedding-based RAG step (nomic-embed-text-v1.5) for your internal paper database to keep reviews current.

Use a small LLM ensemble as an automated quality filter for citation recall/precision before human review.
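
The embedding-based retrieval step can be prototyped before any model is wired in. The sketch below uses a toy hashed bag-of-words vector as a stand-in for a real embedding model such as nomic-embed-text-v1.5. The names `toy_embed` and `top_k` and the sample paper database are hypothetical and exist only for illustration.

```python
import hashlib
import math

DIM = 256  # toy dimensionality; a real embedding model defines its own


def toy_embed(text):
    """Stand-in embedding (assumption): hashed bag-of-words, L2-normalised.

    Swap in a real embedding model in practice; this only mimics the interface.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        # md5 gives a hash that is stable across runs, unlike built-in hash().
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def top_k(query, papers, k=2):
    """Return the k paper titles whose abstracts are most similar to the query."""
    q = toy_embed(query)
    scored = []
    for title, abstract in papers.items():
        p = toy_embed(abstract)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored.append((sum(a * b for a, b in zip(q, p)), title))
    scored.sort(reverse=True)
    return [title for _, title in scored[:k]]


papers = {  # hypothetical internal paper database
    "Paper A": "retrieval augmented generation for large language models",
    "Paper B": "image segmentation with convolutional networks",
    "Paper C": "evaluating large language models as judges",
}

print(top_k("retrieval augmented generation", papers, k=1))
```

In a real deployment the abstracts would be embedded once and stored in an index, so that only the query is embedded at request time.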

Agent Features

Memory

  • retrieval memory (paper embeddings)

Planning

  • initial retrieval and outline generation
  • subsection drafting
  • integration and refinement
  • evaluation and iteration
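
The four planning stages above can be read as a chain of functions over a shared state object. The sketch below is only a skeleton: every stage is a stub, and `SurveyState`, the stage functions, and `run_pipeline` are hypothetical names rather than the AutoSurvey repo's API.

```python
from dataclasses import dataclass, field


@dataclass
class SurveyState:
    topic: str
    papers: list = field(default_factory=list)
    outline: list = field(default_factory=list)
    sections: dict = field(default_factory=dict)
    draft: str = ""
    score: float = 0.0


def retrieve_and_outline(state):
    # Stage 1 (stub): a real system would query a paper index and ask an LLM for an outline.
    state.papers = [f"paper about {state.topic} #{i}" for i in range(3)]
    state.outline = ["Introduction", "Methods", "Open Problems"]
    return state


def draft_subsections(state):
    # Stage 2 (stub): each outline entry is drafted independently of the others.
    state.sections = {
        heading: f"[draft of '{heading}' citing {len(state.papers)} papers]"
        for heading in state.outline
    }
    return state


def integrate_and_refine(state):
    # Stage 3 (stub): merge the drafts in outline order; refinement would happen here.
    state.draft = "\n\n".join(f"{h}\n{state.sections[h]}" for h in state.outline)
    return state


def evaluate(state):
    # Stage 4 (stub): placeholder score, the fraction of outline entries that have a draft.
    state.score = len(state.sections) / len(state.outline)
    return state


STAGES = [retrieve_and_outline, draft_subsections, integrate_and_refine, evaluate]


def run_pipeline(topic):
    state = SurveyState(topic=topic)
    for stage in STAGES:
        state = stage(state)
    return state


result = run_pipeline("in-context learning")
print(result.score)  # 1.0
```

Keeping each stage as a function over one state object makes it easy to rerun a single stage, for example redrafting after evaluation, which is one way to realise the "iteration" step.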

Tool Use

  • embedding retrieval
  • RAG
  • multi-LLM evaluation

Frameworks

  • AutoSurvey

Architectures

  • outline-driven pipeline
  • parallel LLM workers for subsections

Collaboration

  • parallel multi-LLM drafting
  • LLM ensemble voting for evaluation
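
One simple way to combine several LLM judges is to average their scores per criterion and keep the candidate with the highest overall mean. The sketch below shows this averaging variant, which may differ from the exact aggregation AutoSurvey uses. The judge names and scores are hypothetical.

```python
from statistics import mean

CRITERIA = ("coverage", "structure", "relevance")


def aggregate(judge_scores):
    """Average each criterion across judges.

    judge_scores maps judge name -> {criterion: score on a 5-point scale}.
    """
    return {c: mean(scores[c] for scores in judge_scores.values()) for c in CRITERIA}


def pick_best(candidates):
    """Return the candidate name with the highest mean aggregated score."""
    def overall(name):
        return mean(aggregate(candidates[name]).values())
    return max(candidates, key=overall)


# Hypothetical scores from three judges for two candidate drafts.
candidates = {
    "draft_1": {
        "judge_a": {"coverage": 4, "structure": 4, "relevance": 5},
        "judge_b": {"coverage": 4, "structure": 3, "relevance": 5},
        "judge_c": {"coverage": 5, "structure": 4, "relevance": 5},
    },
    "draft_2": {
        "judge_a": {"coverage": 3, "structure": 4, "relevance": 4},
        "judge_b": {"coverage": 4, "structure": 4, "relevance": 4},
        "judge_c": {"coverage": 3, "structure": 3, "relevance": 4},
    },
}

print(pick_best(candidates))  # draft_1
```

Averaging dampens the bias of any single judge; a majority vote over per-judge winners is an alternative when judges use incomparable scales.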

Optimization Features

Token Efficiency

  • outline-guided chunking reduces redundant context

System Optimization

  • parallel subsection generation to speed up end-to-end time
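
Parallel subsection generation suits a thread pool, because each LLM call spends most of its time waiting on the network. The sketch below replaces the LLM call with a short sleep. `draft_subsection` and `draft_all` are hypothetical names, and the latency is made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def draft_subsection(heading):
    """Hypothetical stand-in for one LLM call; sleeps to mimic network latency."""
    time.sleep(0.05)
    return f"{heading}: [draft text]"


def draft_all(headings, max_workers=8):
    # Executor.map preserves input order, so sections come back in outline order
    # even though the calls finish at different times.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(draft_subsection, headings))


headings = [f"Section {i}" for i in range(1, 9)]

start = time.perf_counter()
sections = draft_all(headings)
elapsed = time.perf_counter() - start

print(f"{len(sections)} sections drafted in {elapsed:.2f}s")
```

With eight workers the eight simulated calls overlap, so wall-clock time stays close to one call's latency rather than the sum of all eight.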

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Citation errors remain: overgeneralization is the largest issue (51% of sampled failures).
  • System depends on the retrieval database; paywalled or non-indexed papers are missed.
  • Structure sometimes lags human writing; final human polishing is recommended for publication-grade surveys.

When Not To Use

  • When you need a flawless, peer-reviewed survey without any human check.
  • For topics where key sources are behind paywalls or absent from the retrieval corpus.
  • If you require guaranteed formal proofs, legal claims, or regulatory-compliant citations.

Failure Modes

  • Overgeneralization: claims extend beyond what cited sources support (majority of errors).
  • Misalignment: irrelevant citations that are loosely related but do not support claims.
  • Misinterpretation: small fraction where sources are read incorrectly.
  • Bias toward papers in the retrieval corpus; misses non-indexed literature.

Core Entities

Models

  • Claude-3-Haiku
  • GPT-4
  • Gemini-1.5-Pro

Metrics

  • Citation Recall
  • Citation Precision
  • Coverage (5-pt)
  • Structure (5-pt)
  • Relevance (5-pt)
  • Spearman's rho
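
The two citation metrics can be illustrated with a simplified reading, which is an assumption here and not the paper's exact definition: recall is the share of claims supported by at least one of their cited papers, and precision is the share of individual citations that support the claim they are attached to. The support judgments below are hypothetical; in practice they would come from an LLM or NLI check.

```python
def citation_metrics(claims):
    """Compute (recall, precision) from per-claim support judgments.

    claims: list of dicts, one per claim, mapping cited paper id -> bool
    (True if that paper supports the claim). Simplified definitions (assumption).
    """
    total_claims = len(claims)
    supported_claims = sum(1 for cited in claims if any(cited.values()))
    total_citations = sum(len(cited) for cited in claims)
    supporting_citations = sum(sum(cited.values()) for cited in claims)

    recall = supported_claims / total_claims if total_claims else 0.0
    precision = supporting_citations / total_citations if total_citations else 0.0
    return recall, precision


# Hypothetical judgments for four claims in a generated survey.
claims = [
    {"p1": True, "p2": False},  # supported, but one citation is irrelevant
    {"p3": True},               # supported
    {"p4": False},              # unsupported: the citation does not back the claim
    {"p5": True, "p6": True},   # supported by both citations
]

recall, precision = citation_metrics(claims)
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.75 precision=0.67
```

Under this reading, overgeneralized claims lower recall, while loosely related citations lower precision.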

Datasets

  • arXiv corpus (530k computer-science papers)

Benchmarks

  • AutoSurvey evaluation (citation & content quality metrics)

Context Entities

Models

  • Claude-3-Haiku (writer baseline)
  • GPT-4 (evaluator/writer)
  • Gemini-1.5-Pro (evaluator)

Metrics

  • Speed (surveys/hour)
  • Accuracy

Datasets

  • Selected 20 human-written surveys (for comparisons and meta-eval)

Benchmarks

  • Human expert rankings (meta-evaluation)