Overview
The pipeline combines existing building blocks (RAG, outline-driven generation, LLM ensembles) into a practical system with strong empirical gains in speed and near-human citation/content quality on tested topics.
Citations: 4
Evidence Strength: 0.80
Confidence: 0.78
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 7/7
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
AutoSurvey turns long, costly survey writing into fast, repeatable drafts that are almost human-quality for coverage and relevance, letting teams scan and document literature rapidly and cheaply.
Summary TLDR
AutoSurvey is a pipeline that combines retrieval, parallel LLM drafting, and multi-LLM evaluation to automatically generate long literature surveys (up to 64k tokens). It uses iterative Retrieval-Augmented Generation (RAG) to fetch up-to-date papers, generates an outline, drafts subsections in parallel, refines and checks citations, and ranks candidates with a Multi-LLM-as-Judge system calibrated by humans. On 20 LLM-related topics it matches or approaches human citation and content quality while running tens of surveys per hour (e.g., 73.6 surveys/hour for a 64k-token output) at a low token cost; main failure modes are citation overgeneralization and occasional misalignment. (See Table 2, 3
Problem Statement
Writing up-to-date, comprehensive surveys is slow and hard because LLMs have limited output windows, may hallucinate or lack the newest papers, and there is no scalable automatic evaluation that matches human judgment.
Main Contribution
A practical pipeline (AutoSurvey) that blends retrieval, outline-driven parallel drafting, refinement, and multi-LLM judging to produce long surveys.
A retrieval/real-time update strategy so generated surveys cite recent papers and reduce hallucinated references.
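The outline-then-parallel-drafting flow described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not AutoSurvey's actual code: `retrieve`, `write_outline`, `draft_subsection`, and `build_survey` are hypothetical names, the stub bodies stand in for embedding retrieval and LLM calls, and the refinement and judging stages are omitted.

```python
from concurrent.futures import ThreadPoolExecutor

# Every function below is a hypothetical stand-in, not AutoSurvey's API.

def retrieve(query, corpus, k=2):
    """Toy retrieval: rank titles by how many query words they share."""
    words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda title: -len(words & set(title.lower().split())))
    return ranked[:k]

def write_outline(topic):
    """Placeholder for the LLM call that produces the outline."""
    return [f"Background of {topic}", f"Methods for {topic}", f"Open problems in {topic}"]

def draft_subsection(heading, corpus):
    """Placeholder for the LLM call that drafts one subsection with citations."""
    refs = retrieve(heading, corpus)
    return f"## {heading}\n(draft citing: {', '.join(refs)})"

def build_survey(topic, corpus, workers=4):
    """Generate the outline, draft all subsections concurrently, merge in outline order."""
    outline = write_outline(topic)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in outline order even though drafts run concurrently
        drafts = list(pool.map(lambda h: draft_subsection(h, corpus), outline))
    return "\n\n".join(drafts)

corpus = [
    "A survey of retrieval methods",
    "Scaling laws for language models",
    "Retrieval augmented generation background",
]
survey = build_survey("retrieval", corpus)
print(survey)
```

Drafting subsections independently is what lets the output grow past a single model's output window; the cost is that a later refinement pass is needed to smooth transitions between sections.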
Key Findings
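AutoSurvey is far faster than humans and naive RAG for long surveys: 73.59 surveys/hour at 64k tokens, versus 0.07 for humans and 12.56 for naive RAG.
Citation quality approaches human levels and improves over naive RAG: 82.25% citation recall at 64k tokens, versus 86.33% for humans and 68.79% for naive RAG.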
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Speed (64k-token surveys) | 73.59 surveys/hour | human 0.07; naive RAG 12.56 (surveys/hour) | ≈5.9× naive RAG; ≈1,051× human | 20 LLM topics | Measured end-to-end API time per method | Table 2 |
| Citation Recall (64k) | 82.25% | naive RAG 68.79%; human 86.33% | +13.46 pp vs naive RAG; -4.08 pp vs human | 20 LLM topics | Citation recall computed via NLI-based support check | Table 2 |
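
The relative gains in the two rows above follow directly from the reported values. The short Python check below only redoes that arithmetic; the variable names are illustrative, and the numbers are copied from the table.

```python
# Values copied from the two result rows (Table 2, 64k-token surveys).
speed = {"AutoSurvey": 73.59, "naive RAG": 12.56, "human": 0.07}    # surveys/hour
recall = {"AutoSurvey": 82.25, "naive RAG": 68.79, "human": 86.33}  # percent

speedup_vs_rag = speed["AutoSurvey"] / speed["naive RAG"]
speedup_vs_human = speed["AutoSurvey"] / speed["human"]
recall_gain_vs_rag = recall["AutoSurvey"] - recall["naive RAG"]   # percentage points
recall_gap_to_human = recall["human"] - recall["AutoSurvey"]      # percentage points

print(f"{speedup_vs_rag:.1f}x naive RAG, {speedup_vs_human:.0f}x human")
print(f"+{recall_gain_vs_rag:.2f} pp vs naive RAG, -{recall_gap_to_human:.2f} pp vs human")
```

The speed advantage is roughly sixfold over naive RAG and about three orders of magnitude over human writing, while citation recall still trails the human baseline by about four percentage points.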
What To Try In 7 Days
Run the AutoSurvey repo on one target topic to generate an 8k–32k-token draft and inspect its citations.
Integrate an embedding-based RAG step (nomic-embed-text-v1.5) for your internal paper database to keep reviews current.
Use a small LLM ensemble as an automated quality filter for citation recall/precision before human review.
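The retrieval and ensemble-filter steps above can be prototyped in a few lines. The Python sketch below is an assumption-laden illustration: `top_k` and `passes_filter` are hypothetical helpers, the three-dimensional vectors are toy stand-ins for real embeddings (for example from nomic-embed-text-v1.5), the judges are stubs returning fixed scores instead of LLM calls, and the 0.8 threshold is arbitrary.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, paper_vecs, k=2):
    """Return the ids of the k papers whose embeddings are closest to the query."""
    ranked = sorted(paper_vecs, key=lambda pid: cosine(query_vec, paper_vecs[pid]), reverse=True)
    return ranked[:k]

def passes_filter(draft, judges, threshold=0.8):
    """Average the judges' scores (each assumed to lie in [0, 1]) and compare to a threshold."""
    scores = [judge(draft) for judge in judges]
    return sum(scores) / len(scores) >= threshold

# Toy vectors; real ones would come from an embedding model.
paper_vecs = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.7, 0.7, 0.0],
    "paper_c": [0.0, 0.0, 1.0],
}
hits = top_k([1.0, 0.1, 0.0], paper_vecs)

# Stub judges with fixed scores; real judges would be LLM calls.
judges = [lambda d: 0.9, lambda d: 0.8, lambda d: 0.85]
keep = passes_filter("draft text", judges)
print(hits, keep)
```

Averaging several judges damps the bias of any single model; drafts that fall below the threshold go back for regeneration before they reach a human reviewer.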
Risks & Boundaries
Limitations
Citation errors remain: overgeneralization is the largest issue (51% of sampled failures).
System depends on the retrieval database; paywalled or non-indexed papers are missed.
When Not To Use
When you need a flawless, peer-reviewed survey without any human check.
For topics where key sources are behind paywalls or absent from the retrieval corpus.
Failure Modes
Overgeneralization: claims extend beyond what cited sources support (majority of errors).
Misalignment: cited sources are only loosely related to the claim and do not actually support it.
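Both failure modes show up as claims that their cited sources do not support, which is what a citation support check measures. The Python sketch below assumes citation recall is the fraction of claims backed by at least one cited source; `supports` is a crude word-overlap placeholder for a real NLI model, and all names and example sentences are hypothetical.

```python
def supports(claim, source_text):
    """Hypothetical stand-in for an NLI model: true if every longer word
    of the claim also appears in the source text."""
    claim_words = {w for w in claim.lower().split() if len(w) > 3}
    return claim_words <= set(source_text.lower().split())

def citation_recall(claims):
    """Fraction of claims supported by at least one of their cited sources."""
    supported = sum(any(supports(c["text"], s) for s in c["sources"]) for c in claims)
    return supported / len(claims)

claims = [
    {"text": "retrieval reduces hallucinated references",
     "sources": ["we show retrieval reduces hallucinated references in surveys"]},
    # Overgeneralization: the claim goes further than the source does.
    {"text": "retrieval always eliminates hallucination",
     "sources": ["retrieval reduces hallucinated references in surveys"]},
]
recall = citation_recall(claims)
print(recall)
```

A word-overlap test like this misses paraphrase entirely, so it only shows the shape of the computation; an entailment model would replace `supports` in practice.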

