Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
4
Why It Matters For Business
AutoSurvey turns slow, costly survey writing into fast, repeatable drafting. Its drafts approach human quality on coverage and relevance, so teams can scan and document literature quickly and cheaply.
Summary TLDR
AutoSurvey is a pipeline that combines retrieval, parallel LLM drafting, and multi-LLM evaluation to automatically generate long literature surveys (up to 64k tokens). It uses iterative Retrieval-Augmented Generation (RAG) to fetch up-to-date papers, generates an outline, drafts subsections in parallel, refines and checks citations, and ranks candidates with a Multi-LLM-as-Judge system calibrated by humans. On 20 LLM-related topics it matches or approaches human citation and content quality while running tens of surveys per hour (e.g., 73.6 surveys/hour for a 64k-token output) at a low token cost; main failure modes are citation overgeneralization and occasional misalignment. (See Table 2, 3
Problem Statement
Writing up-to-date, comprehensive surveys with LLMs is slow and hard: LLMs have limited output windows, they may hallucinate or miss the newest papers, and no scalable automatic evaluation matches human judgment.
Main Contribution
A practical pipeline (AutoSurvey) that blends retrieval, outline-driven parallel drafting, refinement, and multi-LLM judging to produce long surveys.
A retrieval and real-time update strategy that lets generated surveys cite recent papers and reduces hallucinated references.
A Multi-LLM-as-Judge evaluation calibrated by humans to score citation quality and content quality automatically.
Thorough experiments showing speed and quality trade-offs versus naive RAG and human-authored surveys, plus open-source code and prompts.
Key Findings
AutoSurvey is far faster than humans and naive RAG for long surveys.
Citation quality approaches human levels and improves over naive RAG.
Content quality (coverage, structure, relevance) is close to human-written surveys.
Removing retrieval greatly harms citation accuracy.
Generated surveys can improve downstream model knowledge.
Automated judging correlates moderately with human rankings.
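The last finding is a rank-correlation claim, so a small worked calculation may help. The following is a minimal sketch, assuming Spearman's rho (listed under Metrics) computed over rankings without ties. The function name `spearman_rho` and the rank values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two rankings of the same items, assuming no tied ranks."""
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must have the same length")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items")
    # Sum of squared rank differences, then the closed-form formula for untied ranks.
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical ranks (1 = best) that humans and an LLM judge assign to five surveys.
human_ranks = [1, 2, 3, 4, 5]
judge_ranks = [2, 1, 3, 5, 4]
rho = spearman_rho(human_ranks, judge_ranks)
print(f"Spearman's rho: {rho:.2f}")
```

A value near 1 means the automated judge orders surveys almost the same way as the human experts; a value near 0 means the two orderings are unrelated.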
Results
Speed (64k-token surveys)
Citation Recall (64k)
Citation Precision (64k)
Content Quality (64k avg scores)
Ablation: no retrieval
Accuracy
Meta-eval correlation
Who Should Care
What To Try In 7 Days
Run the AutoSurvey repo on one target topic to generate an 8k–32k-token draft and inspect its citations.
Integrate an embedding-based RAG step (nomic-embed-text-v1.5) for your internal paper database to keep reviews current.
Use a small LLM ensemble as an automated quality filter for citation recall/precision before human review.
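The embedding-based retrieval step suggested above can be sketched as a cosine-similarity search over paper vectors. This is a minimal illustration, not the AutoSurvey implementation: the vectors are toy values standing in for real embeddings (for example from nomic-embed-text-v1.5), and the names `cosine_similarity`, `top_k_papers`, and the paper IDs are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_papers(query_vec, paper_vecs, k=2):
    """Return the IDs of the k papers whose vectors are most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), paper_id)
        for paper_id, vec in paper_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [paper_id for _, paper_id in scored[:k]]


# Toy 3-dimensional vectors standing in for real paper embeddings (assumption).
paper_vecs = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.1, 0.9, 0.0],
    "paper_c": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.0, 0.0]
top = top_k_papers(query_vec, paper_vecs, k=2)
print(top)
```

For a corpus of realistic size, a vector index would likely replace this linear scan; the ranking logic stays the same.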
Agent Features
Memory
- retrieval memory (paper embeddings)
Planning
- initial retrieval and outline generation
- subsection drafting
- integration and refinement
- evaluation and iteration
Tool Use
- embedding retrieval
- RAG
- multi-LLM evaluation
Frameworks
- AutoSurvey
Architectures
- outline-driven pipeline
- parallel LLM workers for subsections
Collaboration
- parallel multi-LLM drafting
- LLM ensemble voting for evaluation
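One simple way to combine several LLM judges is to average their per-criterion scores. The sketch below assumes each judge returns a 5-point score for coverage, structure, and relevance, and that the ensemble takes the mean; the source only says "ensemble voting", so the aggregation rule, the function name `aggregate_judge_scores`, and the scores are all assumptions for illustration.

```python
from statistics import mean


def aggregate_judge_scores(scores_by_judge):
    """Average each criterion over all judges, using only criteria every judge scored."""
    if not scores_by_judge:
        raise ValueError("need at least one judge")
    per_judge = list(scores_by_judge.values())
    shared = set(per_judge[0])
    for scores in per_judge[1:]:
        shared &= set(scores)
    return {c: mean(scores[c] for scores in per_judge) for c in sorted(shared)}


# Hypothetical 5-point scores from three judge models (not results from the paper).
scores_by_judge = {
    "judge_1": {"coverage": 4, "structure": 4, "relevance": 5},
    "judge_2": {"coverage": 5, "structure": 3, "relevance": 5},
    "judge_3": {"coverage": 4, "structure": 4, "relevance": 4},
}
combined = aggregate_judge_scores(scores_by_judge)
for criterion, score in combined.items():
    print(f"{criterion}: {score:.2f}")
```

Averaging dampens the bias of any single judge; a median or majority vote would be a reasonable alternative when one judge is an outlier.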
Optimization Features
Token Efficiency
- outline-guided chunking reduces redundant context
System Optimization
- parallel subsection generation to speed up end-to-end time
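Parallel subsection generation can be sketched with a thread pool, since LLM calls are typically I/O-bound. This is an illustrative sketch only: `draft_subsection` is a hypothetical placeholder for a real LLM call, and the outline titles and `max_workers` value are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def draft_subsection(title):
    """Placeholder for an LLM call (assumption); returns dummy text for one subsection."""
    return f"[draft for: {title}]"


def draft_in_parallel(outline, max_workers=4):
    """Draft every subsection concurrently and return the drafts in outline order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # so the drafts line up with the outline even if calls finish out of order.
        return list(pool.map(draft_subsection, outline))


# Hypothetical outline for a short survey.
outline = ["Background", "Methods", "Evaluation"]
drafts = draft_in_parallel(outline)
for title, text in zip(outline, drafts):
    print(title, "->", text)
```

Because the subsections are drafted independently, wall-clock time is bounded by the slowest call rather than the sum of all calls, which is where the end-to-end speedup comes from.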
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Citation errors remain: overgeneralization is the largest issue (51% of sampled failures).
- System depends on the retrieval database; paywalled or non-indexed papers are missed.
- Structure sometimes lags human writing; final human polishing is recommended for publication-grade surveys.
When Not To Use
- When you need a flawless, peer-reviewed survey without any human check.
- For topics where key sources are behind paywalls or absent from the retrieval corpus.
- If you require guaranteed formal proofs, legal claims, or regulatory-compliant citations.
Failure Modes
- Overgeneralization: claims extend beyond what cited sources support (majority of errors).
- Misalignment: citations that are only loosely related to a claim and do not support it.
- Misinterpretation: a small fraction of errors in which a cited source is read incorrectly.
- Bias toward papers in the retrieval corpus; misses non-indexed literature.
Core Entities
Models
- Claude-3-Haiku
- GPT-4
- Gemini-1.5-Pro
Metrics
- Citation Recall
- Citation Precision
- Coverage (5-pt)
- Structure (5-pt)
- Relevance (5-pt)
- Spearman's rho
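The two citation metrics can be illustrated with a simplified calculation. The sketch below assumes a judge has already labelled each citation as supporting its claim or not. It treats recall as the share of claims with at least one supporting citation and precision as the share of citations that support their claim. These simplified definitions, the function names, and the labels are assumptions for illustration and may differ from the exact formulation in the paper.

```python
def citation_recall(claims):
    """Share of claims backed by at least one supporting citation (simplified)."""
    if not claims:
        return 0.0
    supported = sum(1 for verdicts in claims if any(verdicts))
    return supported / len(claims)


def citation_precision(claims):
    """Share of all citations that actually support the claim they are attached to."""
    total = sum(len(verdicts) for verdicts in claims)
    if total == 0:
        return 0.0
    supporting = sum(sum(1 for v in verdicts if v) for verdicts in claims)
    return supporting / total


# Hypothetical judge verdicts: one inner list per claim, one boolean per citation.
claims = [
    [True, False],  # claim 1: two citations, one supports it
    [False],        # claim 2: its only citation does not support it
    [True, True],   # claim 3: both citations support it
    [],             # claim 4: no citation at all
]
recall = citation_recall(claims)
precision = citation_precision(claims)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

Under these definitions, an uncited claim lowers recall but not precision, while a loosely related citation lowers precision even when the claim is otherwise supported.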
Datasets
- arXiv corpus (530k computer-science papers)
Benchmarks
- AutoSurvey evaluation (citation & content quality metrics)
Context Entities
Models
- Claude-3-Haiku (writer baseline)
- GPT-4 (evaluator/writer)
- Gemini-1.5-Pro (evaluator)
Metrics
- Speed (surveys/hour)
- Accuracy
Datasets
- Selected 20 human-written surveys (for comparisons and meta-eval)
Benchmarks
- Human expert rankings (meta-evaluation)

