Overview
Production Readiness
0.5
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Modeling writing as iterative edit planning improves global structure and citation fidelity. This can cut human editing time and make automated drafting more controllable for long documents.
Summary TLDR
The paper turns scientific writing into an iterative planning problem over editable outlines. It trains a policy with two-stage optimization (backward outline reconstruction, then forward value-guided RL) to improve global structure, information coverage, and citation consistency. On a survey generation task, a new arXiv-derived benchmark (1,500 papers) and experiments with 200–300 step budgets show better structural coherence and citation reliability than several one-shot baselines.
Problem Statement
LLMs write locally fluent text but struggle to plan long documents, cover input materials, and keep citations faithful. The authors reformulate writing as sequential edits on explicit outline states to enable better long-horizon credit assignment and controllable optimization.
Main Contribution
A state-action formulation that models paper generation as iterative edits over a hierarchical outline (diff-based actions).
A two-stage training recipe: (1) backward outline reconstruction from partial plans; (2) forward value-guided RL with rewards for structure, factuality, and citation fidelity.
A new benchmarking pipeline built from 1,500 arXiv papers and metrics that measure planning, input coverage, citation faithfulness, and structure.
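The state-action formulation can be illustrated with a minimal sketch. The `Node` class, the three operation names (`insert`, `delete`, `rename`), and the path-of-indices addressing are assumptions made for illustration; this summary does not specify the paper's actual action schema.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One heading in a hierarchical outline (hypothetical structure)."""
    title: str
    children: list["Node"] = field(default_factory=list)


def apply_edit(root: Node, action: dict) -> Node:
    """Apply one diff-style edit in place and return the root.

    `path` is a list of child indices leading from the root to the target node.
    """
    node = root
    for i in action["path"]:
        node = node.children[i]
    op = action["op"]
    if op == "insert":
        node.children.insert(action["index"], Node(action["title"]))
    elif op == "delete":
        del node.children[action["index"]]
    elif op == "rename":
        node.title = action["title"]
    else:
        raise ValueError(f"unknown op: {op}")
    return root


def render(node: Node, depth: int = 0) -> list[str]:
    """Flatten the outline into indented lines for inspection."""
    lines = ["  " * depth + node.title]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Build a small outline through a sequence of edits (the "trajectory").
outline = Node("Survey")
edits = [
    {"op": "insert", "path": [], "index": 0, "title": "Introduction"},
    {"op": "insert", "path": [], "index": 1, "title": "Methods"},
    {"op": "insert", "path": [1], "index": 0, "title": "Planning"},
    {"op": "rename", "path": [1], "title": "Approaches"},
]
for edit in edits:
    apply_edit(outline, edit)
```

Keeping each action small and explicit means every intermediate outline is a well-defined state, which is what makes per-step credit assignment possible in principle.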
Key Findings
Fine-tuned models using the OutlineForge pipeline beat one-shot outline baselines on the survey generation task.
Citation retrieval gains saturate early in the iterative process, roughly within the first 50–150 edit steps.
A modest planning horizon is enough to generate a full article: the experiments use budgets of 200–300 edit steps.
Small models (<2B) show limited benefit from the proposed fine-tuning.
Results
F1 (survey generation, 200 steps)
Citation retrieval saturation
Estimated editing steps to generate article
Who Should Care
What To Try In 7 Days
Prototype an outline-edit loop: represent documents as hierarchical outlines and apply simple structural edits.
Fine-tune a mid-size model (≈3–4B) on a small set of outline-to-edit examples to test structural gains.
Limit iterative runs to ~200 steps and focus retrieval early (first 50–150 steps) to save compute and get most citations fast.
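A budgeted loop of this kind might look like the sketch below. The function names, the state dictionary, and the default window of steps 0 to 150 are all assumptions for illustration; `propose_edit` and `retrieve` stand in for a real policy model and retriever.

```python
def run_edit_loop(propose_edit, retrieve, max_steps=200, retrieval_window=(0, 150)):
    """Run a budgeted edit loop; call `retrieve` only inside the early window.

    `propose_edit(step, state)` returns a new state, or None to stop early.
    `retrieve(step, state)` returns a list of citation keys to add.
    Both callables are placeholders for a model and a retriever.
    """
    state = {"sections": [], "citations": set()}
    retrieval_calls = 0
    start, end = retrieval_window
    for step in range(max_steps):
        if start <= step < end:
            state["citations"].update(retrieve(step, state))
            retrieval_calls += 1
        new_state = propose_edit(step, state)
        if new_state is None:  # policy signals that the outline is done
            break
        state = new_state
    return state, retrieval_calls


# Toy stand-ins: add one section per step for five steps, then stop.
def toy_policy(step, state):
    if step >= 5:
        return None
    return {**state, "sections": state["sections"] + [f"Section {step + 1}"]}


def toy_retriever(step, state):
    return [f"ref{step % 3}"]


final_state, calls = run_edit_loop(
    toy_policy, toy_retriever, max_steps=200, retrieval_window=(0, 3)
)
```

Counting retrieval calls separately from edit steps makes it easy to check how much compute the early-retrieval window actually saves in a prototype.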
Agent Features
Memory
- explicit outline state as short-term memory
- edit trajectory history for preference learning
Planning
- long-horizon planning (200–300 steps)
- two-stage value-guided optimization
Tool Use
- LLM invocation for semantic edits
- deterministic structural edits (executed directly, without a model call)
Frameworks
- PPO-style RL
- preference alignment from edit transcripts
Is Agentic
true
Architectures
- hierarchical state-action outline
Collaboration
- human editing trajectories used as preference data
Optimization Features
Model Optimization
- fine-tuning on outline-edit trajectories
System Optimization
- separate direct structural edits from LLM-generated content to save model calls
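One way to picture this split is a dispatcher that routes each action either to plain list operations or to a model call. The operation names, the flat list-of-titles outline, and the `CountingLLM` stub are hypothetical; they only illustrate the idea of avoiding model calls for purely structural changes.

```python
STRUCTURAL_OPS = {"move", "delete", "swap"}  # assumed: these need no text generation


def execute(outline: list[str], action: dict, llm) -> list[str]:
    """Apply one action to a flat list of section titles.

    Structural ops run as plain list operations; anything else is treated
    as a semantic edit and delegated to `llm`, a caller-supplied callable.
    """
    result = list(outline)
    op = action["op"]
    if op in STRUCTURAL_OPS:
        if op == "delete":
            del result[action["index"]]
        elif op == "move":
            result.insert(action["to"], result.pop(action["index"]))
        else:  # swap
            i, j = action["index"], action["to"]
            result[i], result[j] = result[j], result[i]
    else:
        result[action["index"]] = llm(action["instruction"], result[action["index"]])
    return result


class CountingLLM:
    """Stub that records how many times the 'model' is invoked."""

    def __init__(self):
        self.calls = 0

    def __call__(self, instruction: str, text: str) -> str:
        self.calls += 1
        return f"{text} ({instruction})"


llm = CountingLLM()
outline = ["Background", "Intro", "Methods"]
actions = [
    {"op": "swap", "index": 0, "to": 1},
    {"op": "rewrite", "index": 2, "instruction": "revised"},
    {"op": "delete", "index": 1},
]
for action in actions:
    outline = execute(outline, action, llm)
```

In this toy run, three actions are applied but the stub model is invoked only once, for the single semantic edit.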
Training Optimization
- backward outline reconstruction (structure consistency)
- forward value-guided RL with multi-aspect rewards
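How the multi-aspect rewards are combined is not stated in this summary. As a purely illustrative assumption, a normalized weighted sum over per-aspect scores in [0, 1] could look like this; the aspect names and weights are hypothetical, not the paper's.

```python
def combined_reward(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Normalized weighted sum of per-aspect scores (assumed to lie in [0, 1])."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[k] * scores[k] for k in weights) / total_weight


# Hypothetical per-aspect scores for one edited outline.
scores = {"structure": 0.8, "factuality": 0.6, "citation_fidelity": 1.0}
equal = {"structure": 1.0, "factuality": 1.0, "citation_fidelity": 1.0}
citation_heavy = {"structure": 1.0, "factuality": 1.0, "citation_fidelity": 2.0}

r_equal = combined_reward(scores, equal)
r_citation = combined_reward(scores, citation_heavy)
```

Normalizing by the total weight keeps the reward in the same range as the individual scores, so changing the weights shifts emphasis without rescaling the signal.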
Inference Optimization
- step budgeting (200 or 300 steps)
- early-focused retrieval to reduce later compute
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Automatic evaluation relies partly on LLM judges, which can introduce bias and mismatch with human experts.
- Experiments focus on survey generation; other scholarly genres were not evaluated.
- State-action schemas are manually designed and may not generalize to very different document formats.
- Iterative editing can accumulate errors if early structural choices are wrong.
When Not To Use
- For short, one-shot generation tasks where global structure is trivial.
- When you must produce original experimental results rather than surveys or summaries.
- If you have only very small models (<2B) with little domain knowledge to leverage.
Failure Modes
- Early structural mistakes propagate and are costly to revise across many edit steps.
- LLM-based automatic judges can mis-evaluate factuality or citation relevance.
- Small models fail to learn useful editing strategies due to limited world knowledge.
Core Entities
Models
- Gemma2-2B
- Qwen3-1.8B
- Phi-3.8B
- GPT-4o-mini
- Claude-3.5-Haiku
- Llama-3.1-Instruct-70B
Metrics
- Precision
- Recall
- F1
- structural completeness
- citation relevance
- information density
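For reference, precision, recall, and F1 can be computed over sets of items as sketched below. Applying them to sets of cited reference keys is an assumption here; this summary does not say exactly which units the paper scores.

```python
def citation_prf(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1; returns 0.0 where a ratio is undefined."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical reference keys: two of four predictions appear in the gold set of three.
p, r, f1 = citation_prf({"a", "b", "c", "d"}, {"a", "b", "e"})
```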
Datasets
- arXiv-derived outline-edit dataset (1,500 articles)
Benchmarks
- survey generation benchmark (OutlineForge benchmark)

