Use RL over editable outlines to plan and draft long scientific texts with better structure and citation fidelity

January 14, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

The idea is practical and concrete. Experiments on a single survey-generation task across multiple models support the claims, but broader tasks and human expert evaluation are missing, so real-world readiness is moderate.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 70%

Authors

Yilin Bao, Ziyao He, Zayden Yang

Links

Abstract / PDF

Why It Matters For Business

Modeling writing as iterative edit planning improves global structure and citation fidelity. This can cut human editing time and make automated drafting more controllable for long documents.

Summary TLDR

The paper turns scientific writing into an iterative planning problem over editable outlines. It trains a policy with two-stage optimization (backward outline reconstruction, then forward value-guided RL) to improve global structure, information coverage, and citation consistency. On a new arXiv-derived benchmark (1,500 papers), experiments with 200–300 step budgets show better structural coherence and citation reliability than several one-shot baselines on a survey generation task.

Problem Statement

LLMs write locally fluent text but struggle to plan long documents, cover their input materials, and keep citations faithful. The authors reformulate writing as sequential edits on an explicit outline state to enable better long-horizon credit assignment and controllable optimization.

Main Contribution

A state-action formulation that models paper generation as iterative edits over a hierarchical outline (diff-based actions).

A two-stage training recipe: (1) backward outline reconstruction from partial plans; (2) forward value-guided RL with rewards for structure, factuality, and citation fidelity.
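To make the state-action formulation concrete, here is a minimal sketch of a hierarchical outline with diff-style edits. The names `Node`, `Edit`, `apply_edit`, and `render`, and the three operations shown, are illustrative assumptions; the source does not specify the paper's actual action set or data structures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One outline entry: a heading plus nested child sections."""
    title: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical edit format: (operation, path, payload).
# `path` is a tuple of child indices leading from the root to the target node.
Edit = Tuple[str, Tuple[int, ...], str]


def _resolve(root: Node, path: Tuple[int, ...]) -> Node:
    """Walk down the tree following the child indices in `path`."""
    node = root
    for i in path:
        node = node.children[i]
    return node


def apply_edit(root: Node, edit: Edit) -> Node:
    """Apply one edit in place and return the root (the next state)."""
    op, path, payload = edit
    if op == "add":        # append a new child under the node at `path`
        _resolve(root, path).children.append(Node(payload))
    elif op == "rename":   # change the heading of the node at `path`
        _resolve(root, path).title = payload
    elif op == "delete":   # remove the node at `path`
        if not path:
            raise ValueError("cannot delete the root")
        del _resolve(root, path[:-1]).children[path[-1]]
    else:
        raise ValueError(f"unknown op: {op}")
    return root


def render(node: Node, depth: int = 0) -> List[str]:
    """Flatten the outline into indented lines for inspection."""
    lines = ["  " * depth + node.title]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# A short edit trajectory: each step is a small diff against the current state.
outline = Node("Survey")
trajectory: List[Edit] = [
    ("add", (), "Introduction"),
    ("add", (), "Methods"),
    ("add", (1,), "Outline editing"),
    ("rename", (1,), "Approaches"),
    ("delete", (0,), ""),
]
for e in trajectory:
    apply_edit(outline, e)

print("\n".join(render(outline)))
```

Because every step is a small, explicit change to the outline, a trajectory like this can be logged, replayed, or scored step by step, which is what makes per-edit credit assignment possible.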

Key Findings

Fine-tuned models using the OutlineForge pipeline beat one-shot outline baselines on the evaluated survey-generation task.

Numbers: Phi-3.8B F1 = 0.422 vs SurveyForge F1 ≈ 0.313 (200 steps)

Practical Use: Fine-tuning a mid-sized model on edit-trajectory data can materially raise overall F1 on structured survey writing compared to one-pass outline systems.

Evidence Ref: Table 1

Citation retrieval gains saturate early in the iterative process.

Numbers: Retrieved references plateau within 50–150 steps

Practical Use: Allocate retrieval budget to early editing steps; later steps give diminishing returns for finding new citations and are better used for refinement.

Evidence Ref: Figure 3
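One way to check for this kind of saturation in your own runs is to track the cumulative number of retrieved references per step and flag the first step at which a trailing window adds almost nothing new. The sketch below is an assumption-laden illustration: `plateau_step`, its `window` and `min_new` parameters, and the toy curve are all hypothetical and are not taken from the paper.

```python
from typing import List, Optional


def plateau_step(
    cumulative_refs: List[int],
    window: int = 25,
    min_new: int = 1,
) -> Optional[int]:
    """Return the first step t at which fewer than `min_new` references
    were added over the previous `window` steps, or None if that never happens."""
    for t in range(window, len(cumulative_refs)):
        if cumulative_refs[t] - cumulative_refs[t - window] < min_new:
            return t
    return None


# Synthetic curve (not data from the paper): one new reference per step
# until step 60, then no further gains.
toy_curve = [min(t, 60) for t in range(200)]
detected = plateau_step(toy_curve)
print(detected)
```

With this toy curve the detector fires 25 steps after growth stops, since it needs a full window of flat values before declaring a plateau.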

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| F1 (survey generation, 200 steps) | 0.422 ± 0.117 (Phi-3.8B, fine-tuned) | 0.313 ± 0.129 (SurveyForge, reported) | +0.109 | OutlineForge survey benchmark (arXiv subset) | Table 1 shows fine-tuned Phi-3.8B improves F1 vs SurveyForge at 200 steps | Table 1 |
| F1 (survey generation, 200 steps) | 0.352 ± 0.083 (GPT-4o-mini, Ours) | 0.285 ± 0.083 (SurveyForge, reported) | +0.067 | OutlineForge survey benchmark (arXiv subset) | Table 1 lists GPT-4o-mini results vs SurveyForge | Table 1 |
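The deltas can be reproduced directly from the reported means. The short check below also derives a relative improvement; that percentage is computed here for convenience and is not a figure reported in the source. The helper name `deltas` is hypothetical.

```python
from typing import Dict, Tuple

# (value, baseline) mean F1 pairs as listed in the results above.
rows: Dict[str, Tuple[float, float]] = {
    "Phi-3.8B (fine-tuned)": (0.422, 0.313),
    "GPT-4o-mini (Ours)": (0.352, 0.285),
}


def deltas(value: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute delta, relative improvement in percent)."""
    absolute = round(value - baseline, 3)
    relative = round(100 * (value - baseline) / baseline, 1)
    return absolute, relative


for name, (value, baseline) in rows.items():
    absolute, relative = deltas(value, baseline)
    print(f"{name}: +{absolute:.3f} absolute, +{relative:.1f}% relative")
```

These comparisons use means only; the reported standard deviations (roughly ±0.08 to ±0.13) are of similar size to the deltas, so the gains are best read alongside that spread.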

What To Try In 7 Days

Prototype an outline-edit loop: represent documents as hierarchical outlines and apply simple structural edits.

Fine-tune a mid-size model (≈3–4B) on a small set of outline-to-edit examples to test structural gains.

Limit iterative runs to ~200 steps and focus retrieval early (first 50–150 steps) to save compute and get most citations fast.
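The steps above can be prototyped as a fixed-budget loop that stops retrieving after a cutoff and routes deterministic structural edits away from the model. This is a sketch under stated assumptions: `toy_policy` is a hypothetical stand-in for a trained policy, and the action kinds, the 150-step cutoff, and the counting are illustrative choices, not the paper's implementation.

```python
from typing import Callable, Dict, List

Action = Dict[str, str]


def toy_policy(step: int, outline: List[str]) -> Action:
    """Hypothetical stand-in for a trained policy: cycles through action kinds."""
    kinds = ["structural", "retrieve", "llm"]
    return {"kind": kinds[step % 3], "title": f"Section {step}"}


def run_edit_loop(
    propose: Callable[[int, List[str]], Action],
    max_steps: int = 200,
    retrieval_cutoff: int = 150,
) -> Dict[str, int]:
    """Run a fixed-budget edit loop and count how each step was spent."""
    outline: List[str] = []
    counts = {"structural": 0, "llm": 0, "retrieval": 0}
    for step in range(max_steps):
        action = propose(step, outline)
        kind = action["kind"]
        if kind == "retrieve" and step >= retrieval_cutoff:
            # Past the cutoff, spend the step on refinement instead of retrieval.
            kind = "llm"
        if kind == "structural":
            # Deterministic edit applied directly, with no model call.
            outline.append(action["title"])
            counts["structural"] += 1
        elif kind == "retrieve":
            counts["retrieval"] += 1   # a real system would query a retriever here
        else:
            counts["llm"] += 1         # a real system would call the model here
    return counts


counts = run_edit_loop(toy_policy)
print(counts)
```

Counting steps by kind gives a quick estimate of how many model and retriever calls a given budget and cutoff would cost before wiring in real components.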

Agent Features

Memory
explicit outline state as short-term memory; edit trajectory history for preference learning
Planning
long-horizon planning (200–300 steps); two-stage value-guided optimization
Tool Use
LLM invocation for semantic edits; deterministic structural edits (direct exec)
Frameworks
PPO-style RL; preference alignment from edit transcripts
Is Agentic

Yes

Architectures
hierarchical state-action outline
Collaboration
human editing trajectories used as preference data

Optimization Features

Model Optimization
fine-tuning on outline-edit trajectories
System Optimization
separate direct structural edits from LLM-generated content to save model calls
Training Optimization
backward outline reconstruction (structure consistency); forward value-guided RL with multi-aspect rewards
Inference Optimization
step budgeting (200 or 300 steps); early-focused retrieval to reduce later compute
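The source does not say how the multi-aspect rewards are combined or what discounting is used. A weighted sum of per-aspect scores, followed by standard discounted returns as value targets, is one common choice. In the sketch below the weights, the aspect names, and the functions `combined_reward` and `returns_to_go` are assumptions for illustration only.

```python
from typing import Dict, List

# Assumed weights; the paper's actual reward weighting is not given in the source.
WEIGHTS: Dict[str, float] = {"structure": 0.4, "factuality": 0.3, "citation": 0.3}


def combined_reward(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Collapse per-aspect scores into one scalar reward via a weighted sum."""
    return sum(weights[k] * scores[k] for k in weights)


def returns_to_go(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Discounted return from each step onward, a typical regression
    target for a value function over a long edit trajectory."""
    out: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


step_scores = {"structure": 1.0, "factuality": 0.5, "citation": 0.0}
print(combined_reward(step_scores))
print(returns_to_go([1.0, 0.0, 1.0], gamma=0.5))
```

Keeping the aspects separate until the final sum makes it easy to log each one and to see which aspect drives a change in total reward.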

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Automatic evaluation relies partly on LLM judges, which can introduce bias and mismatch with human experts.

Experiments focus on survey generation; other scholarly genres were not evaluated.

When Not To Use

For short, one-shot generation tasks where global structure is trivial.

When you must produce original experimental results rather than surveys or summaries.

Failure Modes

Early structural mistakes propagate and are costly to revise across many edit steps.

LLM-based automatic judges can mis-evaluate factuality or citation relevance.

Core Entities

Models

Gemma2-2B, Qwen3-1.8B, Phi-3.8B, GPT-4o-mini, Claude-3.5-Haiku, Llama-3.1-Instruct-70B

Metrics

Precision, Recall, F1, structural completeness, citation relevance, information density
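For reference, precision, recall, and F1 follow their standard definitions. The sketch below applies them to set overlap between predicted and reference citations; whether the paper computes its F1 over citations, outline headings, or something else is not stated in the source, so the set-overlap framing and the function name `precision_recall_f1` are assumptions.

```python
from typing import Set, Tuple


def precision_recall_f1(predicted: Set[str], reference: Set[str]) -> Tuple[float, float, float]:
    """Standard set-overlap precision, recall, and F1 (harmonic mean)."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example with made-up citation keys.
predicted = {"a", "b", "c", "d"}
reference = {"b", "c", "e"}
p, r, f1 = precision_recall_f1(predicted, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```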

Datasets

arXiv-derived outline-edit dataset (1,500 articles)

Benchmarks

survey generation benchmark (OutlineForge benchmark)