Use RL over editable outlines to plan and draft long scientific texts with better structure and citation fidelity

Overview

Decision Snapshot: Needs Validation

The idea is practical and concrete. Experiments on a focused survey task across multiple models support the claims, but broader tasks and human expert evaluation are missing, so real-world readiness is moderate.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 70%

Authors

Yilin Bao, Ziyao He, Zayden Yang

Links

Abstract / PDF

Why It Matters For Business

Modeling writing as iterative edit planning improves global structure and citation fidelity. This can cut human editing time and make automated drafting more controllable for long documents.

Who Should Care

ML Engineer, Product Manager, Data Scientist, Founder

Summary TLDR

The paper turns scientific writing into an iterative planning problem over editable outlines. It trains a policy with two-stage optimization—backward outline reconstruction and forward value-guided RL—to improve global structure, information coverage, and citation consistency. A new arXiv-derived benchmark (1,500 papers) and experiments (200–300 step budgets) show better structural coherence and citation reliability than several one-shot baselines on a survey generation task.

Problem Statement

LLMs write locally fluent text but struggle to plan long documents, cover input materials, and keep citations faithful. The authors reformulate writing as sequential edits on explicit outline states to enable better long-horizon credit assignment and controllable optimization.

Main Contribution

A state-action formulation that models paper generation as iterative edits over a hierarchical outline (diff-based actions).

A two-stage training recipe: (1) backward outline reconstruction from partial plans; (2) forward value-guided RL with rewards for structure, factuality, and citation fidelity.
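The state-action formulation above can be illustrated with a small sketch. This is not the authors' implementation: the `Node` class, the `(op, path, payload)` edit tuple, and the three operations (`add`, `rename`, `delete`) are hypothetical choices made only to show what "diff-based edits over a hierarchical outline" might look like in code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One outline heading with nested sub-headings (hypothetical state type)."""
    title: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical edit format: (operation, path from root, payload text).
Edit = Tuple[str, Tuple[int, ...], str]


def locate(root: Node, path: Tuple[int, ...]) -> Node:
    """Follow child indices from the root to the addressed node."""
    node = root
    for index in path:
        node = node.children[index]
    return node


def apply_edit(root: Node, edit: Edit) -> None:
    """Apply one diff-style edit to the outline in place."""
    op, path, payload = edit
    if op == "add":  # append a new child under the node at `path`
        locate(root, path).children.append(Node(payload))
    elif op == "rename":  # change the title of the node at `path`
        locate(root, path).title = payload
    elif op == "delete":  # remove the node at `path`
        if not path:
            raise ValueError("cannot delete the root")
        parent = locate(root, path[:-1])
        del parent.children[path[-1]]
    else:
        raise ValueError(f"unknown op: {op}")


def render(node: Node, depth: int = 0) -> List[str]:
    """Flatten the outline into indented lines for inspection."""
    lines = ["  " * depth + node.title]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


outline = Node("Survey")
trajectory: List[Edit] = [
    ("add", (), "Introduction"),
    ("add", (), "Methods"),
    ("add", (1,), "RL for planning"),
    ("rename", (1,), "Approaches"),
]
for edit in trajectory:
    apply_edit(outline, edit)
outline_lines = render(outline)
print("\n".join(outline_lines))
```

Keeping each action as a small diff means a whole edit trajectory can be logged and replayed, which is the kind of record a trained policy would need to learn from.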

Key Findings

Fine-tuned models using the OutlineForge pipeline beat one-shot outline baselines on the evaluated survey generation task.

Numbers: Phi-3.8B F1=0.422 vs SurveyForge F1≈0.313 (200 steps)

Practical Use: Fine-tuning a mid-sized model on edit-trajectory data can materially raise overall F1 on structured survey writing compared to one-pass outline systems.

Evidence Ref: Table 1

Citation retrieval gains saturate early in the iterative process.

Numbers: Retrieved references plateau within 50–150 steps

Practical Use: Allocate retrieval budget to early editing steps; later steps give diminishing returns for finding new citations and are better used for refinement.

Evidence Ref: Figure 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
F1 (survey generation, 200 steps)	0.422 ± 0.117 (Phi-3.8B, finetuned)	0.313 ± 0.129 (SurveyForge, reported)	+0.109	OutlineForge survey benchmark (arXiv subset)	Table 1 shows Phi-3.8B finetuned improves F1 vs SurveyForge at 200 steps	Table 1
F1 (survey generation, 200 steps)	0.352 ± 0.083 (GPT-4o-mini, Ours)	0.285 ± 0.083 (SurveyForge, reported)	+0.067	OutlineForge survey benchmark (arXiv subset)	Table 1 lists GPT-4o-mini results vs SurveyForge	Table 1

What To Try In 7 Days

Prototype an outline-edit loop: represent documents as hierarchical outlines and apply simple structural edits.

Fine-tune a mid-size model (≈3–4B) on a small set of outline-to-edit examples to test structural gains.

Limit iterative runs to ~200 steps and focus retrieval early (first 50–150 steps) to save compute and get most citations fast.
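The step budget and early-retrieval suggestion can be prototyped with a loop like the one below. It is a minimal sketch under stated assumptions: `run_edit_loop`, `retrieval_cutoff`, and the toy `retrieve` and `propose_edit` stubs are hypothetical names, and the stub retriever simply fakes a slow trickle of new references rather than querying a real index.

```python
from typing import Callable, Iterable, List, Set, Tuple


def run_edit_loop(
    propose_edit: Callable[[int, Set[str]], None],
    retrieve: Callable[[int], Iterable[str]],
    max_steps: int = 200,
    retrieval_cutoff: int = 150,
) -> Tuple[Set[str], int]:
    """Run up to `max_steps` edits, retrieving only before `retrieval_cutoff`."""
    references: Set[str] = set()
    retrieval_calls = 0
    for step in range(max_steps):
        if step < retrieval_cutoff:  # spend the retrieval budget early
            references |= set(retrieve(step))
            retrieval_calls += 1
        propose_edit(step, references)  # later steps only refine
    return references, retrieval_calls


# Toy stand-ins (assumptions): a new reference appears every 10 steps.
def toy_retrieve(step: int) -> List[str]:
    return [f"ref{step // 10}"]


edit_log: List[int] = []


def toy_propose_edit(step: int, references: Set[str]) -> None:
    edit_log.append(len(references))  # record how many references were visible


refs, calls = run_edit_loop(toy_propose_edit, toy_retrieve)
print(len(refs), calls, len(edit_log))
```

Logging the reference count per step, as the toy `edit_log` does, is one way to check on your own data whether retrieval actually plateaus before the cutoff you chose.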

Agent Features

Memory

explicit outline state as short-term memory

edit trajectory history for preference learning

Planning

long-horizon planning (200–300 steps)

two-stage value-guided optimization

Tool Use

LLM invocation for semantic edits

deterministic structural edits (direct exec)

Frameworks

PPO-style RL

preference alignment from edit transcripts

Is Agentic

Yes

Architectures

hierarchical state-action outline

Collaboration

human editing trajectories used as preference data

Optimization Features

Model Optimization

fine-tuning on outline-edit trajectories

System Optimization

separate direct structural edits from LLM-generated content to save model calls
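One way to picture this split is a dispatcher that executes structural edits directly and calls the model only for semantic ones. The sketch below is an illustration, not the paper's system: the operation names (`swap`, `delete`, `rewrite`), the dict-based edit format, and `fake_llm` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def apply_edits(
    sections: List[str],
    edits: List[Dict],
    llm: Callable[[str], str],
) -> Tuple[List[str], int]:
    """Apply edits to a copy of `sections`; return the result and the LLM call count."""
    sections = list(sections)
    llm_calls = 0
    for edit in edits:
        op = edit["op"]
        if op == "swap":  # structural: executed directly, no model call
            i, j = edit["i"], edit["j"]
            sections[i], sections[j] = sections[j], sections[i]
        elif op == "delete":  # structural: executed directly, no model call
            del sections[edit["i"]]
        elif op == "rewrite":  # semantic: needs generated content
            sections[edit["i"]] = llm(sections[edit["i"]])
            llm_calls += 1
        else:
            raise ValueError(f"unknown op: {op}")
    return sections, llm_calls


def fake_llm(text: str) -> str:
    """Stand-in for a real model call (assumption): just title-cases the text."""
    return text.title()


result, n_calls = apply_edits(
    ["background", "methods", "draft notes"],
    [
        {"op": "swap", "i": 0, "j": 1},
        {"op": "delete", "i": 2},
        {"op": "rewrite", "i": 0},
    ],
    fake_llm,
)
print(result, n_calls)
```

In this toy run only one of the three edits reaches the model, which is the saving the item above describes.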

Training Optimization

backward outline reconstruction (structure consistency)

forward value-guided RL with multi-aspect rewards
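The summary does not say how the reward aspects are combined. A common and simple choice is a weighted sum, sketched below purely as an assumption: the weights, the `[0, 1]` score range, and the function name `combined_reward` are hypothetical, and only the three aspect names (structure, factuality, citation fidelity) come from the text above.

```python
from typing import Dict

# Hypothetical weights; the source does not report how aspects are balanced.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "structure": 0.5,
    "factuality": 0.25,
    "citation": 0.25,
}


def combined_reward(scores: Dict[str, float],
                    weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of per-aspect scores, each assumed to lie in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing aspect scores: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"score for {name!r} is outside [0, 1]")
    return sum(weights[name] * scores[name] for name in weights)


reward = combined_reward({"structure": 0.8, "factuality": 0.6, "citation": 1.0})
print(round(reward, 3))
```

A scalar like this is what a PPO-style update would consume per step or per episode; how the actual system shapes or schedules its rewards is not described here.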

Inference Optimization

step budgeting (200 or 300 steps)

early-focused retrieval to reduce later compute

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Automatic evaluation relies partly on LLM judges, which can introduce bias and mismatch with human experts.

Experiments focus on survey generation; other scholarly genres were not evaluated.

When Not To Use

For short, one-shot generation tasks where global structure is trivial.

When you must produce original experimental results rather than surveys or summaries.

Failure Modes

Early structural mistakes propagate and are costly to revise across many edit steps.

LLM-based automatic judges can mis-evaluate factuality or citation relevance.

Core Entities

Models

Gemma2-2B, Qwen3-1.8B, Phi-3.8B, GPT-4o-mini, Claude-3.5-Haiku, Llama-3.1-Instruct-70B

Metrics

Precision, Recall, F1, structural completeness, citation relevance, information density
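For readers who want to reproduce the first three metrics on their own outputs, the sketch below computes precision, recall, and F1 as set overlap between generated and reference items. Treating the items as citation identifiers compared by exact match is an assumption; the summary does not specify how the benchmark matches them.

```python
from typing import Iterable, Tuple


def precision_recall_f1(predicted: Iterable[str],
                        gold: Iterable[str]) -> Tuple[float, float, float]:
    """Set-overlap precision, recall, and F1 (exact-match items assumed)."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 4 generated citations, 3 reference citations, 2 in common.
p, r, f1 = precision_recall_f1(["a", "b", "c", "d"], ["b", "c", "e"])
print(round(p, 3), round(r, 3), round(f1, 3))
```

F1 is the harmonic mean of the two, so a system that retrieves many irrelevant references is penalized even when its recall is high.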

Datasets

arXiv-derived outline-edit dataset (1,500 articles)

Benchmarks

survey generation benchmark (OutlineForge benchmark)
