Use RL over editable outlines to plan and draft long scientific texts with better structure and citation fidelity

January 14, 2026 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Yilin Bao, Ziyao He, Zayden Yang

Why It Matters For Business

Modeling writing as iterative edit planning improves global structure and citation fidelity. This can cut human editing time and make automated drafting more controllable for long documents.

Summary TLDR

The paper turns scientific writing into an iterative planning problem over editable outlines. It trains a policy with two-stage optimization (backward outline reconstruction, then forward value-guided RL) to improve global structure, information coverage, and citation consistency. On a survey generation task, a new arXiv-derived benchmark (1,500 papers) and experiments with 200–300 step budgets show better structural coherence and citation reliability than several one-shot baselines.

Problem Statement

LLMs write locally fluent text but struggle to plan long documents, cover their input materials, and keep citations faithful. The authors reformulate writing as sequential edits on explicit outline states to enable better long-horizon credit assignment and more controllable optimization.

Main Contribution

A state-action formulation that models paper generation as iterative edits over a hierarchical outline (diff-based actions).

A two-stage training recipe: (1) backward outline reconstruction from partial plans; (2) forward value-guided RL with rewards for structure, factuality, and citation fidelity.

A new benchmarking pipeline built from 1,500 arXiv papers and metrics that measure planning, input coverage, citation faithfulness, and structure.
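
The state-action formulation in the first contribution can be pictured with a small sketch. This is not the authors' schema: the `Node` class, the `(op, path, payload)` edit tuple, and the three operations are hypothetical names chosen for illustration. It only shows what "diff-based actions over a hierarchical outline" could look like in practice.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One heading in a hierarchical outline (hypothetical schema)."""
    title: str
    children: List["Node"] = field(default_factory=list)


# An edit is a small diff against the outline: (operation, path, payload).
# A path is a tuple of child indices from the root, e.g. (1, 0).
Edit = Tuple[str, Tuple[int, ...], str]


def resolve(root: Node, path: Tuple[int, ...]) -> Node:
    """Follow child indices from the root to the addressed node."""
    node = root
    for i in path:
        node = node.children[i]
    return node


def apply_edit(root: Node, edit: Edit) -> None:
    """Apply one edit in place; the outline after the edit is the next state."""
    op, path, payload = edit
    if op == "add":          # append a new child under the node at `path`
        resolve(root, path).children.append(Node(payload))
    elif op == "rename":     # change the title of the node at `path`
        resolve(root, path).title = payload
    elif op == "delete":     # remove the node at `path`
        if not path:
            raise ValueError("the root node cannot be deleted")
        del resolve(root, path[:-1]).children[path[-1]]
    else:
        raise ValueError(f"unknown edit op: {op}")


def render(node: Node, depth: int = 0) -> List[str]:
    """Flatten the outline into indented lines for inspection."""
    lines = ["  " * depth + node.title]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


outline = Node("Survey of X")
trajectory: List[Edit] = [
    ("add", (), "Introduction"),
    ("add", (), "Methods"),
    ("add", (1,), "Retrieval"),
    ("rename", (1,), "Approaches"),
    ("add", (), "Scratch"),
    ("delete", (2,), ""),
]
for edit in trajectory:
    apply_edit(outline, edit)

print("\n".join(render(outline)))
```

Because every step is an explicit, small edit, the full trajectory can be logged, replayed, and scored step by step, which is what makes long-horizon credit assignment tractable.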

Key Findings

Fine-tuned models using the OutlineForge pipeline beat one-shot outline baselines on the evaluated survey generation task.

Numbers: Phi-3.8B F1=0.422 vs SurveyForge F1≈0.313 (200 steps)

Citation retrieval gains saturate early in the iterative process.

Numbers: Retrieved references plateau within 50–150 steps

A practical planning horizon for generating full articles is modest.

Numbers: Most articles decompose within 200–300 edit steps

Small models (about 2B parameters or fewer) show limited benefit from the proposed fine-tuning.

Numbers: Gemma2-2B and Qwen3-1.8B show smaller F1 improvements than the larger models

Results

F1 (survey generation, 200 steps)

Value: 0.422 ± 0.117 (Phi-3.8B, fine-tuned)

Baseline: 0.313 ± 0.129 (SurveyForge, reported)

F1 (survey generation, 200 steps)

Value: 0.352 ± 0.083 (GPT-4o-mini, ours)

Baseline: 0.285 ± 0.083 (SurveyForge, reported)

Citation retrieval saturation

Value: Plateau within 50–150 steps

Estimated editing steps to generate article

Value: 200–300 steps for most domains
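
To put the two F1 comparisons in context, the relative gain over the SurveyForge baseline can be worked out directly from the reported means. This is plain arithmetic on the figures above, not a number quoted from the paper, and the reported ± ranges overlap in both comparisons, so the gains are best read as indicative.

```latex
\[
\Delta_{\text{rel}} = \frac{F1_{\text{ours}} - F1_{\text{base}}}{F1_{\text{base}}}
\]
\[
\text{Phi-3.8B: } \frac{0.422 - 0.313}{0.313} \approx 0.348 \;(34.8\%),
\qquad
\text{GPT-4o-mini: } \frac{0.352 - 0.285}{0.285} \approx 0.235 \;(23.5\%)
\]
```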

What To Try In 7 Days

Prototype an outline-edit loop: represent documents as hierarchical outlines and apply simple structural edits.

Fine-tune a mid-size model (≈3–4B) on a small set of outline-to-edit examples to test structural gains.

Limit iterative runs to ~200 steps and focus retrieval early (first 50–150 steps) to save compute and get most citations fast.
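
The step budget and early-retrieval advice can be prototyped as a loop like the one below. It is a minimal sketch under stated assumptions: `propose_edit` and `retrieve` are hypothetical callables standing in for a model and a search backend, the outline is a flat list for brevity, and the defaults (200 steps, retrieval cutoff at 150) simply mirror the ranges quoted above.

```python
from typing import Callable, Dict, List, Set


def run_budgeted_loop(
    propose_edit: Callable[[List[str], Set[str]], str],
    retrieve: Callable[[List[str]], Set[str]],
    max_steps: int = 200,         # overall edit budget
    retrieval_cutoff: int = 150,  # stop calling retrieval after this step
) -> Dict[str, object]:
    """Run a fixed number of edit steps, retrieving references only early on."""
    outline: List[str] = []
    references: Set[str] = set()
    retrieval_calls = 0
    for step in range(max_steps):
        if step < retrieval_cutoff:
            references |= retrieve(outline)
            retrieval_calls += 1
        # Later steps reuse the references already collected.
        outline.append(propose_edit(outline, references))
    return {
        "outline": outline,
        "references": references,
        "retrieval_calls": retrieval_calls,
    }


# Toy stand-ins (hypothetical): retrieval stops finding new items after a while.
def toy_retrieve(outline: List[str]) -> Set[str]:
    return {f"ref-{min(len(outline), 40)}"}


def toy_propose(outline: List[str], references: Set[str]) -> str:
    return f"section-{len(outline)}"


result = run_budgeted_loop(toy_propose, toy_retrieve)
print(result["retrieval_calls"], len(result["references"]))
```

Logging how many new references each retrieval call adds would show where the plateau falls for a given corpus, which is a better basis for choosing the cutoff than the default used here.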

Agent Features

Memory

  • explicit outline state as short-term memory
  • edit trajectory history for preference learning

Planning

  • long-horizon planning (200–300 steps)
  • two-stage value-guided optimization

Tool Use

  • LLM invocation for semantic edits
  • deterministic structural edits (executed directly)

Frameworks

  • PPO-style RL
  • preference alignment from edit transcripts

Is Agentic

true

Architectures

  • hierarchical state-action outline

Collaboration

  • human editing trajectories used as preference data

Optimization Features

Model Optimization

  • fine-tuning on outline-edit trajectories

System Optimization

  • separate direct structural edits from LLM-generated content to save model calls
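
One way to read the system optimization above is as a dispatcher that executes structural edits in plain code and reserves model calls for edits that need new text. The sketch below is an assumption about how that split could look; the operation names, the edit dictionaries, and the `llm` callable are hypothetical and not taken from the paper.

```python
from typing import Callable, Dict, List


def apply_edits(
    sections: List[str],
    edits: List[Dict],
    llm: Callable[[str], str],
) -> int:
    """Apply edits in place and return how many model calls were made."""
    calls = 0
    for edit in edits:
        op = edit["op"]
        if op == "delete":      # structural: no model call
            del sections[edit["index"]]
        elif op == "swap":      # structural: no model call
            i, j = edit["i"], edit["j"]
            sections[i], sections[j] = sections[j], sections[i]
        elif op == "move":      # structural: no model call
            sections.insert(edit["to"], sections.pop(edit["from"]))
        elif op == "rewrite":   # semantic: needs generated text
            prompt = edit["instruction"] + ": " + sections[edit["index"]]
            sections[edit["index"]] = llm(prompt)
            calls += 1
        else:
            raise ValueError(f"unknown edit op: {op}")
    return calls


# A fake model so the example runs without any external service.
def fake_llm(prompt: str) -> str:
    return "[rewritten] " + prompt.split(": ", 1)[1]


sections = ["Intro", "Results", "Methods", "Draft notes"]
edits = [
    {"op": "swap", "i": 1, "j": 2},
    {"op": "delete", "index": 3},
    {"op": "rewrite", "index": 0, "instruction": "Tighten"},
]
model_calls = apply_edits(sections, edits, fake_llm)
print(sections, model_calls)
```

In this toy run three edits cost one model call; the saving grows with the share of purely structural edits in a trajectory.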

Training Optimization

  • backward outline reconstruction (structure consistency)
  • forward value-guided RL with multi-aspect rewards
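
The two training stages can be illustrated with toy versions. The summary does not describe how the authors build their training data or weight their rewards, so everything here is an assumption: `backward_pairs` shows one plausible way to derive supervision by walking backward from a finished outline, and `combined_reward` shows a simple weighted sum over hypothetical aspect scores.

```python
from typing import Dict, List, Tuple


def backward_pairs(final_outline: List[str]) -> List[Tuple[List[str], str]]:
    """Walk backward from a finished outline, removing the last heading each time.

    Each pair is (partial outline, heading that restores the next state),
    returned in order from the emptiest plan to the fullest.
    """
    pairs: List[Tuple[List[str], str]] = []
    partial = list(final_outline)
    while partial:
        target = partial.pop()
        pairs.append((list(partial), target))
    pairs.reverse()
    return pairs


def combined_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Collapse per-aspect scores into one scalar reward (weighted sum)."""
    return sum(weights[name] * scores[name] for name in weights)


pairs = backward_pairs(["Intro", "Background", "Methods"])
for partial, target in pairs:
    print(partial, "->", target)

# Hypothetical aspect names and weights, chosen only for illustration.
reward = combined_reward(
    scores={"structure": 1.0, "factuality": 0.5, "citation": 0.0},
    weights={"structure": 0.5, "factuality": 0.25, "citation": 0.25},
)
print(reward)
```

A real setup would likely remove nodes from a tree rather than a flat list and would tune the weights, but the shape of the data (partial plan in, next edit out) and of the reward (several aspects, one scalar) stays the same.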

Inference Optimization

  • step budgeting (200 or 300 steps)
  • early-focused retrieval to reduce later compute

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Automatic evaluation relies partly on LLM judges, which can introduce bias and mismatch with human experts.
  • Experiments focus on survey generation; other scholarly genres were not evaluated.
  • State-action schemas are manually designed and may not generalize to very different document formats.
  • Iterative editing can accumulate errors if early structural choices are wrong.

When Not To Use

  • For short, one-shot generation tasks where global structure is trivial.
  • When you must produce original experimental results rather than surveys or summaries.
  • If you have only very small models (about 2B parameters or fewer) with little domain knowledge to leverage.

Failure Modes

  • Early structural mistakes propagate and are costly to revise across many edit steps.
  • LLM-based automatic judges can mis-evaluate factuality or citation relevance.
  • Small models fail to learn useful editing strategies due to limited world knowledge.

Core Entities

Models

  • Gemma2-2B
  • Qwen3-1.8B
  • Phi-3.8B
  • GPT-4o-mini
  • Claude-3.5-Haiku
  • Llama-3.1-Instruct-70B

Metrics

  • Precision
  • Recall
  • F1
  • structural completeness
  • citation relevance
  • information density
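
For the first three metrics, a set-based computation is a reasonable starting point. The summary does not say which items the paper scores, so this sketch assumes precision, recall, and F1 are taken over the references a generated survey cites versus those of a human-written reference survey; the identifiers are placeholders.

```python
from typing import Set, Tuple


def prf1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1, returning 0.0 where undefined."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder reference identifiers, not real citations.
predicted_refs = {"a", "b", "c", "d"}
gold_refs = {"a", "b", "e"}
precision, recall, f1 = prf1(predicted_refs, gold_refs)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here two of four cited references match the gold set of three, giving precision 1/2, recall 2/3, and F1 4/7.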

Datasets

  • arXiv-derived outline-edit dataset (1,500 articles)

Benchmarks

  • survey generation benchmark (OutlineForge benchmark)