Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Convert multi-call agent pipelines into a single finetuned model with stage tokens to reduce API calls, cut token costs, and produce more structurally coherent section drafts for academic-style content.
Summary TLDR
The paper proposes STIG, a simple finetuning method that encodes a writing workflow as explicit stage tokens so that a single LLM call can produce an entire academic Introduction. The authors build a dataset from ~3,800 ACL papers (with a test set of 1,176 ACL 2025 papers), add eight stage token pairs (outline + content for each of four subsections), and finetune open models (Qwen2.5-7B, Llama3.1-8B). STIG yields better structural coherence and content coverage than multi-agent/outline pipelines and uses tokens more efficiently (a claimed 3.3× effective generation rate vs. AutoSurvey). Code and dataset are provided in the supplements.
Problem Statement
Agentic workflows split writing into many agent calls (outline → draft → integrate). This leads to long reasoning chains, error accumulation, high token use, and latency, and the resulting designs are fragile and need manual reconfiguration across domains.
Main Contribution
Introduce STIG: parametric stage tokens that encode multi-stage introduction logic into the model so a single inference outputs all stages.
Create an annotated training corpus from ~3,800 ACL papers (test set: 1,176 ACL 2025 papers) with eight stage labels (outline + content for each of four subsections).
Show STIG improves structure, semantic fidelity, content coverage and token efficiency versus agentic workflows and prompting baselines.
Key Findings
STIG raises structural rationality (SR) substantially versus agentic baselines.
STIG matches or improves semantic similarity and content coverage over baselines.
STIG is more token-efficient than agentic workflows.
Human evaluators prefer STIG outputs on average.
Results
Semantic Similarity (SS)
Structural Rationality (SR)
Content Coverage (CC)
Token-efficiency (effective generation rate)
Human average rank (lower is better)
Narrative Quality (Perplexity NQ)
Who Should Care
What To Try In 7 Days
Collect 100 introductions from your domain and annotate each with outline + content pairs for its four subsections.
Add a small set of custom stage tokens (outline/content per subsection) and instruction-finetune an existing 7B instruction model.
Compare single-call generation vs your current agent pipeline on structure (SR) and token consumption.
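The first two steps above amount to turning each annotated introduction into one instruction-tuning record. The following is a minimal sketch, assuming an Alpaca-style record with `instruction`, `input`, and `output` keys (a format LLaMA-Factory accepts). The tag spelling (`<outline_1>`, `<content_1>`, ...), the instruction text, and the `paper_context` input are all hypothetical; the source does not specify them.

```python
import json

# Hypothetical instruction text; the source does not give the actual prompt.
INSTRUCTION = "Write the Introduction section for the paper described below."


def to_record(paper_context: str, subsections: list[tuple[str, str]]) -> dict:
    """Build one Alpaca-style record from four (outline, content) pairs, in order."""
    if len(subsections) != 4:
        raise ValueError("expected exactly four (outline, content) pairs")
    parts = []
    for idx, (outline, content) in enumerate(subsections, start=1):
        # Hypothetical tag spelling: one outline stage, then one content stage.
        parts.append(f"<outline_{idx}>{outline.strip()}</outline_{idx}>")
        parts.append(f"<content_{idx}>{content.strip()}</content_{idx}>")
    return {
        "instruction": INSTRUCTION,
        "input": paper_context.strip(),
        "output": "\n".join(parts),
    }


def write_dataset(records: list[dict], path: str) -> None:
    """Write all records as a single JSON array."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
```

Calling `to_record` once per annotated introduction and passing the resulting list to `write_dataset` gives a file that can be registered as a custom dataset for finetuning.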
Agent Features
Memory
- Max context length up to 8K tokens during training/inference
Planning
- Stage-guided generation (fixed ordered stages)
- Outline→content sequencing encoded in tokens
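Because the stages are fixed and ordered, a generated draft can be checked cheaply for stage order before any further processing. This is a minimal sketch, not the authors' implementation; the tag spelling (`<outline_1>`, `<content_1>`, ...) is a hypothetical choice, and only the count of four subsections with outline + content each comes from the source.

```python
import re

NUM_SUBSECTIONS = 4  # from the source: four subsections, each with outline + content

# Matches opening stage tags only; closing tags start with "</" and are skipped.
OPEN_TAG_RE = re.compile(r"<((?:outline|content)_\d+)>")


def expected_order() -> list[str]:
    """Return stage names in the fixed order: outline_1, content_1, outline_2, ..."""
    order = []
    for idx in range(1, NUM_SUBSECTIONS + 1):
        order.append(f"outline_{idx}")
        order.append(f"content_{idx}")
    return order


def follows_stage_order(generated: str) -> bool:
    """True if the opening stage tags appear exactly once each, in the expected order."""
    return OPEN_TAG_RE.findall(generated) == expected_order()
```

A draft that fails this check can be regenerated or flagged for review rather than parsed.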
Tool Use
- LoRA
- LLaMA-Factory (finetuning)
- MinerU (parsing PDFs)
- ZeRO3 (memory optimization)
Frameworks
- LLaMA-Factory
- ZeRO3
- LoRA
- MinerU
Architectures
- Instruction-tuned LLM backbone (decoder-only transformer)
- Parametric stage tokens added to tokenizer
Collaboration
- Replaces multi-agent orchestration with a single model; no external agent calls
Optimization Features
Token Efficiency
- Claimed 3.3× effective generation rate vs AutoSurvey
- Claimed ~2× efficiency over Stage Writing FT
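The exact definition of effective generation rate is not given here. One plausible reading, used as an assumption in the sketch below, is the share of all generated tokens that survive into the final text. The function name and the numbers in the usage note are illustrative only and are not results from the paper.

```python
def effective_generation_rate(final_tokens: int, generated_tokens_per_call: list[int]) -> float:
    """Assumed definition: tokens kept in the final text / tokens generated across all calls."""
    total = sum(generated_tokens_per_call)
    if total <= 0:
        raise ValueError("total generated tokens must be positive")
    return final_tokens / total
```

For example, a single call that generates 1,000 tokens of which 600 are kept gives 600 / 1,000 = 0.6, while a three-call pipeline that generates 400 + 900 + 700 = 2,000 tokens to reach the same 600 final tokens gives 0.3. Under this assumed definition, the single call is twice as efficient.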
Infra Optimization
- ZeRO3 for distributed finetuning
- A800 GPUs
Model Optimization
- LoRA
System Optimization
- Increases context window to 8K tokens
Training Optimization
- Instruction finetuning with special stage tokens
- ZeRO3 to optimize GPU memory
- 8× A800 GPUs used
Inference Optimization
- Single-inference end-to-end generation to avoid multiple API calls
- Parses stage tokens after generation and removes the outline segments, keeping only the content
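The post-processing step can be sketched as a small parser over the raw generation. This assumes a hypothetical tag spelling in which each stage is wrapped as `<outline_N>...</outline_N>` or `<content_N>...</content_N>`; the actual token strings used by the authors are not given here.

```python
import re

# Hypothetical tag spelling; \1 and \2 force the closing tag to match the opening one.
STAGE_RE = re.compile(r"<(outline|content)_(\d+)>(.*?)</\1_\2>", re.DOTALL)


def parse_stages(generated: str) -> list[dict]:
    """Split a raw generation into its tagged stages, in order of appearance."""
    return [
        {"kind": m.group(1), "index": int(m.group(2)), "text": m.group(3).strip()}
        for m in STAGE_RE.finditer(generated)
    ]


def extract_introduction(generated: str) -> str:
    """Drop outline stages and join content stages by subsection index."""
    contents = [s for s in parse_stages(generated) if s["kind"] == "content"]
    contents.sort(key=lambda s: s["index"])
    return "\n\n".join(s["text"] for s in contents)
```

Text outside any stage tag is ignored by this sketch, so a malformed generation yields a shorter result rather than an error.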
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Training uses ACL-domain introductions; may not generalize to other academic fields or non-conference formats.
- Relies on high-quality annotations (outline↔content alignment); noisy labels will degrade stage learning.
- Requires GPU resources and 8K context support for best results.
When Not To Use
- If you lack annotated intro-outline pairs in your domain.
- When factual accuracy is critical and hallucination risk cannot be tolerated without heavy human verification.
- For short non-structured blurbs where multi-stage structure is unnecessary.
Failure Modes
- Model may hallucinate results or baselines when training data contains such fabrications.
- Stage tokens could be misapplied, assigning text to the wrong subsection, if annotation patterns differ in new domains.
- Higher perplexity in some cases indicates fluency trade-offs despite better structure.
Core Entities
Models
- Qwen2.5-7B-Instruct
- Llama3.1-8B-Instruct
- GPT-4o
- Qwen2.5-32B-Instruct
Metrics
- BERTScore
- Structural Rationality (sentence-level misclassification rate)
- Content Coverage (SBERT-weighted sentence similarity)
- Perplexity (GPT-2 PPL)
- Quotation Constraint (QC)
- Human ranking
Datasets
- ACL papers 2021-2025 (training corpus ~3,800)
- ACL 2025 Main Conference test set (1,176 papers)
- CVPR subset (102 papers) for generalization
Benchmarks
- Structural Rationality (SR)
- Content Coverage (CC)
- BERTScore (SS)
- Perplexity (Narrative Quality NQ)
Context Entities
Models
- AutoSurvey
- GPT (baseline prompts)
- Outline Writing pipeline
Datasets
- ACL Anthology (source for parsing)
- Semantic Scholar API (for baseline abstracts)

