STIG: encode multi-stage introduction logic as stage tokens so an LLM writes an entire Introduction in one inference

December 28, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Meicong Zhang, Tiancheng Su, Guoxiu He

Why It Matters For Business

Convert multi-call agent pipelines into a single finetuned model with stage tokens to reduce API calls, cut token costs, and produce more structurally coherent section drafts for academic-style content.

Summary TLDR

The paper proposes STIG, a finetuning method that encodes a writing workflow as explicit stage tokens so that a single LLM call can produce an entire academic Introduction. The authors build a dataset from ~3,800 ACL papers (testing on 1,176 ACL 2025 papers), add eight stage token pairs (an outline stage and a content stage for each of four subsections), and finetune open models (Qwen2.5-7B, Llama3.1-8B). STIG yields better structural coherence and content coverage than multi-agent and outline-based pipelines, and uses tokens more efficiently (a claimed 3.3× effective generation rate vs. AutoSurvey). Code and dataset are provided as supplementary material.

Problem Statement

Agentic workflows split writing into many agent calls (outline → draft → integrate). This leads to long reasoning chains, error accumulation, high token use, and latency, and the resulting designs are fragile and need manual reconfiguration across domains.

Main Contribution

Introduce STIG: parametric stage tokens that encode multi-stage introduction logic into the model so a single inference outputs all stages.

Create an annotated training corpus from ~3,800 ACL papers (tested on 1,176 ACL 2025 papers) with eight stage labels (outline and content for each of four subsections).

Show STIG improves structure, semantic fidelity, content coverage and token efficiency versus agentic workflows and prompting baselines.
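To make the stage-token idea concrete, here is a minimal sketch of how one training target could be serialized. The subsection names and the `<name_stage>` token spelling are assumptions made for illustration; the summary only states that there are four subsections, each with an outline stage and a content stage.

```python
from itertools import product

# Hypothetical subsection names and token spelling (not taken from the paper).
SUBSECTIONS = ["background", "problem", "method", "contribution"]
STAGES = ["outline", "content"]  # outline always precedes content


def stage_tokens(subsection: str, stage: str) -> tuple[str, str]:
    """Return the (open, close) token pair for one stage."""
    name = f"{subsection}_{stage}"
    return f"<{name}>", f"</{name}>"


def all_stage_tokens() -> list[str]:
    """Flat list of every open and close token, in generation order."""
    tokens: list[str] = []
    for subsection, stage in product(SUBSECTIONS, STAGES):
        tokens.extend(stage_tokens(subsection, stage))
    return tokens


def build_target(annotated: dict[str, dict[str, str]]) -> str:
    """Serialize one annotated Introduction into a single training target.

    `annotated[subsection][stage]` holds the outline or content text.
    """
    parts = []
    for subsection, stage in product(SUBSECTIONS, STAGES):
        open_tok, close_tok = stage_tokens(subsection, stage)
        parts.append(f"{open_tok}{annotated[subsection][stage]}{close_tok}")
    return "\n".join(parts)


if __name__ == "__main__":
    example = {
        s: {"outline": f"- key point for {s}", "content": f"Paragraph about {s}."}
        for s in SUBSECTIONS
    }
    print(build_target(example))
```

Because the whole sequence is one target string, a model finetuned on such targets emits every outline and content stage in a single generation pass, which is the property the contribution above describes.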

Key Findings

STIG raises structural rationality (SR) substantially versus agentic baselines.

Numbers: SR 0.832 (STIG) vs 0.658 (AutoSurvey) on Qwen2.5-7B

STIG matches or improves semantic similarity and content coverage over baselines.

Numbers: SS 0.977, CC 0.442 (Qwen2.5-7B STIG) vs CC 0.333 (AutoSurvey)

STIG is more token-efficient than agentic workflows.

Numbers: Effectiveness rate 3.3× vs AutoSurvey; ~2× vs Stage Writing FT

Human evaluators prefer STIG outputs on average.

Numbers: Average human rank 2.14 for STIG (best among 7 methods) on 50 ACL papers

Results

Semantic Similarity (SS)

Value: 0.977 (Qwen2.5-7B STIG)

Baseline: 0.966 (AutoSurvey)

Structural Rationality (SR)

Value: 0.832 (Qwen2.5-7B STIG)

Baseline: 0.658 (AutoSurvey)

Content Coverage (CC)

Value: 0.442 (Qwen2.5-7B STIG)

Baseline: 0.333 (AutoSurvey)

Token efficiency (effective generation rate)

Value: 3.3× (STIG vs AutoSurvey)

Baseline: AutoSurvey

Human average rank (lower is better)

Value: 2.14 (STIG)

Baseline: 2.46 (FT w/o Stage Writing, the best baseline)

Narrative Quality (NQ, perplexity; lower is better)

Value: 24.810 (Qwen2.5-7B STIG)

Baseline: 18.084 (AutoSurvey)

Who Should Care

What To Try In 7 Days

Extract 100 domain examples of introductions and add 4 subsection outlines + content pairs.

Add a small set of custom stage tokens (outline/content per subsection) and instruction-finetune an existing 7B instruction model.

Compare single-call generation vs your current agent pipeline on structure (SR) and token consumption.
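For the token-consumption comparison, a simple accounting helper may be enough to start. The sketch below assumes an "effective rate" defined as tokens in the final text divided by all tokens processed across calls; this definition and the call counts in the usage example are illustrative assumptions, not the paper's metric or measurements.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Token usage of one model call (as reported by your API or tokenizer)."""
    prompt_tokens: int
    completion_tokens: int


def total_tokens(calls: list[Call]) -> int:
    """All tokens processed by a workflow, prompts and completions included."""
    return sum(c.prompt_tokens + c.completion_tokens for c in calls)


def effective_rate(final_text_tokens: int, calls: list[Call]) -> float:
    """Assumed definition: share of processed tokens that reach the final text."""
    return final_text_tokens / total_tokens(calls)


if __name__ == "__main__":
    # Made-up numbers purely to show the arithmetic.
    pipeline = [Call(1000, 200), Call(1200, 400)]  # e.g. outline call, then draft call
    single = [Call(1000, 500)]                     # one call: outlines plus content
    final_len = 400                                # tokens kept in the final text

    ratio = effective_rate(final_len, single) / effective_rate(final_len, pipeline)
    print(f"single-call is {ratio:.2f}x more token-efficient in this toy case")
```

Logging real `Call` records from both setups on the same inputs gives a like-for-like comparison alongside the structure metric.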

Agent Features

Memory

  • Context length of up to 8K tokens during training and inference

Planning

  • Stage-guided generation (fixed ordered stages)
  • Outline→content sequencing encoded in tokens

Tool Use

  • LoRA
  • LLaMA-Factory (finetuning)
  • MinerU (parsing PDFs)
  • ZeRO3 (memory optimization)

Frameworks

  • LLaMA-Factory
  • ZeRO3
  • LoRA
  • MinerU

Architectures

  • Instruction-tuned LLM backbone (decoder-only transformer)
  • Parametric stage tokens added to tokenizer

Collaboration

  • Replaces multi-agent orchestration with a single model; no external agent calls

Optimization Features

Token Efficiency

  • Claimed 3.3× effective generation rate vs AutoSurvey
  • Claimed ~2× efficiency over Stage Writing FT

Infra Optimization

  • ZeRO3 for distributed finetuning
  • A800 GPUs

Model Optimization

  • LoRA

System Optimization

  • Increases context window to 8K tokens

Training Optimization

  • Instruction finetuning with special stage tokens
  • ZeRO3 to optimize GPU memory
  • 8× A800 GPUs used

Inference Optimization

  • Single-inference end-to-end generation to avoid multiple API calls
  • Parses stage tokens after generation and strips the outline segments from the final text
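The post-processing step in the last bullet could look roughly like the following. This is a sketch under assumptions: the `<name_outline>` and `<name_content>` token spelling is hypothetical, and the paper's actual parsing code may differ.

```python
import re

# Matches one stage segment, e.g. <method_content>...</method_content>.
# The token format is an assumption made for this example.
STAGE_PATTERN = re.compile(
    r"<(?P<name>\w+)_(?P<stage>outline|content)>"
    r"(?P<body>.*?)"
    r"</(?P=name)_(?P=stage)>",
    re.DOTALL,
)


def parse_stages(generated: str) -> list[tuple[str, str, str]]:
    """Return (subsection, stage, text) triples in generation order."""
    return [
        (m["name"], m["stage"], m["body"].strip())
        for m in STAGE_PATTERN.finditer(generated)
    ]


def final_introduction(generated: str) -> str:
    """Keep only the content stages; outlines are dropped from the final text."""
    paragraphs = [
        text for _, stage, text in parse_stages(generated) if stage == "content"
    ]
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    raw = (
        "<background_outline>- prior work</background_outline>\n"
        "<background_content>Prior work has ...</background_content>\n"
        "<method_outline>- our approach</method_outline>\n"
        "<method_content>We propose ...</method_content>"
    )
    print(final_introduction(raw))
```

Keeping the parsed triples around is also useful for debugging, since a missing or unclosed stage shows up as a shorter list than expected.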

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Training uses ACL-domain introductions; may not generalize to other academic fields or non-conference formats.
  • Relies on high-quality annotations (outline↔content alignment); noisy labels will degrade stage learning.
  • Requires GPU resources and 8K context support for best results.

When Not To Use

  • If you lack annotated intro-outline pairs in your domain.
  • When factual accuracy is critical and hallucination risk cannot be tolerated without heavy human verification.
  • For short non-structured blurbs where multi-stage structure is unnecessary.

Failure Modes

  • Model may hallucinate results or baselines when training data contains such fabrications.
  • Stage tokens could be misused, producing wrong subsection assignments if annotation patterns differ in new domains.
  • Higher perplexity in some cases indicates fluency trade-offs despite better structure.

Core Entities

Models

  • Qwen2.5-7B-Instruct
  • Llama3.1-8B-Instruct
  • GPT-4o
  • Qwen2.5-32B-Instruct

Metrics

  • BERTScore
  • Structural Rationality (sentence-level misclassification rate)
  • Content Coverage (SBERT-weighted sentence similarity)
  • Perplexity (GPT-2 PPL)
  • Quotation Constraint (QC)
  • Human ranking
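As a rough illustration of a sentence-level structure score, the sketch below assumes SR is the fraction of generated sentences whose predicted subsection label (from some classifier, not shown) matches the subsection they were actually placed in, i.e. one minus the misclassification rate. The exact definition used in the paper is not given here, so treat this as an assumed reading.

```python
def structural_rationality(placed: list[str], predicted: list[str]) -> float:
    """Fraction of sentences whose predicted subsection matches their placement.

    placed:    subsection each sentence appears under in the generated text
    predicted: subsection assigned to each sentence by an external classifier
    """
    if len(placed) != len(predicted):
        raise ValueError("placed and predicted must have the same length")
    if not placed:
        return 0.0
    agree = sum(p == q for p, q in zip(placed, predicted))
    return agree / len(placed)


if __name__ == "__main__":
    # Hypothetical labels for four sentences; one sentence is misplaced.
    placed = ["background", "background", "method", "method"]
    predicted = ["background", "method", "method", "method"]
    print(structural_rationality(placed, predicted))
```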

Datasets

  • ACL papers 2021-2025 (training corpus ~3,800)
  • ACL 2025 Main Conference test set (1,176 papers)
  • CVPR subset (102 papers) for generalization

Benchmarks

  • Structural Rationality (SR)
  • Content Coverage (CC)
  • BERTScore (SS)
  • Perplexity (Narrative Quality NQ)

Context Entities

Models

  • AutoSurvey
  • GPT (baseline prompts)
  • Outline Writing pipeline

Datasets

  • ACL Anthology (source for parsing)
  • Semantic Scholar API (for baseline abstracts)