NEXUSSUM: a three-agent LLM pipeline that converts dialogue, chunks scenes, and iteratively compresses to summarize books, movies, and TV

May 30, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Hyuntak Kim, Byung-Hak Kim

Why It Matters For Business

NEXUSSUM lets teams produce accurate, length-controlled summaries of very long narratives without model finetuning, improving semantic quality and factuality for products like content discovery, synopsis generation, and archival indexing.

Summary TLDR

NEXUSSUM is a three-stage multi-LLM system for summarizing very long narratives without fine-tuning. It first converts dialogue into third-person prose, then builds scene-based hierarchical summaries, and finally applies iterative compression to meet length targets. On four public narrative benchmarks (BookSum, MovieSum, MENSA, SummScreenFD) it improves semantic quality as measured by BERTScore (F1), with a reported gain of up to +30% over the prior state of the art on BookSum. It also gives precise length control (Length Adherence Rate of about 0.99 versus about 0.40 for zero-shot prompting at a 900-word target) and improves factuality through an agentic refinement step.

Problem Statement

Long narratives combine prose and multi-speaker dialogue across tens of thousands of tokens. Standard LLMs lose context, extractive pipelines drop plot coherence, and zero-shot prompts fail to control length. The paper targets accurate, coherent summaries at controlled lengths without model finetuning.

Main Contribution

Dialogue-to-Description Preprocessor: converts multi-speaker dialogue into unified third-person prose to reduce fragmentation.

Hierarchical Multi-LLM Pipeline: three agents (Preprocessor P, Summarizer S, Compressor C) that chunk, summarize, and iteratively compress long narratives.

Dynamic length control: iterative compression with a lower bound target enforces word-count constraints while preserving key facts.

Empirical tuning recipes: guidance on chunk size (δ) and lower-bound (θ) per dataset to trade off detail and compression.

State-of-the-art results: large gains in automatic metrics and human-preference analyses across four long-form benchmarks.
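The three-agent flow described above can be sketched as a chunk-and-concat loop. This is a minimal illustration, not the authors' implementation: the `llm` callable, the prompt strings, and the helper names are hypothetical, the chunker uses fixed word windows instead of scene boundaries, and the Compressor is shown as a single pass.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def chunk_words(text: str, delta: int) -> List[str]:
    """Split text into consecutive chunks of at most `delta` words."""
    words = text.split()
    return [" ".join(words[i:i + delta]) for i in range(0, len(words), delta)]


def run_stage(text: str, instruction: str, delta: int, llm: LLM) -> str:
    """Apply one agent to every chunk, then concatenate the outputs."""
    outputs = [llm(f"{instruction}\n\n{chunk}") for chunk in chunk_words(text, delta)]
    return "\n".join(outputs)


def summarize(narrative: str, llm: LLM, delta: int = 300) -> str:
    """Run Preprocessor (P), Summarizer (S), and one Compressor (C) pass in order."""
    # P: turn multi-speaker dialogue into third-person prose (prompt is illustrative).
    prose = run_stage(narrative, "Rewrite the dialogue as third-person prose.", delta, llm)
    # S: summarize each chunk of the converted prose.
    summary = run_stage(prose, "Summarize this passage.", delta, llm)
    # C: a single compression pass; the paper iterates this step.
    return run_stage(summary, "Compress this summary, keeping key facts.", delta, llm)
```

Passing the model as a plain callable keeps the sketch independent of any particular inference backend, so the same code can be exercised with a stub function before wiring in a real LLM.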

Key Findings

NEXUSSUM achieves large semantic gains on long narratives, especially books.

Numbers: BookSum: +30.0% BERTScore (F1) vs CachED

Each agent stage adds measurable improvements; full pipeline outperforms partial variants on screenplay data.

Numbers: MENSA BERTScore (F1): zero-shot 54.81 → P+S+C 65.73 (+10.92)

NEXUSSUM provides much tighter length control than zero-shot prompting.

Numbers: target 900 words: LAR ours 0.99 vs zero-shot 0.40; BERTScore (F1) 65.73 vs 58.18

Chunk size strongly controls compression rate and quality.

Numbers: P+S output of 9675 words → C1 (δ=500) = 4069 words (57.94% compression) vs C1 (δ=5000) = 909 words (90.59% compression)
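The percentages above appear to be the share of words removed by the first compression pass, that is, 1 minus output words over input words. That reading is an assumption, since the formula is not stated here. Under it, the δ=500 figure reproduces exactly and the δ=5000 figure comes out at roughly 90.6, within rounding of the reported value.

```python
def compression_rate(words_in: int, words_out: int) -> float:
    """Percentage of words removed by a compression pass (assumed definition)."""
    return 100.0 * (1.0 - words_out / words_in)


# Word counts taken from the figures reported above.
print(f"delta=500:  {compression_rate(9675, 4069):.2f}%")
print(f"delta=5000: {compression_rate(9675, 909):.2f}%")
```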

Factuality is high and improves with agent refinements.

Numbers: NarrativeFactScore: NEXUSSUM 90.16 → 96.83 with refinements

Results

BERTScore (F1) - BookSum

Value: +30.0% over CachED

Baseline: CachED (BART Large, 406M)

BERTScore (F1) - MovieSum

Value: 63.53

Baseline: HM-SR / prior SOTA

BERTScore (F1) - MENSA

Value: 65.73

Baseline: Zero-shot 54.81

Length Adherence Rate (LAR)

Value: 0.99 (target 900 words)

Baseline: Zero-shot 0.40

NarrativeFactScore

Value: 90.16 (base) → 96.83 (with refinements)

Baseline: Hierarchically Merging + refinements

Who Should Care

What To Try In 7 Days

Run the dialogue-to-description preprocessor on a small corpus of scripts or transcripts and compare coherence vs raw input.

Prototype a 3-stage pipeline (P→S→C) using an off-the-shelf LLM and vLLM inference to test length control on one dataset.

Tune chunk size δ (start at 300 words for books/scripts) and target lower bound θ to match your desired summary length.

Agent Features

Memory

  • scene-based chunking (short-term during pipeline)
  • sentence-level chunking in compression

Planning

  • iterative compression (length-driven)
  • Chain-of-Thought self-planning for preprocessing

Tool Use

  • vLLM (inference optimization)
  • Mistral-Large-Instruct-2407
  • Claude 3 Haiku
  • GPT-4o

Frameworks

  • NEXUSSUM

Is Agentic

true

Architectures

  • hierarchical multi-agent pipeline
  • chunk-and-concat processing

Collaboration

  • sequential agents: Preprocessor (P) → Summarizer (S) → Compressor (C)
  • optional refinement/rewrite agent (NEXUSSUM_R) for readability

Optimization Features

Token Efficiency

  • chunking to limit per-call context size
  • iterative compression to reduce final token output

Infra Optimization

  • authors run on four A100 GPUs

System Optimization

  • dynamic chunk sizing per dataset
  • iteration cap (max 10) to bound cost
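A length-driven compression loop with an iteration cap might look like the sketch below. The stopping rule is an assumption: the text above only states that compression is iterative, uses a lower-bound target θ, and is capped at 10 iterations. Here a pass is rejected if it would drop below θ words or fails to shorten the summary. The names `compress_to_length` and `compress_once` are hypothetical.

```python
from typing import Callable


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def compress_to_length(summary: str, compress_once: Callable[[str], str],
                       theta: int, max_iters: int = 10) -> str:
    """Repeat compression passes while the result stays at or above `theta` words."""
    current = summary
    for _ in range(max_iters):  # cap bounds the number of LLM calls
        candidate = compress_once(current)
        n = word_count(candidate)
        # Assumed rule: stop if the pass undershoots theta or makes no progress.
        if n < theta or n >= word_count(current):
            break
        current = candidate
    return current
```

With a stub that halves the text on each pass, a 100-word input and `theta=20` stops at 25 words, because the next pass would fall to 12 words and undershoot the bound.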

Training Optimization

  • no fine-tuning required; prompt engineering and few-shot used

Inference Optimization

  • use vLLM for efficient batched inference
  • temperature=0.3, top-p=1.0 (authors' config)

Reproducibility

Data Urls

  • BookSum (public)
  • MovieSum (public)
  • MENSA (public components cited)
  • SummScreenFD (public)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Readability gap: experts rated zero-shot outputs as more readable despite lower factuality.
  • Compute and cost: experiments use four A100 GPUs and large LLMs (Mistral-Large-Instruct-2407, 123B).
  • Sensitivity to hyperparameters: chunk size (δ) and lower bound (θ) materially affect quality.
  • Automated metrics imperfect: high BERTScore/ROUGE do not guarantee human-preferred fluency.

When Not To Use

  • When short, highly fluent summaries are the top priority (zero-shot may be preferred).
  • When GPU budget or latency constraints prohibit multi-stage LLM calls.
  • When you cannot define clear target lengths or lower bounds for summaries.

Failure Modes

  • Over-compression reduces readability and removes contextual background.
  • Wrong chunk sizing leads to loss of long-range dependencies or excessive verbosity.
  • Rigid preprocessor phrasing can produce dense, less natural prose without a rewrite step.
  • Excessive iterative compression may induce omissions or subtle hallucinations without factual refinement.

Core Entities

Models

  • Mistral-Large-Instruct-2407 (123B)
  • Claude 3 Haiku
  • GPT-4o
  • GPT-4o-mini

Metrics

  • BERTScore (F1)
  • ROUGE (1/2/L)
  • Length Adherence Rate (LAR)
  • NarrativeFactScore

Datasets

  • BookSum
  • MovieSum
  • MENSA
  • SummScreenFD

Benchmarks

  • BookSum
  • MovieSum
  • MENSA
  • SummScreenFD