NEXUSSUM: a three-agent LLM pipeline that converts dialogue, chunks scenes, and iteratively compresses to summarize books, movies, and TV

May 30, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is practical: it needs no finetuning, runs on off-the-shelf LLMs, and shows consistent metric gains and human-eval wins, but it requires GPU resources and prompt tuning to close readability gaps.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Hyuntak Kim, Byung-Hak Kim

Links

Abstract / PDF / Data

Why It Matters For Business

NEXUSSUM lets teams produce accurate, length-controlled summaries of very long narratives without model finetuning, improving semantic quality and factuality for products like content discovery, synopsis generation, and archival indexing.

Who Should Care

Summary TLDR

NEXUSSUM is a three-stage multi-LLM system for summarizing very long narratives without fine-tuning. It first converts dialogue into third-person prose, then builds scene-based hierarchical summaries, and finally applies iterative compression to meet length targets. On public narrative benchmarks (BookSum, MovieSum, MENSA, SummScreenFD) it raises semantic quality as measured by BERTScore (F1), with a reported gain of up to +30% over the prior state of the art on BookSum. It also gives precise length control (Length Adherence Rate of about 0.99 versus about 0.4 for zero-shot prompting at a 900-word target) and improves factuality with an agentic refinement step.

Problem Statement

Long narratives combine prose and multi-speaker dialogue across tens of thousands of tokens. Standard LLMs lose context, extractive pipelines drop plot coherence, and zero-shot prompts fail to control length. The paper targets accurate, coherent summaries at controlled lengths without model finetuning.

Main Contribution

Dialogue-to-Description Preprocessor: converts multi-speaker dialogue into unified third-person prose to reduce fragmentation.

Hierarchical Multi-LLM Pipeline: three agents (Preprocessor P, Summarizer S, Compressor C) that chunk, summarize, and iteratively compress long narratives.
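The three-agent flow described above can be sketched as three chunk-and-concat passes over the text. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the instruction strings, the word-based chunker (the paper chunks by scene), and the single-pass compressor are all assumptions made for the example, and `fake_llm` is a stand-in that simply halves its input so the sketch runs without a model.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to generated text.
LLM = Callable[[str], str]


def chunk_words(text: str, size: int) -> List[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def run_stage(llm: LLM, instruction: str, text: str, chunk_size: int) -> str:
    """Chunk-and-concat: apply one agent to every chunk, then join the outputs."""
    outputs = [llm(f"{instruction}\n\n{chunk}") for chunk in chunk_words(text, chunk_size)]
    return "\n".join(outputs)


def summarize_narrative(llm: LLM, narrative: str, chunk_size: int = 300) -> str:
    """Run Preprocessor (P), Summarizer (S), then Compressor (C) in sequence."""
    # P: rewrite dialogue as unified third-person prose (illustrative prompt).
    described = run_stage(llm, "Rewrite all dialogue as third-person prose.", narrative, chunk_size)
    # S: summarize each chunk of the preprocessed text (illustrative prompt).
    summary = run_stage(llm, "Summarize this scene.", described, chunk_size)
    # C: one compression pass; the paper iterates this step toward a length target.
    return run_stage(llm, "Compress this summary, keeping key plot events.", summary, chunk_size)


def fake_llm(prompt: str) -> str:
    """Stand-in model for testing: returns the first half of the chunk's words."""
    body = prompt.split("\n\n", 1)[1]
    words = body.split()
    return " ".join(words[: max(1, len(words) // 2)])


narrative = " ".join(f"w{i}" for i in range(1200))
result = summarize_narrative(fake_llm, narrative, chunk_size=300)
print(len(result.split()))  # 150 with the halving stand-in
```

Swapping `fake_llm` for a real model client only requires a function with the same prompt-in, text-out shape, which keeps the stage logic independent of the serving stack.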

Key Findings

NEXUSSUM achieves large semantic gains on long narratives, especially books.

Numbers: +30.0% BERTScore (F1) vs CachED on BookSum

Practical Use: Use the NEXUSSUM pipeline when summarizing full novels to substantially improve semantic similarity to human summaries on evaluated benchmarks.

Evidence Ref: Section 5.1; Figure 2; Table 2

Each agent stage adds measurable improvements; full pipeline outperforms partial variants on screenplay data.

Numbers: MENSA BERTScore rises from 54.81 (zero-shot) to 65.73 (P+S+C), a gain of +10.92

Practical Use: Keep the three-stage setup (preprocess → summarize → compress) rather than skipping steps if you care about accuracy on scripts.

Evidence Ref: Table 3 (Ablation)
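The findings above mix two ways of reporting a gain: an absolute difference in BERTScore points (+10.92 on MENSA) and a percentage (+30.0% on BookSum). The short sketch below shows both conversions using only the MENSA numbers quoted here. The function names are illustrative, and the source does not state which convention its percentage figures follow, so this is an aid for reading the numbers rather than a reconstruction of the paper's calculation.

```python
def absolute_delta(baseline: float, value: float) -> float:
    """Gain in raw metric points, rounded to two decimals."""
    return round(value - baseline, 2)


def relative_delta_pct(baseline: float, value: float) -> float:
    """Gain as a percentage of the baseline, rounded to one decimal."""
    return round(100.0 * (value - baseline) / baseline, 1)


zero_shot, full_pipeline = 54.81, 65.73  # MENSA BERTScore values quoted above

print(absolute_delta(zero_shot, full_pipeline))      # 10.92 points
print(relative_delta_pct(zero_shot, full_pipeline))  # 19.9 percent of baseline
```

When comparing results across datasets, it helps to convert every reported gain to the same convention first.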

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
BERTScore (F1) - BookSum | +30.0% over CachED | CachED (BART Large, 406M) | +30.0% | BookSum | Section 5.1; Table 2 | Table 2
BERTScore (F1) - MovieSum | 63.53 | HM-SR / prior SOTA | +7.1% vs HM-SR | MovieSum | Section 5.1; Table 2 | Table 2

What To Try In 7 Days

Run the dialogue-to-description preprocessor on a small corpus of scripts or transcripts and compare coherence vs raw input.

Prototype a 3-stage pipeline (P→S→C) using an off-the-shelf LLM and vLLM inference to test length control on one dataset.

Tune chunk size δ (start at 300 words for books/scripts) and target lower bound θ to match your desired summary length.
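Before tuning δ against a real model, it can help to estimate how many chunks (and therefore LLM calls per stage) each candidate value produces. The sketch below is a planning aid only: `plan_chunks` is a hypothetical helper, it assumes fixed-size word chunks rather than the scene-based chunking the paper describes, and the 100,000-word document length is an invented example.

```python
import math
from typing import Tuple


def plan_chunks(total_words: int, delta: int) -> Tuple[int, int]:
    """Return (number of chunks, words in the final chunk) for chunk size delta."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    n_chunks = math.ceil(total_words / delta)
    last_chunk = total_words - (n_chunks - 1) * delta if n_chunks else 0
    return n_chunks, last_chunk


# Compare candidate chunk sizes for a hypothetical 100,000-word book.
for delta in (150, 300, 600):
    n_chunks, last_chunk = plan_chunks(100_000, delta)
    print(f"delta={delta}: {n_chunks} chunks, last chunk has {last_chunk} words")
```

Smaller δ means more calls per stage and less context per call; larger δ reduces call count but pushes more text into each prompt.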

Agent Features

Memory
scene-based chunking (short-term during pipeline)
sentence-level chunking in compression
Planning
iterative compression (length-driven)
Chain-of-Thought self-planning for preprocessing
Tool Use
vLLM (inference optimization)
Mistral-Large-Instruct-2407
Claude 3 Haiku
GPT-4o
Frameworks
NEXUSSUM
Is Agentic

Yes

Architectures
hierarchical multi-agent pipeline
chunk-and-concat processing
Collaboration
sequential agents: Preprocessor (P) → Summarizer (S) → Compressor (C)
optional refinement/rewrite agent (NEXUSSUMR) for readability

Optimization Features

Token Efficiency
chunking to limit per-call context size
iterative compression to reduce final token output
Infra Optimization
authors run on four A100 GPUs
System Optimization
dynamic chunk sizing per dataset
iteration cap (max 10) to bound cost
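A length-driven compression loop with an iteration cap might look like the sketch below. It assumes the loop stops once the summary is at or below a word budget θ, which is one reading of the "target lower bound" mentioned on this page rather than a confirmed detail of the paper. `compress_until` and `drop_tenth` are hypothetical names; `drop_tenth` stands in for an LLM compression call so the example runs offline.

```python
from typing import Callable, Tuple


def compress_until(
    summary: str,
    compress: Callable[[str], str],
    theta: int,
    max_iters: int = 10,
) -> Tuple[str, int]:
    """Compress repeatedly until the word count is <= theta or the cap is hit."""
    iterations = 0
    while len(summary.split()) > theta and iterations < max_iters:
        shorter = compress(summary)
        if len(shorter.split()) >= len(summary.split()):
            break  # no progress, so stop instead of spending more calls
        summary = shorter
        iterations += 1
    return summary, iterations


def drop_tenth(text: str) -> str:
    """Stand-in compressor: keeps the first 90% of the words."""
    words = text.split()
    return " ".join(words[: len(words) * 9 // 10])


draft = " ".join(f"w{i}" for i in range(1000))
final, used = compress_until(draft, drop_tenth, theta=500)
print(len(final.split()), used)  # 477 words after 7 iterations
```

The cap bounds worst-case cost: with an unreachable target the loop still stops after `max_iters` calls, and the no-progress check avoids wasting calls when the compressor stops shortening the text.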
Training Optimization
no fine-tuning required; prompt engineering and few-shot used
Inference Optimization
use vLLM for efficient batched inference
temperature=0.3, top-p=1.0 (authors' config)
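The inference settings listed above could be wired into vLLM roughly as follows. This is a sketch under assumptions: the Hugging Face model ID, `tensor_parallel_size=4` (chosen to match the four A100 GPUs mentioned on this page), and `max_tokens=1024` are illustrative choices not confirmed by the source. Only the temperature and top-p values come from the authors' reported configuration. The vLLM import is deferred so the configuration can be inspected on a machine without vLLM installed.

```python
from typing import Any, List, Tuple

# Sampling values reported as the authors' configuration.
SAMPLING_CONFIG = {"temperature": 0.3, "top_p": 1.0}


def build_engine(
    model: str = "mistralai/Mistral-Large-Instruct-2407",  # assumed model ID
    num_gpus: int = 4,  # assumption: one tensor-parallel shard per GPU
    max_tokens: int = 1024,  # assumption: not stated in the source
) -> Tuple[Any, Any]:
    """Create a vLLM engine and sampling parameters (requires vLLM and GPUs)."""
    from vllm import LLM, SamplingParams  # imported lazily on purpose

    llm = LLM(model=model, tensor_parallel_size=num_gpus)
    params = SamplingParams(max_tokens=max_tokens, **SAMPLING_CONFIG)
    return llm, params


def generate_batch(llm: Any, params: Any, prompts: List[str]) -> List[str]:
    """Send all prompts in one call so vLLM can batch them internally."""
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]
```

Passing every chunk prompt for a stage to `generate_batch` in a single call lets the engine schedule them together instead of running one request at a time.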

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

BookSum (public)
MovieSum (public)
MENSA (public components cited)
SummScreenFD (public)

Risks & Boundaries

Limitations

Readability gap: experts rated zero-shot outputs as more readable despite lower factuality.

Compute and cost: experiments use four A100 GPUs and large LLMs (Mistral-Large-Instruct-2407, 123B).

When Not To Use

When short, highly fluent summaries are the top priority (zero-shot may be preferred).

When GPU budget or latency constraints prohibit multi-stage LLM calls.

Failure Modes

Over-compression reduces readability and removes contextual background.

Wrong chunk sizing leads to loss of long-range dependencies or excessive verbosity.

Core Entities

Models

Mistral-Large-Instruct-2407 (123B)
Claude 3 Haiku
GPT-4o
GPT-4o-mini

Metrics

BERTScore (F1)
ROUGE (1/2/L)
Length Adherence Rate (LAR)
NarrativeFactScore
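This page does not define Length Adherence Rate, so the sketch below uses an assumed formulation: one minus the relative deviation from the target word count, floored at zero and averaged over summaries. It may differ from the paper's definition and is intended only for checking length control in your own prototypes. Function names are illustrative.

```python
from typing import Sequence


def length_adherence(actual_words: int, target_words: int) -> float:
    """Assumed score: 1.0 at the exact target, falling linearly to 0.0."""
    if target_words <= 0:
        raise ValueError("target_words must be positive")
    deviation = abs(actual_words - target_words) / target_words
    return max(0.0, 1.0 - deviation)


def mean_length_adherence(summaries: Sequence[str], target_words: int) -> float:
    """Average the per-summary score over a non-empty batch of summaries."""
    if not summaries:
        raise ValueError("summaries must not be empty")
    scores = [length_adherence(len(s.split()), target_words) for s in summaries]
    return sum(scores) / len(scores)


print(length_adherence(450, 900))  # 0.5: half the target length
```

Under this formulation a summary of 891 words against a 900-word target scores 0.99, while one at more than double the target scores 0.0.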

Datasets

BookSum
MovieSum
MENSA
SummScreenFD

Benchmarks

BookSum
MovieSum
MENSA
SummScreenFD