NEXUSSUM: a three-agent LLM pipeline that converts dialogue, chunks scenes, and iteratively compresses to summarize books, movies, and TV

Overview

Decision Snapshot: Needs Validation

The method is practical: it needs no finetuning, runs on off-the-shelf LLMs, and shows consistent metric gains and human-eval wins, but it requires GPU resources and prompt tuning to close readability gaps.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Hyuntak Kim, Byung-Hak Kim

Links

Abstract / PDF / Data

Why It Matters For Business

NEXUSSUM lets teams produce accurate, length-controlled summaries of very long narratives without model finetuning, improving semantic quality and factuality for products like content discovery, synopsis generation, and archival indexing.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO, Data Scientist

Summary TLDR

NEXUSSUM is a three-stage multi-LLM system for summarizing very long narratives without fine-tuning. It first converts dialogue into third-person prose, then builds scene-based hierarchical summaries, and finally applies iterative compression to meet a length target. On four public narrative benchmarks (BookSum, MovieSum, MENSA, SummScreenFD) it raises semantic quality as measured by BERTScore F1, with a reported gain of up to +30% over the prior state of the art on BookSum. It also gives tight length control (Length Adherence Rate of about 0.99 versus about 0.4 for zero-shot prompting at a 900-word target), and an agentic refinement step improves factuality.

Problem Statement

Long narratives combine prose and multi-speaker dialogue across tens of thousands of tokens. Standard LLMs lose context, extractive pipelines drop plot coherence, and zero-shot prompts fail to control length. The paper targets accurate, coherent summaries at controlled lengths without model finetuning.

Main Contribution

Dialogue-to-Description Preprocessor: converts multi-speaker dialogue into unified third-person prose to reduce fragmentation.

Hierarchical Multi-LLM Pipeline: three agents (Preprocessor P, Summarizer S, Compressor C) that chunk, summarize, and iteratively compress long narratives.
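The sequential P → S → C flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the function names, and the one-line instructions are hypothetical stand-ins for the paper's prompts, chunks are cut by plain word count rather than by scene, and compression is shown as a single pass rather than the iterative loop.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to generated text.
LLM = Callable[[str], str]


def split_words(text: str, size: int) -> List[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def run_stage(llm: LLM, instruction: str, text: str, size: int) -> str:
    """Chunk-and-concat: apply one agent to every chunk, then join the outputs."""
    outputs = [llm(f"{instruction}\n\n{chunk}") for chunk in split_words(text, size)]
    return "\n".join(outputs)


def nexussum_like(text: str, llm: LLM, size: int = 300) -> str:
    """Run the three agents in order; each stage consumes the previous output."""
    # P: dialogue-to-description preprocessing (placeholder instruction).
    prose = run_stage(llm, "Rewrite all dialogue as third-person prose.", text, size)
    # S: per-chunk summarization (placeholder instruction).
    summary = run_stage(llm, "Summarize this passage.", prose, size)
    # C: one compression pass (placeholder instruction).
    return run_stage(llm, "Compress this summary, keeping key events.", summary, size)
```

Because the model is injected as a plain callable, the same skeleton can be exercised with a stub function in tests and with a real inference backend in a prototype.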

Key Findings

NEXUSSUM achieves large semantic gains on long narratives, especially books.

Numbers: BookSum: +30.0% BERTScore (F1) vs CachED

Practical Use: Use the NEXUSSUM pipeline when summarizing full novels to substantially improve semantic similarity to human summaries on evaluated benchmarks.

Evidence Ref: Section 5.1; Figure 2; Table 2

Each agent stage adds measurable improvements; full pipeline outperforms partial variants on screenplay data.

Numbers: MENSA: Zero-shot 54.81 → P+S+C 65.73 BERTScore (+10.92)

Practical Use: Keep the three-stage setup (preprocess → summarize → compress) rather than skipping steps if you care about accuracy on scripts.

Evidence Ref: Table 3 (Ablation)
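The BookSum finding is stated as a percentage, while the MENSA ablation is stated in raw BERTScore points. The short sketch below converts the MENSA numbers into both forms so the two findings can be read on the same footing. The helper names are illustrative, and treating the reported percentages as relative gains is an assumption.

```python
def absolute_gain(baseline: float, score: float) -> float:
    """Difference in metric points, rounded to two decimals."""
    return round(score - baseline, 2)


def relative_gain_pct(baseline: float, score: float) -> float:
    """Gain as a percentage of the baseline, rounded to one decimal."""
    return round(100 * (score - baseline) / baseline, 1)


# MENSA ablation values from the finding above: zero-shot vs full P+S+C pipeline.
points = absolute_gain(54.81, 65.73)       # 10.92 points
percent = relative_gain_pct(54.81, 65.73)  # 19.9 percent of the baseline
```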

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BERTScore (F1) - BookSum	+30.0% over CachED	CachED (BART Large, 406M)	+30.0%	BookSum	Section 5.1; Table 2	Table 2
BERTScore (F1) - MovieSum	63.53	HM-SR / prior SOTA	+7.1% vs HM-SR	MovieSum	Section 5.1; Table 2	Table 2

What To Try In 7 Days

Run the dialogue-to-description preprocessor on a small corpus of scripts or transcripts and compare coherence vs raw input.

Prototype a 3-stage pipeline (P→S→C) using an off-the-shelf LLM and vLLM inference to test length control on one dataset.

Tune chunk size δ (start at 300 words for books/scripts) and target lower bound θ to match your desired summary length.
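One way to experiment with the chunk size δ is a greedy packer that merges consecutive scenes until the word budget is reached. This is a sketch under assumptions: the function name is hypothetical, scenes are taken as an already-split list of strings (how scene boundaries are detected is left open), and an oversized scene is kept whole rather than cut.

```python
from typing import List


def pack_scenes(scenes: List[str], delta: int) -> List[str]:
    """Greedily merge consecutive scenes into chunks of at most `delta` words.

    A single scene longer than `delta` is emitted as its own chunk, unsplit.
    """
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for scene in scenes:
        n = len(scene.split())
        # Flush the open chunk if adding this scene would exceed the budget.
        if current and count + n > delta:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(scene)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Calling `pack_scenes(scenes, delta=300)` matches the suggested starting point. Re-running with a few other values of `delta` and comparing summary quality and call count is a cheap way to see how sensitive a given dataset is to this setting.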

Agent Features

Memory

scene-based chunking (short-term during pipeline)

sentence-level chunking in compression

Planning

iterative compression (length-driven)

Chain-of-Thought self-planning for preprocessing

Tool Use

vLLM (inference optimization)

Mistral-Large-Instruct-2407

Claude 3 Haiku

GPT4o

Frameworks

NEXUSSUM

Is Agentic

Yes

Architectures

hierarchical multi-agent pipeline

chunk-and-concat processing

Collaboration

sequential agents: Preprocessor (P) → Summarizer (S) → Compressor (C)

optional refinement/rewrite agent (NEXUSSUMR) for readability

Optimization Features

Token Efficiency

chunking to limit per-call context size

iterative compression to reduce final token output

Infra Optimization

authors run on four A100 GPUs

System Optimization

dynamic chunk sizing per dataset

iteration cap (max 10) to bound cost
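A length-driven compression loop with an iteration cap might look like the sketch below. The stopping rule is an assumption rather than the paper's exact criterion: a pass is rejected if it would drop the summary below the lower bound θ (`theta`) or if it fails to shorten the text, and the loop never runs more than `max_iters` passes. The `compress` callable and the function name are hypothetical.

```python
from typing import Callable


def compress_to_bound(
    summary: str,
    compress: Callable[[str], str],
    theta: int,
    max_iters: int = 10,
) -> str:
    """Repeatedly compress `summary` without going below `theta` words."""
    current = summary
    for _ in range(max_iters):
        candidate = compress(current)
        n_new = len(candidate.split())
        n_old = len(current.split())
        # Stop if the pass undershoots the lower bound or makes no progress.
        if n_new < theta or n_new >= n_old:
            break
        current = candidate
    return current
```

The cap bounds the worst-case number of LLM calls per summary to `max_iters`, which keeps cost predictable even when the model shortens the text only slightly on each pass.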

Training Optimization

no fine-tuning required; prompt engineering and few-shot used

Inference Optimization

use vLLM for efficient batched inference

temperature=0.3, top-p=1.0 (authors' config)
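For a prototype, the reported decoding settings can be passed to vLLM roughly as follows. Only `temperature` and `top_p` come from the source; the `generate` helper, the `max_tokens` default, and the use of `tensor_parallel_size` to spread the model over four GPUs are assumptions made for this sketch.

```python
from typing import Dict, List

# Decoding settings listed above as the authors' configuration.
SAMPLING: Dict[str, float] = {"temperature": 0.3, "top_p": 1.0}


def generate(
    prompts: List[str],
    model: str,
    max_tokens: int = 1024,  # assumed output budget, not from the source
    n_gpus: int = 4,         # assumed mapping of "four A100 GPUs" to tensor parallelism
) -> List[str]:
    """Batch-generate one completion per prompt with vLLM."""
    # Imported lazily so this module can be loaded on machines without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model, tensor_parallel_size=n_gpus)
    params = SamplingParams(max_tokens=max_tokens, **SAMPLING)
    # llm.generate returns one RequestOutput per prompt, in input order.
    return [out.outputs[0].text for out in llm.generate(prompts, params)]
```

Passing all chunk prompts for a stage in a single `generate` call lets vLLM batch them, which is where most of the throughput benefit comes from in a chunk-and-concat pipeline.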

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

BookSum (public)

MovieSum (public)

MENSA (public components cited)

SummScreenFD (public)

Risks & Boundaries

Limitations

Readability gap: experts rated zero-shot outputs as more readable despite lower factuality.

Compute and cost: experiments use four A100 GPUs and large LLMs (Mistral 123B).

When Not To Use

When short, highly fluent summaries are the top priority (zero-shot may be preferred).

When GPU budget or latency constraints prohibit multi-stage LLM calls.

Failure Modes

Over-compression reduces readability and removes contextual background.

Wrong chunk sizing leads to loss of long-range dependencies or excessive verbosity.

Core Entities

Models

Mistral-Large-Instruct-2407 (123B)

Claude 3 Haiku

GPT4o

GPT4o-mini

Metrics

BERTScore (F1)

ROUGE (1/2/L)

Length Adherence Rate (LAR)

NarrativeFactScore

Datasets

BookSum

MovieSum

MENSA

SummScreenFD

Benchmarks

BookSum

MovieSum

MENSA

SummScreenFD
