Overview
The method is practical: it needs no finetuning, runs on off-the-shelf LLMs, and shows consistent metric gains and human-eval wins, but it requires GPU resources and prompt tuning to close readability gaps.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
NEXUSSUM lets teams produce accurate, length-controlled summaries of very long narratives without model finetuning, improving semantic quality and factuality for products like content discovery, synopsis generation, and archival indexing.
Summary TLDR
NEXUSSUM is a three-stage multi-LLM system for summarizing very long narratives without fine-tuning. It first converts dialogue into third-person prose, then creates scene-based hierarchical summaries, and finally applies iterative compression to meet length targets. On public narrative benchmarks (BookSum, MovieSum, MENSA, SummScreenFD) it raises semantic quality as measured by BERTScore F1, with a reported gain of up to +30% over prior SOTA on BookSum. It also gives precise length control (Length Adherence Rate ~0.99 vs ~0.4 for zero-shot at 900 words) and improves factuality with an agentic refinement step.
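To make the length-control comparison concrete, here is a minimal sketch of a length adherence score. This summary does not give the formula for Length Adherence Rate, so the definition below (one minus the relative deviation from the target, floored at zero and averaged over summaries) is an assumption for illustration, and `length_adherence` is a hypothetical helper name.

```python
from typing import Sequence


def length_adherence(word_counts: Sequence[int], target: int) -> float:
    """Average closeness of summary lengths to a target word count.

    ASSUMED definition (not taken from the paper): each summary scores
    max(0, 1 - |n - target| / target), and the scores are averaged.
    """
    if target <= 0:
        raise ValueError("target must be a positive word count")
    if not word_counts:
        raise ValueError("word_counts must not be empty")
    scores = [max(0.0, 1.0 - abs(n - target) / target) for n in word_counts]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Two summaries aimed at 900 words: one on target, one at half length.
    print(length_adherence([900, 450], target=900))
```

Under this assumed definition a score near 1.0 means the outputs land close to the requested length, which is the behaviour the ~0.99 figure describes.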
Problem Statement
Long narratives combine prose and multi-speaker dialogue across tens of thousands of tokens. Standard LLMs lose context, extractive pipelines drop plot coherence, and zero-shot prompts fail to control length. The paper targets accurate, coherent summaries at controlled lengths without model finetuning.
Main Contribution
Dialogue-to-Description Preprocessor: converts multi-speaker dialogue into unified third-person prose to reduce fragmentation.
Hierarchical Multi-LLM Pipeline: three agents (Preprocessor P, Summarizer S, Compressor C) that chunk, summarize, and iteratively compress long narratives.
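The three agents can be sketched as a simple chain. This is an illustrative outline rather than the authors' implementation: the `LLM` callable interface, the prompt strings, the `max_rounds` guard, and the word-based chunking (standing in for scene-based chunking) are all assumptions. The stopping rule, repeating compression while the summary exceeds θ words, is one reading of the target lower bound.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def chunk_words(text: str, delta: int) -> List[str]:
    """Split text into consecutive chunks of at most `delta` words."""
    words = text.split()
    return [" ".join(words[i:i + delta]) for i in range(0, len(words), delta)]


def map_chunks(llm: LLM, instruction: str, text: str, delta: int) -> str:
    """Apply one instruction to every chunk and join the outputs in order."""
    outputs = [llm(f"{instruction}\n\n{chunk}") for chunk in chunk_words(text, delta)]
    return "\n".join(outputs)


def summarize_long(text: str, llm: LLM, delta: int = 300, theta: int = 900,
                   max_rounds: int = 5) -> str:
    """Run Preprocessor (P), Summarizer (S), then Compressor (C)."""
    # Stage P: turn multi-speaker dialogue into third-person prose.
    prose = map_chunks(llm, "Rewrite any dialogue as third-person narrative prose.",
                       text, delta)
    # Stage S: summarize each chunk of the preprocessed text.
    summary = map_chunks(llm, "Summarize this passage.", prose, delta)
    # Stage C: compress repeatedly while the summary is still too long.
    # Assumed stopping rule; max_rounds guards against a model that never shortens.
    rounds = 0
    while len(summary.split()) > theta and rounds < max_rounds:
        summary = map_chunks(llm, "Compress this summary, keeping key plot events.",
                             summary, delta)
        rounds += 1
    return summary
```

Passing the model in as a plain callable keeps the sketch independent of any serving stack, so the same function can be driven by a stub in tests or by a real inference client.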
Key Findings
NEXUSSUM achieves large semantic gains on long narratives, especially books.
Each agent stage adds measurable improvements; full pipeline outperforms partial variants on screenplay data.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| BERTScore (F1) - BookSum | +30.0% over CachED | CachED (BART Large, 406M) | +30.0% | BookSum | Section 5.1; Table 2 | Table 2 |
| BERTScore (F1) - MovieSum | 63.53 | HM-SR / prior SOTA | +7.1% vs HM-SR | MovieSum | Section 5.1; Table 2 | Table 2 |
What To Try In 7 Days
Run the dialogue-to-description preprocessor on a small corpus of scripts or transcripts and compare coherence vs raw input.
Prototype a 3-stage pipeline (P→S→C) using an off-the-shelf LLM and vLLM inference to test length control on one dataset.
Tune chunk size δ (start at 300 words for books/scripts) and target lower bound θ to match your desired summary length.
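When tuning δ, it helps to see how many chunks, and therefore roughly how many LLM calls per pass, a given chunk size implies. The helper below is a back-of-the-envelope sketch: `chunk_count` and `sweep_chunk_sizes` are hypothetical names, and the estimate assumes fixed-size word chunks with one call per chunk.

```python
from typing import Dict, Iterable


def chunk_count(n_words: int, delta: int) -> int:
    """Number of chunks needed to cover n_words with chunks of delta words."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-n_words // delta)


def sweep_chunk_sizes(n_words: int, deltas: Iterable[int]) -> Dict[int, int]:
    """Map each candidate chunk size to its chunk count for one pass."""
    return {delta: chunk_count(n_words, delta) for delta in deltas}


if __name__ == "__main__":
    # A 100,000-word narrative at three candidate chunk sizes.
    for delta, calls in sweep_chunk_sizes(100_000, [150, 300, 600]).items():
        print(f"delta={delta}: about {calls} calls per pass")
```

Smaller chunks multiply the number of calls per pass, while larger chunks put more text in front of the model at once, so a sweep like this is a cheap first check before running any model.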
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Readability gap: experts rated zero-shot outputs as more readable despite lower factuality.
Compute and cost: experiments use four A100 GPUs and large LLMs (Mistral 123B).
When Not To Use
When short, highly fluent summaries are the top priority (zero-shot may be preferred).
When GPU budget or latency constraints prohibit multi-stage LLM calls.
Failure Modes
Over-compression reduces readability and removes contextual background.
Wrong chunk sizing leads to loss of long-range dependencies or excessive verbosity.

