Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
NEXUSSUM lets teams produce accurate, length-controlled summaries of very long narratives without model finetuning, improving semantic quality and factuality for products like content discovery, synopsis generation, and archival indexing.
Summary TLDR
NEXUSSUM is a three-stage multi-LLM system for summarizing very long narratives without fine-tuning. It first converts dialogues into third-person prose, then creates scene-based hierarchical summaries, and finally applies iterative compression to meet length targets. On public narrative benchmarks (BookSum, MovieSum, MENSA, SummScreenFD) it raises semantic quality (BERTScore F1), claims up to +30% vs prior SOTA on BookSum, gives precise length control (Length Adherence Rate ~0.99 vs ~0.4 for zero-shot at 900 words), and boosts factuality with an agentic refinement step.
Problem Statement
Long narratives combine prose and multi-speaker dialogue across tens of thousands of tokens. Standard LLMs lose context, extractive pipelines drop plot coherence, and zero-shot prompts fail to control length. The paper targets accurate, coherent summaries at controlled lengths without model finetuning.
Main Contribution
Dialogue-to-Description Preprocessor: converts multi-speaker dialogue into unified third-person prose to reduce fragmentation.
Hierarchical Multi-LLM Pipeline: three agents (Preprocessor P, Summarizer S, Compressor C) that chunk, summarize, and iteratively compress long narratives.
Dynamic length control: iterative compression with a lower bound target enforces word-count constraints while preserving key facts.
Empirical tuning recipes: guidance on chunk size (δ) and lower-bound (θ) per dataset to trade off detail and compression.
State-of-the-art results: large gains in automatic metrics and human-preference analyses across four long-form benchmarks.
Key Findings
NEXUSSUM achieves large semantic gains on long narratives, especially books.
Each agent stage adds measurable improvements; full pipeline outperforms partial variants on screenplay data.
NEXUSSUM provides much tighter length control than zero-shot prompting.
Chunk size strongly controls compression rate and quality.
Factuality is high and improves with agent refinements.
Results
BERTScore (F1) - BookSum
BERTScore (F1) - MovieSum
BERTScore (F1) - MENSA
Length Adherence Rate (LAR)
NarrativeFactScore
Who Should Care
What To Try In 7 Days
Run the dialogue-to-description preprocessor on a small corpus of scripts or transcripts and compare coherence vs raw input.
Prototype a 3-stage pipeline (P→S→C) using an off-the-shelf LLM and vLLM inference to test length control on one dataset.
Tune chunk size δ (start at 300 words for books/scripts) and target lower bound θ to match your desired summary length.
Agent Features
Memory
- scene-based chunking (short-term during pipeline)
- sentence-level chunking in compression
Planning
- iterative compression (length-driven)
- Chain-of-Thought self-planning for preprocessing
Tool Use
- vLLM (inference optimization)
- Mistral-Large-Instruct-2407
- Claude 3 Haiku
- GPT4o
Frameworks
- NEXUSSUM
Is Agentic
true
Architectures
- hierarchical multi-agent pipeline
- chunk-and-concat processing
Collaboration
- sequential agents: Preprocessor (P) → Summarizer (S) → Compressor (C)
- optional refinement/rewrite agent (NEXUSSUMR) for readability
Optimization Features
Token Efficiency
- chunking to limit per-call context size
- iterative compression to reduce final token output
Infra Optimization
- authors run on four A100 GPUs
System Optimization
- dynamic chunk sizing per dataset
- iteration cap (max 10) to bound cost
Training Optimization
- no fine-tuning required; prompt engineering and few-shot used
Inference Optimization
- use vLLM for efficient batched inference
- temperature=0.3, top-p=1.0 (authors' config)
Reproducibility
Data Urls
- BookSum (public)
- MovieSum (public)
- MENSA (public components cited)
- SummScreenFD (public)
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Readability gap: experts rated zero-shot outputs as more readable despite lower factuality.
- Compute and cost: experiments use four A100 GPUs and large LLMs (Mistral 123B).
- Sensitivity to hyperparameters: chunk size (δ) and lower bound (θ) materially affect quality.
- Automated metrics imperfect: high BERTScore/ROUGE do not guarantee human-preferred fluency.
When Not To Use
- When short, highly fluent summaries are the top priority (zero-shot may be preferred).
- When GPU budget or latency constraints prohibit multi-stage LLM calls.
- When you cannot define clear target lengths or lower bounds for summaries.
Failure Modes
- Over-compression reduces readability and removes contextual background.
- Wrong chunk sizing leads to loss of long-range dependencies or excessive verbosity.
- Rigid preprocessor phrasing can produce dense, less natural prose without a rewrite step.
- Excessive iterative compression may induce omissions or subtle hallucinations without factual refinement.
Core Entities
Models
- Mistral-Large-Instruct-2407 (123B)
- Claude 3 Haiku
- GPT4o
- GPT4o-mini
Metrics
- BERTScore (F1)
- ROUGE (1/2/L)
- Length Adherence Rate (LAR)
- NarrativeFactScore
Datasets
- BookSum
- MovieSum
- MENSA
- SummScreenFD
Benchmarks
- BookSum
- MovieSum
- MENSA
- SummScreenFD

