Overview
Production Readiness
0.7
Novelty Score
0.3
Cost Impact Score
0.8
Citation Count
11
Why It Matters For Business
LLMs let teams deploy usable summaries quickly with zero/few-shot prompts, but hallucination and unreliable automatic metrics mean businesses must pair LLMs with retrieval, human checks, or smaller fine-tuned models for safety.
Summary TLDR
This is a 30-page survey that traces text summarization from classic statistical methods through neural and PLM fine-tuning to the current LLM era. It catalogs datasets, evaluation metrics, pre-LLM methods (statistical, deep learning, PLMs) and LLM-era work (benchmarking, modeling, and evaluation). The authors highlight that LLMs enable strong zero-/few-shot summarization but expose gaps in evaluation, factuality, bias, and efficiency. The paper synthesizes trends, open challenges (hallucination, bias, compute, personalization, interpretability) and practical future directions (multimodal summarization, domain-specific LLMs, human-in-the-loop workflows).
Problem Statement
Summarization methods have evolved rapidly (statistical → deep learning → PLM fine-tuning → LLMs). This survey asks: what changed, how do we evaluate modern summarizers, what are gaps introduced by LLMs, and what practical directions help researchers and engineers adapt?
Main Contribution
Comprehensive survey covering summarization across four paradigms: statistical, deep learning, PLM fine-tuning, and LLMs.
First focused taxonomy and organized review of LLM-based summarization literature (benchmarks, modeling, evaluation).
Curated tables of datasets, metrics, representative methods, and CNN/DM quantitative comparisons plus discussion of open challenges and future directions.
Key Findings
LLMs shift summarization to zero- and few-shot settings and often produce human-preferred summaries in human studies.
Automatic metrics like ROUGE can misjudge LLM outputs and understate human-level quality.
Factuality and hallucination remain major open problems for LLM summarization.
Position bias (including lead bias) is persistent: models favor content at the start or end of the input and tend to neglect the middle.
PLM fine-tuning and PLM-specific pretraining delivered steady ROUGE gains on CNN/DM before LLMs.
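The finding that ROUGE can misjudge LLM outputs is easier to see with a small calculation. The sketch below is a simplified ROUGE-1 F1 (clipped unigram overlap on lowercased whitespace tokens, no stemming), not the official ROUGE toolkit; the function name and the example sentences are illustrative assumptions, not taken from the survey.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming or stopword handling."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative sentences (assumed for this example only).
reference = "the company reported higher profits this quarter"
copy_like = "the company reported higher profits"
paraphrase = "earnings rose at the firm in the latest period"

copy_score = rouge1_f1(copy_like, reference)
paraphrase_score = rouge1_f1(paraphrase, reference)
```

Under these assumptions the near-copy scores far higher than the paraphrase even though a human reader might accept both, which is the kind of gap that motivates pairing ROUGE with human or LLM-based evaluation.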
Results
ROUGE-1 (CNN/DM)
LLM zero/few-shot human preference
Position bias (coverage pattern)
Who Should Care
What To Try In 7 Days
Run prompt-based zero-shot summaries with an LLM on your domain and compare with a small fine-tuned PLM using human spot-checks.
Add simple retrieval (RAG) to the prompt to reduce hallucinations for factual content.
Evaluate outputs with an LLM-based evaluator (GPTScore/G-Eval) and a small human panel to identify metric gaps.
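The retrieval step suggested above can be prototyped without any external service. This is a minimal sketch, assuming a plain lexical-overlap retriever and a hypothetical prompt template; `tokenize`, `top_k_passages`, and `build_prompt` are illustrative names, and a production RAG setup would more likely use embeddings or BM25.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase alphanumeric tokens as a set (assumed, deliberately crude)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_passages(query: str, passages: list, k: int = 2) -> list:
    """Rank passages by the number of tokens they share with the query."""
    q = tokenize(query)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list, k: int = 2) -> str:
    """Assemble a grounded prompt; the wording is a hypothetical template."""
    context = "\n".join(f"- {p}" for p in top_k_passages(query, passages, k))
    return (
        "Summarize the answer to the question using only the passages below.\n"
        f"Passages:\n{context}\n"
        f"Question: {query}\n"
    )
```

The resulting string would be sent to whichever LLM is under test; restricting the model to retrieved passages is the part intended to reduce unsupported claims.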
Agent Features
Tool Use
- LLM evaluators
- retrieval modules (RAG)
- knowledge extractors
Frameworks
- Iterative refinement (SummIt)
- self-evaluation agents
Architectures
- multi-agent summarizers (summarizer + evaluator loop)
- prompt-chaining (draft/critique/refine)
- chain-of-thought prompting
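The draft/critique/refine pattern listed above can be sketched as a short loop. This is an assumption-laden illustration, not the implementation of SummIt or any specific system: `llm` is a hypothetical callable that maps a prompt string to a response string, and the prompt wording and the "OK" stop signal are invented for the example.

```python
from typing import Callable


def draft_critique_refine(document: str, llm: Callable[[str], str], rounds: int = 2) -> str:
    """Draft a summary, then alternate critique and rewrite for up to `rounds` passes."""
    summary = llm(f"Summarize the following document:\n{document}")
    for _ in range(rounds):
        critique = llm(
            "List factual errors or omissions in this summary, or reply OK.\n"
            f"Document:\n{document}\nSummary:\n{summary}"
        )
        # Stop early once the evaluator step reports nothing to fix.
        if critique.strip().upper() == "OK":
            break
        summary = llm(
            "Rewrite the summary to address the critique.\n"
            f"Document:\n{document}\nSummary:\n{summary}\nCritique:\n{critique}"
        )
    return summary
```

Passing the model in as a callable keeps the loop independent of any vendor API and makes it easy to substitute a stub when testing the control flow.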
Optimization Features
Token Efficiency
- chunking and hierarchical summarization to limit context
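Chunking with hierarchical summarization can be sketched as follows. This is a minimal illustration under stated assumptions: chunks are measured in whitespace-separated words rather than model tokens, and `summarize` is a hypothetical callable (for example, a wrapper around an LLM call) supplied by the caller.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def hierarchical_summarize(text: str, summarize: Callable[[str], str], max_words: int = 200) -> str:
    """Summarize each chunk, then recursively summarize the joined partial summaries."""
    if max_words < 1:
        raise ValueError("max_words must be positive")
    chunks = chunk_words(text, max_words)
    if len(chunks) <= 1:
        return summarize(text)
    merged = " ".join(summarize(chunk) for chunk in chunks)
    # Guard: if the partial summaries did not shrink the text, stop recursing.
    if len(merged.split()) >= len(text.split()):
        return merged
    return hierarchical_summarize(merged, summarize, max_words)
```

Each call to `summarize` sees at most `max_words` words, which is what limits the context per request; the trade-off is that cross-chunk information can be lost at each level.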
Infra Optimization
- smaller LLMs vs large LLMs cost/quality trade-offs
Model Optimization
- distillation to smaller models (InheritSumm, TriSum)
- LoRA
System Optimization
- RAG to reduce hallucination and context length
Training Optimization
- instruction fine-tuning
- contrastive learning (SimCLS, BRIO)
- reinforcement learning (RL)
Inference Optimization
- prompt-chaining vs stepwise prompting trade-offs
- sliding-window generation for long docs
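The sliding-window idea for long documents can be illustrated with a small helper that produces overlapping windows over a token list. This is a sketch under assumptions: `size` and `stride` are hypothetical parameter names, and how each window is then summarized or merged is left to the surrounding pipeline.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence, size: int, stride: int) -> List[list]:
    """Return overlapping windows of `size` tokens, advancing by `stride` each step."""
    if size <= 0 or stride <= 0 or stride > size:
        raise ValueError("require 0 < stride <= size")
    if not tokens:
        return []
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + size]))
        # Stop once the current window reaches the end of the input.
        if start + size >= len(tokens):
            break
        start += stride
    return windows
```

Choosing `stride` smaller than `size` makes consecutive windows share tokens, so content near a boundary appears in two windows instead of being cut in half.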
Reproducibility
Data Urls
- public datasets listed (CNN/DM, XSum, PubMed, arXiv, MultiNews, QMSum, etc.)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Survey synthesizes existing studies; it does not provide new empirical experiments beyond aggregated tables.
- ROUGE-centric quantitative comparisons have known blind spots for LLM outputs.
- Coverage of very recent closed-source LLM internals and proprietary benchmarks is limited by available literature.
When Not To Use
- Do not rely on LLM-only summarizers for high-stakes medical, legal, or regulatory outputs without human verification.
- Avoid trusting automatic ROUGE-only comparisons when choosing an LLM summary pipeline.
Failure Modes
- Hallucination: inventing facts not in source.
- Omission: missing important mid-document content due to position bias.
- Bias: underrepresenting perspectives or amplifying dataset biases.
- Verbosity mismatch: output is longer than needed when concise summaries are required.
- Metric mismatch: automatic scores diverge from human judgment.
Core Entities
Models
- BERT
- BART
- T5
- PEGASUS
- LED
- PRIMERA
- GPT-3
- GPT-3.5/ChatGPT
- GPT-4
- Claude-2
- LLaMA/Llama 2
- Flan-T5
Metrics
- ROUGE
- BERTScore
- MoverScore
- FactCC
- SummaC
- FEQA/QAGS (QA-based)
- GPTScore
- G-Eval
Datasets
- CNN/DM
- XSum
- NYT
- NEWSROOM
- Gigaword
- CCSUM
- WikiHow
- Reddit/TIFU
- SAMSum
- PubMed
- arXiv
- MultiNews
- QMSum
- DUC
- XL-Sum
Benchmarks
- CNN/DM
- XSum
- MultiNews
- QMSum
- DUC
- MiddleSum
- AggreFact (factuality datasets)

