Overview
This survey compiles existing evidence and provides practical guidance; it is broadly useful but primarily synthesizes prior work rather than introducing new experiments.
Citations: 11
Evidence Strength: 0.75
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 30%
Why It Matters For Business
LLMs let teams deploy usable summaries quickly with zero/few-shot prompts, but hallucination and unreliable automatic metrics mean businesses must pair LLMs with retrieval, human checks, or smaller fine-tuned models for safety.
Summary TLDR
This is a 30-page survey that traces text summarization from classic statistical methods through neural and PLM fine-tuning to the current LLM era. It catalogs datasets, evaluation metrics, pre-LLM methods (statistical, deep learning, PLMs) and LLM-era work (benchmarking, modeling, and evaluation). The authors highlight that LLMs enable strong zero-/few-shot summarization but expose evaluation, factuality, bias, and efficiency gaps. The paper synthesizes trends, open challenges (hallucination, bias, compute, personalization, interpretability) and practical future directions (multimodal, domain LLMs, human-in-loop).
Problem Statement
Summarization methods have evolved rapidly (statistical → deep learning → PLM fine-tuning → LLMs). This survey asks: what changed, how do we evaluate modern summarizers, what are gaps introduced by LLMs, and what practical directions help researchers and engineers adapt?
Main Contribution
Comprehensive survey covering summarization across four paradigms: statistical, deep learning, PLM fine-tuning, and LLMs.
First focused taxonomy and organized review of LLM-based summarization literature (benchmarks, modeling, evaluation).
Key Findings
LLMs shift summarization to zero- and few-shot settings, and human studies often find that annotators prefer their summaries.
Automatic metrics like ROUGE can misjudge LLM outputs and understate human-level quality.
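To make the metric gap concrete, here is a minimal sketch of a unigram-overlap F1 in the spirit of ROUGE-1. It is a simplification, assuming whitespace tokenization and no stemming, and is not the official ROUGE implementation; the function name `rouge1_f1` and the example sentences are illustrative and do not come from the survey.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts only as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the company reported record profits this quarter"
copy_like = "the company reported record profits"
paraphrase = "earnings hit an all-time high in the latest three months"

score_copy = rouge1_f1(reference, copy_like)         # near-verbatim wording
score_paraphrase = rouge1_f1(reference, paraphrase)  # similar meaning, new wording

print(f"copy-like:  {score_copy:.3f}")
print(f"paraphrase: {score_paraphrase:.3f}")
```

The paraphrase conveys roughly the same fact but shares only one token with the reference, so it scores far lower than the near-verbatim candidate. This is the kind of lexical-overlap blind spot that can penalize fluent, abstractive LLM output.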
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ROUGE-1 (CNN/DM) | SimCLS 46.67; SliSum (Claude2) 47.75; BART 44.16; PEGASUS 44.17 | TextRank 33.2 (2004) | Modern PLMs & contrastive methods improve R1 ≈ +11–14 vs early unsupervised | CNN/DM | Table 6 in paper lists per-model ROUGE scores | Table 6 |
| LLM zero/few-shot human preference | Human studies report annotator preference for GPT-3/GPT-4 summaries over baselines | Fine-tuned supervised systems (various) | Human preference despite lower automatic scores | News benchmarks (CNN/DM, XSum) | Benchmarking studies summarized in Section 4.1 | [69, 270, 172] |
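The delta in the first row can be checked with simple arithmetic. The sketch below uses only the ROUGE-1 values quoted in the table; the variable names are illustrative.

```python
# ROUGE-1 scores on CNN/DM as quoted in the table above.
rouge1 = {
    "SimCLS": 46.67,
    "SliSum (Claude2)": 47.75,
    "BART": 44.16,
    "PEGASUS": 44.17,
}
textrank_baseline = 33.2

# Gain of each system over the TextRank baseline, rounded to two decimals.
deltas = {name: round(score - textrank_baseline, 2) for name, score in rouge1.items()}

for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:18s} +{delta:.2f}")
```

The PLM and contrastive systems (BART, PEGASUS, SimCLS) land between about +11 and +13.5, and the LLM-based SliSum entry is about +14.6, which is consistent with the stated range.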
What To Try In 7 Days
Run prompt-based zero-shot summaries with an LLM on your domain and compare with a small fine-tuned PLM using human spot-checks.
Add simple retrieval (RAG) to the prompt to reduce hallucinations for factual content.
Evaluate outputs with an LLM-based evaluator (GPTScore/G-Eval) and a small human panel to identify metric gaps.
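The first two steps can be sketched as a prompt builder with a toy retriever. This is a minimal illustration under stated assumptions: the retriever is a plain token-overlap ranker standing in for a real retrieval system, the prompt wording is made up, and `overlap_score`, `retrieve`, and `build_prompt` are hypothetical helper names. No LLM is called; the resulting string would be passed to whichever model API you use.

```python
def overlap_score(query: str, passage: str) -> int:
    """Count passage tokens that also appear in the query (toy lexical retriever)."""
    query_tokens = set(query.lower().split())
    return sum(1 for token in passage.lower().split() if token in query_tokens)


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages with the highest overlap score."""
    return sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)[:k]


def build_prompt(document: str, query: str, passages: list[str], k: int = 2) -> str:
    """Assemble a zero-shot summarization prompt grounded in retrieved context."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return (
        "Summarize the document in two sentences. "
        "Use only facts stated in the document or the context.\n\n"
        f"Context:\n{context}\n\n"
        f"Document:\n{document}\n\n"
        "Summary:"
    )


# Hypothetical knowledge-base passages and input document.
passages = [
    "Q3 revenue was 4.2 million dollars according to the audited report.",
    "The office cafeteria menu changes every Monday.",
    "Revenue growth in Q3 was driven by the enterprise segment.",
]
document = "Management announced Q3 results on Tuesday and raised guidance."

prompt = build_prompt(document, query="Q3 revenue", passages=passages)
print(prompt)
```

Keeping prompt construction separate from the model call makes it easy to run the same inputs through an LLM and a small fine-tuned PLM, then hand both outputs to human spot-checkers.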
Risks & Boundaries
Limitations
Survey synthesizes existing studies; it does not provide new empirical experiments beyond aggregated tables.
ROUGE-centric quantitative comparisons have known blind spots for LLM outputs.
When Not To Use
Do not rely on LLM-only summarizers for high-stakes medical, legal, or regulatory outputs without human verification.
Avoid trusting automatic ROUGE-only comparisons when choosing an LLM summary pipeline.
Failure Modes
Hallucination: inventing facts not in source.
Omission: missing important mid-document content due to position bias.
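A cheap first-pass screen for one kind of hallucination is to flag numbers in a summary that never appear in the source. The sketch below is a heuristic assumed for illustration, not a method from the survey: it catches only numeric mismatches, misses paraphrased or entity-level errors, and does not replace human verification. The function name `unsupported_numbers` is hypothetical.

```python
import re

NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(source: str, summary: str) -> list[str]:
    """Return numbers that occur in the summary but nowhere in the source."""
    source_numbers = set(NUMBER_PATTERN.findall(source))
    return [n for n in NUMBER_PATTERN.findall(summary) if n not in source_numbers]


source = "The trial enrolled 120 patients and ran for 6 months."
summary = "The 8-month trial enrolled 120 patients."

flagged = unsupported_numbers(source, summary)
print(flagged)  # numbers to route to a human reviewer
```

Any flagged value is a candidate for human review rather than proof of an error, since a summary may legitimately restate a quantity in another form.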

