A concise, up-to-date roadmap of text summarization research before and during the LLM era

June 17, 2024 | 7 min

Overview

Decision Snapshot: Ready For Pilot

This survey compiles existing evidence and provides practical guidance; it is broadly useful but primarily synthesizes prior work rather than introducing new experiments.

Citations: 11

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 30%

Authors

Haopeng Zhang, Philip S. Yu, Jiawei Zhang

Why It Matters For Business

LLMs let teams deploy usable summaries quickly with zero/few-shot prompts, but hallucination and unreliable automatic metrics mean businesses must pair LLMs with retrieval, human checks, or smaller fine-tuned models for safety.

Summary TLDR

This 30-page survey traces text summarization from classic statistical methods through neural models and pre-trained language model (PLM) fine-tuning to the current LLM era. It catalogs datasets, evaluation metrics, pre-LLM methods (statistical, deep learning, PLMs), and LLM-era work (benchmarking, modeling, and evaluation). The authors highlight that LLMs enable strong zero- and few-shot summarization but expose gaps in evaluation, factuality, bias, and efficiency. The paper synthesizes trends, open challenges (hallucination, bias, compute, personalization, interpretability), and practical future directions (multimodal summarization, domain-specific LLMs, human-in-the-loop workflows).

Problem Statement

Summarization methods have evolved rapidly (statistical → deep learning → PLM fine-tuning → LLMs). This survey asks: what changed, how should modern summarizers be evaluated, what gaps do LLMs introduce, and which practical directions help researchers and engineers adapt?

Main Contribution

Comprehensive survey covering summarization across four paradigms: statistical, deep learning, PLM fine-tuning, and LLMs.

First focused taxonomy and organized review of LLM-based summarization literature (benchmarks, modeling, evaluation).

Key Findings

LLMs shift summarization to zero- and few-shot settings and often produce summaries that annotators prefer in human studies.

Numbers: Human studies report annotator preference for GPT-3/GPT-4 summaries (multiple papers cited)

Practical Use: Try off-the-shelf LLM prompts for quick prototyping, but verify quality with human checks because automatic metrics can disagree.

Evidence Ref: [69, 270, 172]

Automatic metrics like ROUGE can misjudge LLM outputs and understate human-level quality.

Numbers: The survey notes that ROUGE remains the standard metric but is often misaligned with human judgment; examples show LLMs scoring lower on ROUGE despite human preference.

Practical Use: Use human evaluation or LLM-based multi-aspect evaluators alongside ROUGE when comparing LLM summarizers.

Evidence Ref: Table 6; [69, 269]
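To make the ROUGE misalignment concrete, here is a minimal sketch of a simplified ROUGE-1 F1 score. It assumes whitespace tokenization and no stemming, so it is not the official ROUGE implementation, and the example sentences are invented for illustration. It shows how a faithful paraphrase can score far below a near-copy of the reference.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary.

    Simplified: lowercased whitespace tokens, no stemming or stopword handling.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences (not taken from the survey).
reference = "the central bank raised interest rates to curb inflation"
copy_like = "the central bank raised interest rates"
paraphrase = "policymakers increased borrowing costs to fight rising prices"

print(f"near-copy:  {rouge1_f1(copy_like, reference):.3f}")
print(f"paraphrase: {rouge1_f1(paraphrase, reference):.3f}")
```

The paraphrase shares only one token ("to") with the reference, so its score collapses even though a human reader might judge it an acceptable summary.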

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| ROUGE-1 (CNN/DM) | SimCLS 46.67; SliSum (Claude2) 47.75; BART 44.16; PEGASUS 44.17 | TextRank 33.2 (2004) | Modern PLMs & contrastive methods improve R1 by ≈ +11 to +14 vs early unsupervised | CNN/DM | Table 6 in paper lists per-model ROUGE scores | Table 6 |
| LLM zero/few-shot human preference | Human studies report annotator preference for GPT-3/GPT-4 summaries over baselines | Fine-tuned supervised systems (various) | Human preference despite lower automatic scores | News benchmarks (CNN/DM, XSum) | Benchmarking studies summarized in Section 4.1 | [69, 270, 172] |
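The Delta for the ROUGE-1 row can be checked directly from the quoted scores. The short sketch below subtracts the TextRank baseline from each model's ROUGE-1; the variable names are illustrative, and the numbers are the ones listed in the row.

```python
# ROUGE-1 scores on CNN/DM as quoted in the Results row.
TEXTRANK_R1 = 33.2
MODEL_R1 = {
    "BART": 44.16,
    "PEGASUS": 44.17,
    "SimCLS": 46.67,
    "SliSum (Claude2)": 47.75,
}

# Absolute gain in ROUGE-1 points over the early unsupervised baseline.
deltas = {name: round(score - TEXTRANK_R1, 2) for name, score in MODEL_R1.items()}

for name, delta in deltas.items():
    print(f"{name}: +{delta:.2f}")
```

The gains range from roughly 11 points (BART, PEGASUS) to about 14.5 points (SliSum with Claude2).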

What To Try In 7 Days

Run prompt-based zero-shot summaries with an LLM on your domain and compare with a small fine-tuned PLM using human spot-checks.

Add simple retrieval (RAG) to the prompt to reduce hallucinations for factual content.

Evaluate outputs with an LLM-based evaluator (GPTScore/G-Eval) and a small human panel to identify metric gaps.
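The first two steps can be prototyped without committing to a provider. The following is a minimal sketch, assuming a toy word-overlap retriever in place of a real retrieval system and an `llm` argument that is any prompt-to-text callable; `retrieve`, `build_prompt`, and `summarize` are hypothetical helper names, not part of any library.

```python
from typing import Callable, List


def retrieve(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Rank passages by word overlap with the query (toy stand-in for a real retriever)."""
    q = set(query.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(q & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(document: str, evidence: List[str]) -> str:
    """Assemble a zero-shot summarization prompt grounded in retrieved evidence."""
    facts = "\n".join(f"- {e}" for e in evidence)
    return (
        "Summarize the document in two sentences. "
        "Use only facts supported by the document or the evidence.\n\n"
        f"Evidence:\n{facts}\n\nDocument:\n{document}\n\nSummary:"
    )


def summarize(document: str, passages: List[str], llm: Callable[[str], str]) -> str:
    """`llm` is any prompt -> text callable; no specific provider or API is assumed."""
    return llm(build_prompt(document, retrieve(document, passages)))


# Usage with a stub model that just reports the prompt size.
knowledge = ["The bank raised rates in March.", "Cats sleep a lot."]
print(summarize("bank raised rates", knowledge, lambda prompt: f"[{len(prompt)} prompt chars]"))
```

Swapping the stub for a real model call and the overlap ranking for a proper retriever keeps the same structure, which makes it easy to compare outputs with and without the evidence block during human spot-checks.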

Agent Features

Tool Use: LLM evaluators; retrieval modules (RAG); knowledge extractors
Frameworks: iterative refinement (SummIt); self-evaluation agents
Architectures: multi-agent summarizers (summarizer + evaluator loop); prompt-chaining (draft/critique/refine); chain-of-thought prompting
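The summarizer + evaluator loop can be sketched as a small control structure. This is a generic illustration under stated assumptions, not the SummIt implementation: `summarize` and `critique` are hypothetical callables (in practice, two LLM prompts), and the toy versions below only exist so the loop can run.

```python
from typing import Callable, Optional


def refine_summary(
    document: str,
    summarize: Callable[[str, Optional[str], Optional[str]], str],
    critique: Callable[[str, str], Optional[str]],
    max_rounds: int = 3,
) -> str:
    """Draft a summary, then alternate critique and revision until the
    evaluator returns no feedback or the round budget is spent."""
    summary = summarize(document, None, None)  # initial draft
    for _ in range(max_rounds):
        feedback = critique(document, summary)
        if feedback is None:  # evaluator is satisfied
            break
        summary = summarize(document, summary, feedback)  # revise using feedback
    return summary


# Toy stand-ins: the "summarizer" truncates, the "evaluator" checks length only.
def toy_summarize(document: str, previous: Optional[str], feedback: Optional[str]) -> str:
    limit = 12 if previous is None else max(4, len(previous.split()) - 4)
    return " ".join(document.split()[:limit])


def toy_critique(document: str, summary: str) -> Optional[str]:
    return "too long" if len(summary.split()) > 6 else None


doc = " ".join(f"w{i}" for i in range(20))
print(refine_summary(doc, toy_summarize, toy_critique))
```

Capping the number of rounds matters in practice because each round costs at least two model calls, and an evaluator that is never satisfied would otherwise loop indefinitely.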

Optimization Features

Token Efficiency: chunking and hierarchical summarization to limit context
Infra Optimization: smaller LLMs vs large LLMs cost/quality trade-offs
Model Optimization: distillation to smaller models (InheritSumm, TriSum); LoRA
System Optimization: RAG to reduce hallucination and context length
Training Optimization: instruction fine-tuning; contrastive learning (SimCLS, BRIO); RL
Inference Optimization: prompt-chaining vs stepwise prompting trade-offs; sliding-window generation for long docs
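Chunking with hierarchical summarization can be sketched as follows. This is a minimal illustration, assuming word counts as a proxy for tokens and a `summarize` argument that is any text-to-text callable; `chunk_words` and `hierarchical_summarize` are hypothetical names. The optional `overlap` parameter produces overlapping windows, in the spirit of sliding-window processing of long documents.

```python
from typing import Callable, List


def chunk_words(text: str, size: int, overlap: int = 0) -> List[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # last window reached the end
            break
    return chunks


def hierarchical_summarize(text: str, summarize: Callable[[str], str], size: int = 200) -> str:
    """Summarize each chunk, merge the partial summaries, and repeat
    until the merged text fits in a single chunk."""
    while len(text.split()) > size:
        merged = " ".join(summarize(c) for c in chunk_words(text, size))
        if len(merged.split()) >= len(text.split()):  # no compression; stop to avoid looping forever
            break
        text = merged
    return summarize(text)


# Usage with a toy summarizer that keeps the first two words of its input.
first_two = lambda t: " ".join(t.split()[:2])
long_text = " ".join(f"w{i}" for i in range(10))
print(chunk_words("a b c d e f g", size=3, overlap=1))
print(hierarchical_summarize(long_text, first_two, size=4))
```

Each model call sees at most `size` words, which bounds context length at the cost of extra calls and the risk that details spanning chunk boundaries are lost; overlap reduces that risk but increases the number of chunks.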

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

public datasets listed (CNN/DM, XSum, PubMed, arXiv, MultiNews, QMSum, etc.)

Risks & Boundaries

Limitations

Survey synthesizes existing studies; it does not provide new empirical experiments beyond aggregated tables.

ROUGE-centric quantitative comparisons have known blind spots for LLM outputs.

When Not To Use

Do not rely on LLM-only summarizers for high-stakes medical, legal, or regulatory outputs without human verification.

Avoid trusting automatic ROUGE-only comparisons when choosing an LLM summary pipeline.

Failure Modes

Hallucination: inventing facts that are not in the source.

Omission: missing important mid-document content due to position bias.

Core Entities

Models

BERT, BART, T5, PEGASUS, LED, PRIMERA, GPT-3, GPT-3.5/ChatGPT, GPT-4, Claude-2, LLaMA/Llama 2, Flan-T5

Metrics

ROUGE, BERTScore, MoverScore, FactCC, SummaC, FEQA/QAGS (QA-based), GPTScore, G-Eval

Datasets

CNN/DM, XSum, NYT, NEWSROOM, Gigaword, CCSUM, WikiHow, Reddit/TIFU, SAMSum, PubMed, arXiv, MultiNews, QMSum, DUC, XL-Sum

Benchmarks

CNN/DM, XSum, MultiNews, QMSum, DUC, MiddleSum, AggreFact (factuality datasets)