A concise, up-to-date roadmap of text summarization research before and during the LLM era

Overview

Decision Snapshot: Ready For Pilot

This survey compiles existing evidence and provides practical guidance; it is broadly useful but primarily synthesizes prior work rather than introducing new experiments.

Citations: 11

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 30%

Authors

Haopeng Zhang, Philip S. Yu, Jiawei Zhang

Links

Abstract / PDF / Data

Why It Matters For Business

LLMs let teams deploy usable summaries quickly with zero/few-shot prompts, but hallucination and unreliable automatic metrics mean businesses must pair LLMs with retrieval, human checks, or smaller fine-tuned models for safety.

Who Should Care

Product Manager, CTO, ML Engineer, Data Scientist

Summary TLDR

This is a 30-page survey that traces text summarization from classic statistical methods through neural models and PLM fine-tuning to the current LLM era. It catalogs datasets, evaluation metrics, pre-LLM methods (statistical, deep learning, PLMs), and LLM-era work (benchmarking, modeling, and evaluation). The authors highlight that LLMs enable strong zero- and few-shot summarization but expose gaps in evaluation, factuality, bias, and efficiency. The paper synthesizes trends, open challenges (hallucination, bias, compute, personalization, interpretability), and practical future directions (multimodal, domain LLMs, human-in-the-loop).

Problem Statement

Summarization methods have evolved rapidly (statistical → deep learning → PLM fine-tuning → LLMs). This survey asks: what changed, how do we evaluate modern summarizers, what are gaps introduced by LLMs, and what practical directions help researchers and engineers adapt?

Main Contribution

Comprehensive survey covering summarization across four paradigms: statistical, deep learning, PLM fine-tuning, and LLMs.

First focused taxonomy and organized review of LLM-based summarization literature (benchmarks, modeling, evaluation).

Key Findings

LLMs shift summarization to zero- and few-shot settings and often produce human-preferred summaries in human studies.

Numbers: Human studies report annotator preference for GPT-3/GPT-4 summaries (multiple papers cited)

Practical Use: Try off-the-shelf LLM prompts for quick prototyping, but verify quality with human checks because automatic metrics can disagree.

Evidence Ref: [69, 270, 172]

Automatic metrics like ROUGE can misjudge LLM outputs and understate human-level quality.

Numbers: The survey notes that ROUGE remains the standard metric but is often misaligned with human judgment; cited examples show LLMs scoring lower on ROUGE despite human preference.

Practical Use: Use human evaluation or LLM-based multi-aspect evaluators alongside ROUGE when comparing LLM summarizers.

Evidence Ref: Table 6; [69, 269]
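
To make the ROUGE mismatch concrete, here is a minimal sketch of a simplified ROUGE-1 F1 (lowercased whitespace tokens, no stemming, which is an assumption; standard ROUGE toolkits tokenize and stem more carefully). The three sentences are invented for illustration and do not come from the survey.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1: a simplified ROUGE-1 (lowercase, whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentences (not from the survey).
reference = "the central bank raised interest rates to curb inflation"
extractive = "the central bank raised interest rates"
paraphrase = "policymakers increased borrowing costs to fight rising prices"

print(f"extractive: {rouge1_f1(extractive, reference):.3f}")
print(f"paraphrase: {rouge1_f1(paraphrase, reference):.3f}")
```

The paraphrase conveys the same event but shares only one word with the reference, so it scores about 0.12 against 0.80 for the copied fragment. Abstractive LLM summaries can be penalized in the same way.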

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ROUGE-1 (CNN/DM)	SimCLS 46.67; SliSum (Claude2) 47.75; BART 44.16; PEGASUS 44.17	TextRank 33.2 (2004)	Modern PLMs & contrastive methods improve R1 ≈ +11–14 vs early unsupervised	CNN/DM	Table 6 in paper lists per-model ROUGE scores	Table 6
LLM zero/few-shot human preference	Human studies report annotator preference for GPT-3/GPT-4 summaries over baselines	Fine-tuned supervised systems (various)	Human preference despite lower automatic scores	News benchmarks (CNN/DM, XSum)	Benchmarking studies summarized in Section 4.1	[69, 270, 172]
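
The delta in the first row can be reproduced with a few lines of arithmetic. This sketch only restates the ROUGE-1 values from the table; the variable names are illustrative.

```python
# ROUGE-1 on CNN/DM, copied from the Results table above.
baseline_name, baseline_r1 = "TextRank", 33.2
model_r1 = {
    "SimCLS": 46.67,
    "SliSum (Claude2)": 47.75,
    "BART": 44.16,
    "PEGASUS": 44.17,
}

# Absolute improvement over the baseline, rounded to two decimals.
deltas = {name: round(score - baseline_r1, 2) for name, score in model_r1.items()}

for name, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: +{delta:.2f} ROUGE-1 vs {baseline_name}")
```

The differences are +10.96 (BART), +10.97 (PEGASUS), +13.47 (SimCLS), and +14.55 (SliSum). The first three match the table's rounded range; the SliSum figure sits slightly above it.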

What To Try In 7 Days

Run prompt-based zero-shot summaries with an LLM on your domain and compare with a small fine-tuned PLM using human spot-checks.

Add simple retrieval (RAG) to the prompt to reduce hallucinations for factual content.

Evaluate outputs with an LLM-based evaluator (GPTScore/G-Eval) and a small human panel to identify metric gaps.
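
The retrieval step above can be prototyped without any infrastructure. The following is a minimal sketch, assuming a toy word-overlap retriever in place of a real search index or embedding model; the function names, prompt wording, and sample passages are all hypothetical, and the resulting prompt would be sent to whichever LLM you are testing.

```python
def overlap_score(query: str, passage: str) -> int:
    """Toy relevance: number of distinct query words that appear in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k highest-scoring passages; ties keep their input order."""
    return sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)[:k]


def build_grounded_prompt(focus: str, document: str, passages: list[str], k: int = 2) -> str:
    """Assemble a summarization prompt that asks the model to stay within the evidence."""
    evidence = "\n".join(f"- {p}" for p in retrieve(focus, passages, k))
    return (
        "Summarize the document below in two sentences.\n"
        "State only facts supported by the document or the evidence.\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Document:\n{document}\n"
    )


# Hypothetical knowledge-base passages and document.
passages = [
    "Refund requests took 12 days on average in March.",
    "The office moved to a new building.",
    "Refund delays dropped after the new ticketing tool launched.",
]
prompt = build_grounded_prompt("refund delays", "Quarterly support report text.", passages)
print(prompt)
```

Swapping `overlap_score` for a stronger retriever leaves the prompt assembly unchanged, which keeps the comparison against the fine-tuned PLM focused on one variable at a time.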

Agent Features

Tool Use

LLM evaluators, retrieval modules (RAG), knowledge extractors

Frameworks

Iterative refine (SummIt), self-evaluation agents

Architectures

multi-agent summarizers (summarizer + evaluator loop), prompt-chaining (draft/critique/refine), chain-of-thought prompting
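
A draft/critique/refine chain can be sketched as a short loop. This is an illustration under stated assumptions, not the implementation of SummIt or any other cited system: `llm` is any callable that maps a prompt string to a reply string, and the prompt wording and the "OK" stop signal are hypothetical.

```python
from typing import Callable


def refine_summary(document: str, llm: Callable[[str], str], max_rounds: int = 3) -> str:
    """Draft a summary, then alternate critique and revision until the critic replies OK."""
    summary = llm(f"Summarize the following text in two sentences:\n{document}")
    for _ in range(max_rounds):
        critique = llm(
            "List factual errors or missing key points in the summary, "
            "or reply OK if there are none.\n"
            f"Text:\n{document}\nSummary:\n{summary}"
        )
        if critique.strip().upper() == "OK":
            break  # the evaluator accepted the current summary
        summary = llm(
            "Revise the summary to address the critique.\n"
            f"Text:\n{document}\nSummary:\n{summary}\nCritique:\n{critique}"
        )
    return summary


# Scripted stand-in for a real model, used only to show the control flow.
scripted = iter(["Draft summary.", "Missing the launch date.", "Revised summary.", "OK"])
result = refine_summary("Some source text.", lambda prompt: next(scripted))
print(result)
```

The `max_rounds` cap bounds cost when the critic never returns the stop signal; each round adds two model calls.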

Optimization Features

Token Efficiency

chunking and hierarchical summarization to limit context
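
A minimal sketch of chunking plus hierarchical summarization, assuming word counts as a rough proxy for tokens and a caller-supplied `summarize` function (hypothetical; in practice an LLM or PLM call):

```python
from typing import Callable


def chunk_words(text: str, max_words: int) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def hierarchical_summarize(text: str, summarize: Callable[[str], str], max_words: int = 200) -> str:
    """Summarize each chunk, then recursively summarize the joined chunk summaries."""
    chunks = chunk_words(text, max_words)
    if len(chunks) <= 1:
        return summarize(text)  # fits in one context window
    partial = " ".join(summarize(chunk) for chunk in chunks)
    if len(partial.split()) >= len(text.split()):
        return summarize(partial)  # guard: summaries did not shrink, stop recursing
    return hierarchical_summarize(partial, summarize, max_words)


# Toy summarizer for demonstration: keep the first three words of its input.
first_three = lambda s: " ".join(s.split()[:3])
text = " ".join(f"w{i}" for i in range(10))
print(hierarchical_summarize(text, first_three, max_words=4))
```

Each model call sees at most `max_words` words, so context length stays bounded at the price of extra calls and possible loss of cross-chunk detail.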

Infra Optimization

smaller LLMs vs large LLMs cost/quality trade-offs

Model Optimization

distillation to smaller models (InheritSumm, TriSum), LoRA

System Optimization

RAG to reduce hallucination and context length

Training Optimization

instruction fine-tuning, contrastive learning (SimCLS, BRIO), RL

Inference Optimization

prompt-chaining vs stepwise prompting trade-offs; sliding-window generation for long docs
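
One way to read sliding-window generation is as a running summary updated over overlapping windows. The sketch below assumes word-level windows and a caller-supplied `llm` callable; it is not the procedure of any specific cited system, and the prompt wording is hypothetical.

```python
from typing import Callable


def sliding_windows(words: list, size: int, stride: int) -> list[list]:
    """Return overlapping windows of `size` items, advancing by `stride` each step."""
    if not 0 < stride <= size:
        raise ValueError("need 0 < stride <= size so that no words are skipped")
    if not words:
        return []
    windows, start = [], 0
    while True:
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window reached the end of the document
        start += stride
    return windows


def summarize_long(text: str, llm: Callable[[str], str], size: int = 300, stride: int = 200) -> str:
    """Carry a running summary forward while reading the document window by window."""
    summary = ""
    for window in sliding_windows(text.split(), size, stride):
        passage = " ".join(window)
        summary = llm(
            f"Summary so far:\n{summary}\n\nNext passage:\n{passage}\n\n"
            "Update the summary to cover the new passage."
        )
    return summary


print(sliding_windows(list(range(10)), size=4, stride=2))
```

The overlap (`size - stride` words) gives each call some context from the previous window, which may help with content that straddles a boundary, at the cost of processing those words twice.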

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

public datasets listed (CNN/DM, XSum, PubMed, arXiv, MultiNews, QMSum, etc.)

Risks & Boundaries

Limitations

Survey synthesizes existing studies; it does not provide new empirical experiments beyond aggregated tables.

ROUGE-centric quantitative comparisons have known blind spots for LLM outputs.

When Not To Use

Do not rely on LLM-only summarizers for high-stakes medical, legal, or regulatory outputs without human verification.

Avoid trusting automatic ROUGE-only comparisons when choosing an LLM summary pipeline.

Failure Modes

Hallucination: inventing facts not in source.

Omission: missing important mid-document content due to position bias.

Core Entities

Models

BERT, BART, T5, PEGASUS, LED, PRIMERA, GPT-3, GPT-3.5/ChatGPT, GPT-4, Claude-2, LLaMa/Llama2, Flan-T5

Metrics

ROUGE, BERTScore, MoverScore, FactCC, SummaC, FEQA/QAGS (QA-based), GPTScore, G-Eval

Datasets

CNN/DM, XSum, NYT, NEWSROOM, Gigaword, CCSUM, WikiHow, Reddit/TIFU, SAMSum, PubMed, arXiv, MultiNews, QMSum, DUC, XL-Sum

Benchmarks

CNN/DM, XSum, MultiNews, QMSum, DUC, MiddleSum, AggreFact (factuality datasets)

