A concise, up-to-date roadmap of text summarization research before and during the LLM era

June 17, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.3

Cost Impact Score

0.8

Citation Count

11

Authors

Haopeng Zhang, Philip S. Yu, Jiawei Zhang

Links

Abstract / PDF

Why It Matters For Business

LLMs let teams deploy usable summaries quickly with zero/few-shot prompts, but hallucination and unreliable automatic metrics mean businesses must pair LLMs with retrieval, human checks, or smaller fine-tuned models for safety.

Summary TLDR

This 30-page survey traces text summarization from classic statistical methods through neural models and fine-tuning of pre-trained language models (PLMs) to the current LLM era. It catalogs datasets, evaluation metrics, pre-LLM methods (statistical, deep learning, PLMs), and LLM-era work (benchmarking, modeling, and evaluation). The authors highlight that LLMs enable strong zero- and few-shot summarization but expose gaps in evaluation, factuality, bias, and efficiency. The paper synthesizes trends, open challenges (hallucination, bias, compute, personalization, interpretability), and practical future directions (multimodal summarization, domain-specific LLMs, human-in-the-loop workflows).

Problem Statement

Summarization methods have evolved rapidly (statistical → deep learning → PLM fine-tuning → LLMs). This survey asks: what changed, how do we evaluate modern summarizers, what are gaps introduced by LLMs, and what practical directions help researchers and engineers adapt?

Main Contribution

Comprehensive survey covering summarization across four paradigms: statistical, deep learning, PLM fine-tuning, and LLMs.

First focused taxonomy and organized review of LLM-based summarization literature (benchmarks, modeling, evaluation).

Curated tables of datasets, metrics, representative methods, and CNN/DM quantitative comparisons plus discussion of open challenges and future directions.

Key Findings

LLMs shift summarization to zero- and few-shot settings, and annotators in human studies often prefer their summaries.

Numbers: Human studies report annotator preference for GPT-3/GPT-4 summaries (multiple papers cited)

Automatic metrics like ROUGE can misjudge LLM outputs and understate human-level quality.

Numbers: The survey notes that ROUGE remains the standard metric but is often misaligned with human judgment; examples show LLMs scoring lower on ROUGE despite human preference.
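To make the ROUGE mismatch concrete, here is a minimal sketch of ROUGE-1 F1 as clipped unigram overlap. It is a simplified stand-in (lowercasing, no stemming, single reference), not the official ROUGE toolkit, and the example sentences are invented for illustration. It shows how a faithful paraphrase can score far below a near-copy of the reference.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric runs (assumption: no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and one reference summary."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences, for illustration only.
reference = "The central bank raised interest rates by half a point."
near_copy = "The central bank raised interest rates by half a point on Tuesday."
paraphrase = "Policymakers lifted borrowing costs by 50 basis points."

print(round(rouge1_f1(near_copy, reference), 3))
print(round(rouge1_f1(paraphrase, reference), 3))
```

The paraphrase conveys the same fact but shares almost no surface tokens with the reference, which is the kind of case where abstractive LLM output is penalized.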

Factuality and hallucination remain major open problems for LLM summarization.

Numbers: Multiple factuality benchmarks report poor LLM inconsistency detection; some models perform near chance on detecting un/

Position bias is persistent: models favor the start and end of the input and neglect content in the middle.

Numbers: Studies report U-shaped utilization with neglect of middle segments; position bias observed across datasets
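One rough way to check for this pattern on your own data is to align each summary sentence to its most similar source sentence and count which region of the document it came from. The sketch below uses Jaccard token overlap as the alignment signal; that choice, the three-bin split, and the example sentences are assumptions for illustration, not the protocol used in the cited studies.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


def position_profile(source_sents: list[str], summary_sents: list[str], bins: int = 3) -> list[int]:
    """Count summary sentences aligned to each region (bin) of the source."""
    counts = [0] * bins
    for sent in summary_sents:
        scores = [jaccard(tokens(sent), tokens(src)) for src in source_sents]
        if not scores or max(scores) == 0.0:
            continue  # no lexical match; skip rather than guess a position
        best = scores.index(max(scores))
        counts[best * bins // len(source_sents)] += 1
    return counts


# Invented six-sentence document and two-sentence summary.
source = [
    "The city council approved the new budget.",
    "Members debated for three hours.",
    "A local bakery catered the meeting.",
    "Parking fees will fund road repairs.",
    "Critics questioned the timeline.",
    "The budget takes effect in July.",
]
summary = ["The council approved the budget.", "It takes effect in July."]

print(position_profile(source, summary))  # counts for [start, middle, end]
```

A profile that is consistently heavy at the first and last bins across many documents would be in line with the U-shaped usage described above.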

PLM fine-tuning and PLM-specific pretraining delivered steady ROUGE gains on CNN/DM before LLMs.

Numbers: ROUGE-1 on CNN/DM (Table 6): BART 44.16; PEGASUS 44.17; SimCLS 46.67

Results

ROUGE-1 (CNN/DM)

Value: SimCLS 46.67; SliSum (Claude-2) 47.75; BART 44.16; PEGASUS 44.17

Baseline: TextRank 33.2 (2004)

LLM zero/few-shot human preference

Value: Human studies report annotator preference for GPT-3/GPT-4 summaries over baselines

Baseline: Fine-tuned supervised systems (various)

Position bias (coverage pattern)

Value: U-shaped usage: models favor start and end, ignore middle

Who Should Care

What To Try In 7 Days

Run prompt-based zero-shot summaries with an LLM on your domain and compare with a small fine-tuned PLM using human spot-checks.

Add simple retrieval (RAG) to the prompt to reduce hallucinations for factual content.

Evaluate outputs with an LLM-based evaluator (GPTScore/G-Eval) and a small human panel to identify metric gaps.
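For the retrieval step, a first experiment does not need a vector database. The sketch below ranks passages by bag-of-words cosine similarity and places the top matches in the prompt. The function names, the prompt wording, and the example passages are all hypothetical; a real setup would likely swap in an embedding model and send the prompt to whichever LLM you use.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector over lowercased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the query."""
    q = bow(query)
    return sorted(passages, key=lambda p: cosine(q, bow(p)), reverse=True)[:k]


def build_prompt(query: str, passages: list[str], k: int = 2) -> str:
    """Assemble a grounded prompt (wording is an illustrative assumption)."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(retrieve(query, passages, k)))
    return (
        "Summarize the answer using only the passages below.\n"
        f"Passages:\n{context}\n\nQuestion: {query}\nSummary:"
    )


# Invented passages for illustration.
passages = [
    "The warranty covers battery defects for two years.",
    "Shipping takes five business days.",
    "Battery replacement is free under warranty.",
]
print(build_prompt("battery warranty", passages))
```

Restricting the model to retrieved passages gives human reviewers a small, checkable evidence set for each summary.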

Agent Features

Tool Use

  • LLM evaluators
  • retrieval modules (RAG)
  • knowledge extractors

Frameworks

  • Iterative refinement (SummIt)
  • self-evaluation agents

Architectures

  • multi-agent summarizers (summarizer + evaluator loop)
  • prompt-chaining (draft/critique/refine)
  • chain-of-thought prompting
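The draft/critique/refine pattern can be sketched as a short loop around any text-in, text-out model call. Here `llm` is a hypothetical callable supplied by the caller, and the prompt wording and the "OK" stop signal are assumptions for illustration; this is not the SummIt implementation.

```python
from typing import Callable


def refine_summary(document: str, llm: Callable[[str], str], max_rounds: int = 3) -> str:
    """Draft a summary, then alternate critique and revision until accepted."""
    summary = llm(f"Summarize the following text:\n{document}")
    for _ in range(max_rounds):
        critique = llm(
            "Critique this summary against the text. "
            "Reply with exactly OK if nothing needs to change.\n\n"
            f"Text:\n{document}\n\nSummary:\n{summary}"
        )
        if critique.strip().upper() == "OK":
            break  # evaluator accepted the current summary
        summary = llm(
            "Revise the summary to address the critique.\n\n"
            f"Text:\n{document}\n\nSummary:\n{summary}\n\nCritique:\n{critique}"
        )
    return summary


def demo_llm(prompt: str) -> str:
    """Stand-in for a real model so the loop can run offline."""
    if prompt.startswith("Summarize"):
        return "draft"
    if prompt.startswith("Critique"):
        return "OK" if "revised" in prompt else "too vague"
    return "revised"


print(refine_summary("some text", demo_llm))
```

The `max_rounds` cap bounds cost when the evaluator never returns the stop signal.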

Optimization Features

Token Efficiency

  • chunking and hierarchical summarization to limit context
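Chunking with hierarchical summarization can be sketched as: split the text into pieces that fit the context budget, summarize each piece, join the partial summaries, and repeat until the result fits in one call. `summarize` below is a hypothetical callable (for example, a wrapper around an LLM), and counting words instead of model tokens is a simplifying assumption.

```python
from typing import Callable


def chunk_words(text: str, max_words: int) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def hierarchical_summarize(text: str, summarize: Callable[[str], str], max_words: int = 200) -> str:
    """Summarize chunk by chunk, then summarize the joined partial summaries."""
    while len(text.split()) > max_words:
        merged = " ".join(summarize(chunk) for chunk in chunk_words(text, max_words))
        if len(merged.split()) >= len(text.split()):
            break  # summarizer did not shrink the text; stop to avoid looping forever
        text = merged
    return summarize(text)


def first_five_words(chunk: str) -> str:
    """Toy stand-in for a model: keep the first five words of each chunk."""
    return " ".join(chunk.split()[:5])


long_text = " ".join(f"w{i}" for i in range(1000))
print(hierarchical_summarize(long_text, first_five_words, max_words=100))
```

Each level multiplies the number of model calls, so the chunk size is the main lever for trading cost against how much context each call sees.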

Infra Optimization

  • cost/quality trade-offs between smaller and larger LLMs

Model Optimization

  • distillation to smaller models (InheritSumm, TriSum)
  • LoRA

System Optimization

  • RAG to reduce hallucination and context length

Training Optimization

  • instruction fine-tuning
  • contrastive learning (SimCLS, BRIO)
  • reinforcement learning (RL)

Inference Optimization

  • prompt-chaining vs stepwise prompting trade-offs
  • sliding-window generation for long docs
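The windowing half of sliding-window generation can be sketched as overlapping spans over the document's sentences, so that content near a boundary appears in more than one window. The `size` and `stride` parameters are illustrative assumptions; how the per-window outputs are merged afterwards (SliSum, for instance, aggregates across windows) is not shown here.

```python
def sliding_windows(sentences: list[str], size: int, stride: int) -> list[list[str]]:
    """Return overlapping windows of sentences that cover the whole list."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if not sentences:
        return []
    windows, start = [], 0
    while True:
        windows.append(sentences[start:start + size])
        if start + size >= len(sentences):
            break  # the last window reached the end of the document
        start += stride
    return windows


# Letters stand in for sentences.
for window in sliding_windows(list("abcdefg"), size=3, stride=2):
    print(window)
```

A stride smaller than the window size gives overlap; a stride larger than the window size would skip sentences, so it is usually kept at or below `size`.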

Reproducibility

Data URLs

  • public datasets listed (CNN/DM, XSum, PubMed, arXiv, MultiNews, QMSum, etc.)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Survey synthesizes existing studies; it does not provide new empirical experiments beyond aggregated tables.
  • ROUGE-centric quantitative comparisons have known blind spots for LLM outputs.
  • Coverage of very recent closed-source LLM internals and proprietary benchmarks is limited by available literature.

When Not To Use

  • Do not rely on LLM-only summarizers for high-stakes medical, legal, or regulatory outputs without human verification.
  • Avoid trusting automatic ROUGE-only comparisons when choosing an LLM summary pipeline.

Failure Modes

  • Hallucination: inventing facts not in source.
  • Omission: missing important mid-document content due to position bias.
  • Bias: underrepresenting perspectives or amplifying dataset biases.
  • Verbosity mismatch: overly long output when concise summaries are required.
  • Metric mismatch: automatic scores diverge from human judgment.
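As a cheap first screen for the hallucination failure mode, one can flag numbers that appear in a summary but nowhere in the source. This is a crude heuristic written for illustration, not a substitute for trained factuality metrics such as FactCC or SummaC; it misses non-numeric errors and will flag legitimate reformatting (for example, "50%" versus "half").

```python
import re


def numbers_in(text: str) -> set[str]:
    """Extract digit sequences, keeping internal separators like 4.5 or 1,200."""
    return set(re.findall(r"\d+(?:[.,]\d+)*", text))


def unsupported_numbers(source: str, summary: str) -> list[str]:
    """Numbers present in the summary but absent from the source."""
    return sorted(numbers_in(summary) - numbers_in(source))


# Invented example for illustration.
source = "Revenue rose 12% to 4.5 million in 2023."
summary = "Revenue rose 15% to 4.5 million in 2023."
print(unsupported_numbers(source, summary))
```

Flagged summaries can be routed to human review first, which keeps the manual checking budget focused on the riskiest outputs.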

Core Entities

Models

  • BERT
  • BART
  • T5
  • PEGASUS
  • LED
  • PRIMERA
  • GPT-3
  • GPT-3.5/ChatGPT
  • GPT-4
  • Claude-2
  • LLaMA/Llama 2
  • Flan-T5

Metrics

  • ROUGE
  • BERTScore
  • MoverScore
  • FactCC
  • SummaC
  • FEQA/QAGS (QA-based)
  • GPTScore
  • G-Eval

Datasets

  • CNN/DM
  • XSum
  • NYT
  • NEWSROOM
  • Gigaword
  • CCSUM
  • WikiHow
  • Reddit/TIFU
  • SAMSum
  • PubMed
  • arXiv
  • MultiNews
  • QMSum
  • DUC
  • XL-Sum

Benchmarks

  • CNN/DM
  • XSum
  • MultiNews
  • QMSum
  • DUC
  • MiddleSum
  • AggreFact (factuality datasets)