A practical review of how PLMs and LLMs drive biomedical text summarization and where they still fail

Overview

Decision Snapshot: Needs Validation

The survey compiles many published comparisons and provides tables of metric numbers, but most evaluation relies on ROUGE and limited factuality metrics; real-world readiness requires extra factual checks and more datasets.

Citations: 9

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Qianqian Xie, Zheheng Luo, Benyou Wang, Sophia Ananiadou

Why It Matters For Business

Automated summarization can cut clinician time and speed literature review, but current models still make factual errors; businesses should combine domain-adapted PLMs or LLM prompting with verification steps before clinical use.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

This paper surveys recent methods, datasets, and evaluations for biomedical text summarization (BTS). It compares three ways of using pre-trained language models (PLMs)—feature-based, fine-tuning, and domain-adaptation—then reviews early uses of large language models (LLMs) for zero/few-shot and data augmentation. Key takeaways: PLMs and domain-adapted PLMs boost standard metrics; LLMs show strong zero-shot promise on some clinical tasks; major gaps remain in dataset coverage, long-input handling, and factual consistency of generated summaries.

Problem Statement

Biomedical text is growing fast (papers, EHRs, conversations), and clinicians need concise, accurate summaries. Before this paper, BTS research lacked a focused review of methods built on modern PLMs and LLMs, and of the evaluation and factuality issues specific to the biomedical domain.

Main Contribution

Systematic review of BTS datasets, methods, and evaluation metrics that use PLMs and LLMs.

Taxonomy of how PLMs and LLMs are used. For PLMs: feature-based, fine-tuning, and domain adaptation. For LLMs: data augmentation, zero-shot prompting, and domain adaptation.

Key Findings

Domain-adapted PLMs give the best extractive results on PubMed.

Numbers: PubMed-short ROUGE-1: KeBioSum 43.98 vs TextRank 38.15

Practical Use: If you summarize long biomedical papers, fine-tune a domain-adapted PLM (PubMedBERT) rather than use off-the-shelf unsupervised methods.

Evidence Ref: Table 5

LLMs can match or beat supervised systems in some zero-shot radiology summarization tests.

Numbers: OpenI ROUGE-1: ImpressionGPT (ChatGPT) 66.37 vs FactReranker 66.11

Practical Use: Try prompting an LLM (ChatGPT/GPT-3 family) before heavy supervised training for radiology impressions to get strong baseline summaries quickly.

Evidence Ref: Table 8
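
As a rough illustration of the prompting approach in this finding, the sketch below assembles a zero-shot (or optionally few-shot) prompt for radiology impression generation. The function name `build_impression_prompt` and the instruction wording are assumptions made for this example; they do not reproduce ImpressionGPT or any prompt reported in the survey, and the call to an actual LLM API is deliberately left out.

```python
def build_impression_prompt(findings, examples=None):
    """Assemble a zero-shot or few-shot prompt for impression generation.

    The instruction text is an illustrative assumption, not a prompt
    taken from the survey or from ImpressionGPT.
    """
    parts = [
        "You are assisting a radiologist. Summarize the FINDINGS into a "
        "concise IMPRESSION. Use only facts stated in the findings."
    ]
    # Optional few-shot examples: (findings, impression) pairs.
    for ex_findings, ex_impression in examples or []:
        parts.append(
            f"FINDINGS: {ex_findings.strip()}\nIMPRESSION: {ex_impression.strip()}"
        )
    # The report to summarize goes last, with the answer slot left open.
    parts.append(f"FINDINGS: {findings.strip()}\nIMPRESSION:")
    return "\n\n".join(parts)


prompt = build_impression_prompt(
    "Heart size is normal. No focal consolidation, effusion, or pneumothorax."
)
print(prompt)
```

The returned string can be sent to whichever LLM endpoint is available; passing a few `(findings, impression)` pairs through `examples` turns the same template into a few-shot prompt.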

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
PubMed-short ROUGE-1	43.98	TextRank 38.15	+5.83	PubMed-short	Table 5: KeBioSum (domain-adapted PubMedBERT) vs TextRank	Table 5
OpenI ROUGE-1	66.37	FactReranker 66.11	+0.26	OpenI (radiology)	Table 8: ImpressionGPT (ChatGPT zero-shot) vs supervised FactReranker	Table 8
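
To make the ROUGE-1 numbers above more concrete, here is a minimal sketch of unigram-overlap F1 plus the delta arithmetic from the table. It assumes lowercase whitespace tokenization and no stemming, so it will not exactly match the official ROUGE toolkit or the scores reported in the survey; the example sentences are invented for illustration.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example pair: 3 of 4 unigrams overlap in each direction.
score = rouge1_f1(
    "no acute cardiopulmonary disease",
    "no acute cardiopulmonary abnormality",
)

# Deltas from the results table (ROUGE-1 points).
delta_pubmed = round(43.98 - 38.15, 2)
delta_openi = round(66.37 - 66.11, 2)
print(score, delta_pubmed, delta_openi)
```

The OpenI gap is a fraction of a ROUGE point, which is one reason the page recommends pairing ROUGE with factuality checks rather than reading small deltas as clear wins.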

What To Try In 7 Days

Prompt an LLM (ChatGPT/GPT-3) on a small radiology set to get a quick zero-shot baseline.

Fine-tune a domain-adapted PLM (PubMedBERT or BioBART) on a small labeled subset and compare ROUGE and factual checks.

Run a simple factuality check (CheXbert or labeler) on generated radiology impressions and flag mismatches for human review.
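
The third step above can be sketched as a label comparison between the source findings and the generated impression. The keyword table and sentence-level negation rule below are crude, hypothetical stand-ins for a real labeler such as CheXbert (whose interface is not shown here); the point is only the flag-for-review logic, not labeling accuracy.

```python
import re

# Hypothetical keyword lists standing in for a real labeler such as CheXbert.
FINDING_KEYWORDS = {
    "pneumothorax": ["pneumothorax"],
    "effusion": ["effusion"],
    "consolidation": ["consolidation"],
    "cardiomegaly": ["cardiomegaly", "enlarged heart"],
}
# Crude assumption: a negation cue negates every finding in the sentence.
NEGATION = re.compile(r"\b(no|without|negative for)\b")


def label_text(text):
    """Return {finding: 'positive' | 'negative'} for findings mentioned in text."""
    labels = {}
    for sentence in re.split(r"[.;]\s*", text.lower()):
        negated = bool(NEGATION.search(sentence))
        for finding, keywords in FINDING_KEYWORDS.items():
            if any(keyword in sentence for keyword in keywords):
                labels[finding] = "negative" if negated else "positive"
    return labels


def flag_mismatches(source, summary):
    """List (finding, source_status, summary_status) tuples needing human review."""
    src, summ = label_text(source), label_text(summary)
    flags = []
    for finding, status in summ.items():
        if finding not in src:
            flags.append((finding, "not in source", status))
        elif src[finding] != status:
            flags.append((finding, src[finding], status))
    return flags


flags = flag_mismatches(
    source="Small left pleural effusion. No pneumothorax.",
    summary="Left pneumothorax with small effusion.",
)
print(flags)
```

Any non-empty result would route the summary to a human reviewer. Swapping `label_text` for a trained labeler keeps `flag_mismatches` unchanged.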

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/KenZLuo/Biomedical-TextSummarization-Survey/tree/master

Data URLs

https://github.com/armancohan/long-summarization (PubMed variants)

https://github.com/allenai/cord19 (CORD-19)

https://physionet.org/content/mimic-cxr/2.0.0/ (MIMIC-CXR)

https://openi.nlm.nih.gov/ (OpenI)

Risks & Boundaries

Limitations

Public datasets skew toward biomedical literature; EHRs and conversations are limited and small.

PLMs and LLMs struggle with very long documents; many PLMs truncate input at about 512 tokens, so content beyond that limit is never seen by the model.
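
One common workaround for the truncation limit, shown here only as a sketch, is to split a long document into overlapping windows and process each window separately. This is not a method attributed to the survey (which lists long-input models such as Longformer/LED); the function name `chunk_tokens`, the stride of 384, and the use of pre-split tokens are assumptions for illustration, and a real pipeline would count subword tokens from the model's own tokenizer.

```python
def chunk_tokens(tokens, max_len=512, stride=384):
    """Split a token list into overlapping windows of at most max_len tokens.

    Consecutive windows start `stride` tokens apart, so they share
    max_len - stride tokens of context.
    """
    if not 0 < stride <= max_len:
        raise ValueError("stride must be in the range 1..max_len")
    if not tokens:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return chunks


# Toy document of 1200 placeholder tokens.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])
```

Each window can then be summarized or scored on its own and the outputs merged; how to merge them well is a design choice this sketch leaves open.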

When Not To Use

Do not deploy abstractive summaries for clinical decisions without human verification.

Avoid relying solely on ROUGE/BERTScore to judge clinical summary quality.

Failure Modes

Hallucinated facts or incorrect direction of effect in study summaries.

Missing important content due to input truncation.

Core Entities

Models

BERT, BioBERT, PubMedBERT, RoBERTa, SciBERT, BART, T5, PEGASUS, Longformer/LED, GPT-2, GPT-3, ChatGPT, RadBERT, BioBART, CLIN-T5

Metrics

ROUGE, BERTScore, ΔEI, factual F1, CheXbert, readability indices (Flesch, Gunning fog, Coleman-Liau)
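
Of the readability indices listed, Coleman-Liau is the simplest to compute because it needs only letter, word, and sentence counts. The sketch below uses the standard formula 0.0588·L − 0.296·S − 15.8, where L is letters per 100 words and S is sentences per 100 words; the regex-based word and sentence splitting is a simplifying assumption, and the example sentences are invented.

```python
import re


def coleman_liau_index(text):
    """Coleman-Liau index using simple regex word and sentence splitting."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    letters = sum(len(word) for word in words)
    letters_per_100 = letters / len(words) * 100
    sentences_per_100 = len(sentences) / len(words) * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


plain = coleman_liau_index("The heart is normal. The lungs are clear.")
technical = coleman_liau_index(
    "Bilateral perihilar opacities suggest pulmonary edema."
)
print(round(plain, 2), round(technical, 2))
```

Higher values correspond to text that needs a higher reading grade, so a lay summary would be expected to score lower than the technical source it is derived from.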

Datasets

PubMed, SumPubMed, S2ORC, CORD-19, PubMedCite, CDSR, PLOS, RCT, MS^2, MIMIC-CXR, OpenI, HET-MC, MeQSum, CHQ-Summ

Benchmarks

MEDIQA 2021 shared task, MS^2
