Overview
The survey compiles many published comparisons and tabulates their reported metrics, but most evaluations rely on ROUGE and a limited set of factuality metrics; real-world readiness requires additional factual checks and broader dataset coverage.
Citations: 9
Evidence Strength: 0.75
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 50%
Why It Matters For Business
Automated summarization can cut clinician time and speed literature review, but current models still make factual errors; businesses should combine domain-adapted PLMs or LLM prompting with verification steps before clinical use.
Summary TLDR
This paper surveys recent methods, datasets, and evaluations for biomedical text summarization (BTS). It compares three ways of using pre-trained language models (PLMs)—feature-based, fine-tuning, and domain-adaptation—then reviews early uses of large language models (LLMs) for zero/few-shot and data augmentation. Key takeaways: PLMs and domain-adapted PLMs boost standard metrics; LLMs show strong zero-shot promise on some clinical tasks; major gaps remain in dataset coverage, long-input handling, and factual consistency of generated summaries.
Problem Statement
Biomedical text is growing fast (papers, EHRs, conversations) and clinicians need concise, accurate summaries. Existing BTS research before this paper lacked a focused review of methods based on modern PLMs and LLMs and of their special evaluation and factuality issues in the biomedical domain.
Main Contribution
Systematic review of BTS datasets, methods, and evaluation metrics that use PLMs and LLMs.
Taxonomy of how PLMs and LLMs are used: feature-based, fine-tuning, and domain adaptation for PLMs; data augmentation, zero-shot prompting, and domain adaptation for LLMs.
Key Findings
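Domain-adapted PLMs give the best extractive results on PubMed (KeBioSum reaches ROUGE-1 43.98 on PubMed-short vs. 38.15 for TextRank).
In some zero-shot radiology summarization tests, LLMs match or beat supervised systems (ImpressionGPT reaches ROUGE-1 66.37 on OpenI vs. 66.11 for FactReranker).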
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| PubMed-short ROUGE-1 | 43.98 | TextRank 38.15 | +5.83 | PubMed-short | Table 5: KeBioSum (domain-adapted PubMedBERT) vs TextRank | Table 5 |
| OpenI ROUGE-1 | 66.37 | FactReranker 66.11 | +0.26 | OpenI (radiology) | Table 8: ImpressionGPT (ChatGPT zero-shot) vs supervised FactReranker | Table 8 |
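The Delta column is the absolute difference in ROUGE-1 points between the reported value and its baseline. A minimal sketch that recomputes it from the two rows above; the `rows` list and the `delta` helper are illustrative names, not part of the survey:

```python
# Recompute the Delta column as Value - Baseline for the two table rows.
# The numbers are copied from the Results table; names are illustrative.
rows = [
    ("PubMed-short ROUGE-1", 43.98, 38.15),
    ("OpenI ROUGE-1", 66.37, 66.11),
]


def delta(value: float, baseline: float) -> float:
    """Absolute improvement in ROUGE points, rounded to two decimals."""
    return round(value - baseline, 2)


deltas = {name: delta(value, baseline) for name, value, baseline in rows}

for name, d in deltas.items():
    print(f"{name}: {d:+.2f}")
```

The OpenI gap is small in absolute terms, so it supports "match" more strongly than "beat".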
What To Try In 7 Days
Prompt an LLM (ChatGPT/GPT-3) on a small radiology set to get a quick zero-shot baseline.
Fine-tune a domain-adapted PLM (PubMedBERT or BioBART) on a small labeled subset and compare ROUGE and factual checks.
Run a simple factuality check (CheXbert or a similar labeler) on generated radiology impressions and flag mismatches for human review.
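The factuality check in the last step can be sketched as a label comparison: extract findings from the reference and the generated impression, then flag any finding whose status differs. The keyword-and-negation labeler below is a toy stand-in written for illustration; it is not CheXbert and does not use its API, and the finding list, negation cues, and example sentences are all assumptions.

```python
from __future__ import annotations

# Hypothetical finding vocabulary and negation cues (illustration only).
FINDINGS = ("pneumonia", "effusion", "pneumothorax", "edema")
NEGATIONS = ("no ", "without ", "negative for ")


def label_findings(text: str) -> dict[str, bool]:
    """Map each finding to True (asserted present) or False (absent or negated)."""
    labels: dict[str, bool] = {}
    for sentence in text.lower().split("."):
        for finding in FINDINGS:
            if finding in sentence:
                prefix = sentence[: sentence.index(finding)]
                negated = any(cue in prefix for cue in NEGATIONS)
                labels[finding] = not negated
    return {finding: labels.get(finding, False) for finding in FINDINGS}


def flag_mismatches(reference: str, generated: str) -> list[str]:
    """Return findings whose status differs between reference and generated text."""
    ref = label_findings(reference)
    gen = label_findings(generated)
    return [finding for finding in FINDINGS if ref[finding] != gen[finding]]


# Invented example: the generated impression drops a positive finding.
reference = "Small left pleural effusion. No pneumothorax."
generated = "No pleural effusion. No pneumothorax."

mismatches = flag_mismatches(reference, generated)
needs_review = bool(mismatches)
print(mismatches, needs_review)
```

In practice the toy labeler would be swapped for a trained one; the comparison and the human-review flag stay the same.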
Risks & Boundaries
Limitations
Public datasets skew toward biomedical literature; EHRs and conversations are limited and small.
PLMs/LLMs struggle with very long documents because many models, BERT-style encoders in particular, truncate input at around 512 tokens.
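To see what a 512-token cap means for a long document, the sketch below truncates a synthetic 3,000-token input and measures how much survives. Whitespace-separated words stand in for subword tokens here, which is an assumption; real tokenizers produce more tokens per word, so actual coverage would be lower. The document and helper names are invented for illustration.

```python
from __future__ import annotations


def truncate(tokens: list[str], max_tokens: int = 512) -> list[str]:
    """Keep only the first max_tokens tokens, as a head-truncating model would."""
    return tokens[:max_tokens]


def coverage(tokens: list[str], max_tokens: int = 512) -> float:
    """Fraction of the input that fits within the token limit."""
    if not tokens:
        return 1.0
    return min(len(tokens), max_tokens) / len(tokens)


# Synthetic document: the key result sits near the end, as conclusions often do.
document = (
    ["background"] * 2900
    + ["conclusion:", "treatment", "reduced", "mortality"]
    + ["reference"] * 96
)

kept = truncate(document)
retained = coverage(document)
conclusion_survives = "mortality" in kept

print(f"kept {len(kept)} of {len(document)} tokens ({retained:.1%})")
print("conclusion survives truncation:", conclusion_survives)
```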
When Not To Use
Do not deploy abstractive summaries for clinical decisions without human verification.
Avoid relying solely on ROUGE/BERTScore to judge clinical summary quality.
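A small worked example shows why overlap metrics alone are not enough. The function below is a simplified ROUGE-1 F1 (lowercased whitespace tokens, no stemming), not the official ROUGE toolkit, and the three sentences are invented for illustration. A candidate that reverses the direction of effect scores higher than a faithful paraphrase, because it shares more unigrams with the reference.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped counts of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the drug reduced mortality in treated patients"
faithful = "the drug lowered mortality in patients"
flipped = "the drug increased mortality in treated patients"

faithful_score = rouge1_f1(reference, faithful)
flipped_score = rouge1_f1(reference, flipped)

print(f"faithful paraphrase: {faithful_score:.3f}")
print(f"flipped direction:   {flipped_score:.3f}")
```

The flipped summary is clinically wrong but lexically close, which is the case a separate factual check is meant to catch.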
Failure Modes
Hallucinated facts or incorrect direction of effect in study summaries.
Missing important content due to input truncation.

