Overview
Production Readiness
0.4
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
9
Why It Matters For Business
Automated summarization can cut clinician time and speed literature review, but current models still make factual errors; businesses should combine domain-adapted PLMs or LLM prompting with verification steps before clinical use.
Summary TLDR
This paper surveys recent methods, datasets, and evaluations for biomedical text summarization (BTS). It compares three ways of using pre-trained language models (PLMs)—feature-based, fine-tuning, and domain-adaptation—then reviews early uses of large language models (LLMs) for zero/few-shot and data augmentation. Key takeaways: PLMs and domain-adapted PLMs boost standard metrics; LLMs show strong zero-shot promise on some clinical tasks; major gaps remain in dataset coverage, long-input handling, and factual consistency of generated summaries.
Problem Statement
Biomedical text is growing fast (papers, EHRs, conversations), and clinicians need concise, accurate summaries. Before this paper, BTS research lacked a focused review of methods built on modern PLMs and LLMs, and of the evaluation and factuality issues specific to the biomedical domain.
Main Contribution
Systematic review of BTS datasets, methods, and evaluation metrics that use PLMs and LLMs.
Taxonomy of how PLMs/LLMs are used: feature-based, fine-tuning, and domain adaptation for PLMs; data augmentation, zero-shot prompting, and domain adaptation for LLMs.
Comparison of methods on public biomedical benchmarks and a focused discussion of limitations and future directions.
Curated public resources (dataset and code links) in a companion GitHub repository.
Key Findings
Domain-adapted PLMs give the best extractive results on PubMed.
LLMs can match or beat supervised systems in some zero-shot radiology summarization tests.
Factual correctness remains weak for abstractive BTS.
Public datasets are imbalanced: many literature corpora but few EHR/conversation corpora.
Results
PubMed-short ROUGE-1
OpenI ROUGE-1
MS^2 PICO correctness
Who Should Care
What To Try In 7 Days
Prompt an LLM (ChatGPT/GPT-3) on a small radiology set to get a quick zero-shot baseline.
Fine-tune a domain-adapted PLM (PubMedBERT or BioBART) on a small labeled subset and compare ROUGE and factual checks.
Run a simple factuality check (CheXbert or a similar report labeler) on generated radiology impressions and flag mismatches for human review.
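The factuality check in the last step can be sketched as a label comparison: label the reference and the generated impression, then flag every finding whose label differs. The sketch below is only an illustration. `FINDINGS`, `NEGATION_CUES`, `toy_label`, and `flag_mismatches` are hypothetical names, and the keyword-and-negation labeler is a crude stand-in for a trained labeler such as CheXbert, not its actual interface.

```python
import re

# Hypothetical finding vocabulary, used only for illustration. A trained
# labeler covers a fixed observation set and also handles uncertainty.
FINDINGS = ("pneumothorax", "effusion", "consolidation", "cardiomegaly")

# Very rough negation cues; assumption: the cue precedes the finding
# within the same sentence.
NEGATION_CUES = re.compile(r"\b(no|without|negative for)\b")


def toy_label(text):
    """Map each mentioned finding to 'present' or 'absent'."""
    labels = {}
    for sentence in re.split(r"[.;]", text.lower()):
        for finding in FINDINGS:
            pos = sentence.find(finding)
            if pos == -1:
                continue  # finding not mentioned in this sentence
            negated = NEGATION_CUES.search(sentence[:pos]) is not None
            labels[finding] = "absent" if negated else "present"
    return labels


def flag_mismatches(reference, generated):
    """Return findings whose label differs between the two texts."""
    ref, gen = toy_label(reference), toy_label(generated)
    return sorted(f for f in set(ref) | set(gen) if ref.get(f) != gen.get(f))


reference = "No pneumothorax. Small left pleural effusion."
generated = "Small pneumothorax. Small left pleural effusion."
print(flag_mismatches(reference, generated))  # findings to send for human review
```

Anything returned by `flag_mismatches` would go to a human reviewer; an empty list means only that this crude labeler saw no disagreement, not that the summary is correct.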
Reproducibility
Data URLs
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Public datasets skew toward biomedical literature; EHRs and conversations are limited and small.
- PLMs/LLMs struggle with very long documents because many models truncate input at a fixed length (about 512 tokens for BERT-style encoders).
- Abstractive models commonly produce factual errors; factual metrics are immature.
- Model choice matters: domain-specific pretraining often helps but is not a cure.
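One common workaround for the truncation limit noted above is to split a long document into overlapping windows and summarize each window separately. This is a general technique, not a method the survey prescribes. The sketch below assumes whitespace-separated words as a rough proxy for model tokens (real models count subword tokens); `chunk_tokens`, `max_len`, and `stride` are illustrative names and values.

```python
def chunk_tokens(tokens, max_len=512, stride=384):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows start `stride` tokens apart, so they overlap by
    max_len - stride tokens and no token is dropped.
    """
    if max_len <= 0 or stride <= 0 or stride > max_len:
        raise ValueError("need 0 < stride <= max_len")
    if not tokens:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window reached the end of the document
        start += stride
    return chunks


# Assumption: words stand in for model tokens in this toy document.
document = " ".join(f"word{i}" for i in range(1000))
windows = chunk_tokens(document.split())
print(len(windows), [len(w) for w in windows])
```

Each window could then be summarized on its own and the partial summaries merged, at the cost of possible redundancy or inconsistency across windows.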
When Not To Use
- Do not deploy abstractive summaries for clinical decisions without human verification.
- Avoid relying solely on ROUGE/BERTScore to judge clinical summary quality.
- Don't expect off-the-shelf biomedical PLMs to perform well on radiology notes without domain adaptation.
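The warning about overlap metrics can be illustrated with a simplified ROUGE-1 computation. The sketch below is a minimal unigram-overlap F1 using whitespace tokenization, without the stemming or other preprocessing that official ROUGE implementations apply; `rouge1_f1` is an illustrative name. It shows that dropping a single negation word flips the clinical meaning while the score stays high.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased words."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "no evidence of pneumothorax"
candidate = "evidence of pneumothorax"  # negation dropped: opposite meaning
print(f"ROUGE-1 F1: {rouge1_f1(reference, candidate):.2f}")  # about 0.86
```

The candidate contradicts the reference yet scores about 0.86, which is why a label-based or human factuality check is needed alongside overlap metrics.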
Failure Modes
- Hallucinated facts or incorrect direction of effect in study summaries.
- Missing important content due to input truncation.
- Dataset mismatch causing poor generalization across document types.
- Evaluation metrics that fail to catch factual errors or clinical risk.
Core Entities
Models
- BERT
- BioBERT
- PubMedBERT
- RoBERTa
- SciBERT
- BART
- T5
- PEGASUS
- Longformer/LED
- GPT-2
- GPT-3
- ChatGPT
- RadBERT
- BioBART
- CLIN-T5
Metrics
- ROUGE
- BERTScore
- ΔEI
- factual F1
- CheXbert
- readability indices (Flesch, Gunning fog, Coleman-Liau)
Datasets
- PubMed
- SumPubMed
- S2ORC
- CORD-19
- PubMedCite
- CDSR
- PLOS
- RCT
- MS^2
- MIMIC-CXR
- OpenI
- HET-MC
- MeQSum
- CHQ-Summ
Benchmarks
- MEDIQA 2021 shared task
- MS^2

