Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
Prompt tuning lets teams deploy clinical summarization with much less compute and faster turnaround than full fine-tuning, and it can also improve quality when a large domain-specific LLM is available.
Summary TLDR
The authors use prompt tuning (trainable soft prompts) to steer clinical LLMs (GatorTronGPT-5B and -20B) to summarize doctor–patient dialogues into clinical notes. On the MTS-DIALOG benchmark, prompt-tuned GatorTronGPT-20B outperforms a fine-tuned T5-Large across ROUGE, BERTScore and BLEU (e.g., ROUGE-1: 0.3628 vs 0.3425). Prompt tuning trains only a small number of parameters (70M–302M) while keeping the LLM frozen, which cuts training time (2h10m–4h23m vs 9h34m for fine-tuning) and compute. A soft prompt length of 128 virtual tokens worked best. Few-shot tuning improves with sample size but still trails full-data performance.
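To put the quoted figures in proportion, the arithmetic can be sketched as below. The numbers are copied from the summary; the helper `to_minutes` and the constant names are illustrative, and the summary does not say which prompt-tuning duration belongs to which model size or what hardware was used, so the speedups are only a rough range.

```python
def to_minutes(duration: str) -> int:
    """Convert a string such as '2h10m' into total minutes."""
    hours, rest = duration.split("h")
    return int(hours) * 60 + int(rest.rstrip("m"))

# Figures quoted in the summary above.
ROUGE1_PROMPT_TUNED_20B = 0.3628
ROUGE1_T5_LARGE = 0.3425
PROMPT_TUNING_TIMES = ["2h10m", "4h23m"]  # reported range; per-model mapping not stated
FINE_TUNING_TIME = "9h34m"

absolute_gain = ROUGE1_PROMPT_TUNED_20B - ROUGE1_T5_LARGE
relative_gain = absolute_gain / ROUGE1_T5_LARGE

baseline = to_minutes(FINE_TUNING_TIME)
speedups = [baseline / to_minutes(t) for t in PROMPT_TUNING_TIMES]

print(f"ROUGE-1 gain: {absolute_gain:.4f} absolute, {relative_gain:.1%} relative")
print("Training speedup:", ", ".join(f"{s:.1f}x" for s in speedups))
```

Under these figures the ROUGE-1 gain is about 0.02 absolute (roughly 6% relative), and prompt tuning trains roughly 2x to 4x faster than the fine-tuning baseline.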
Problem Statement
Clinical documentation is time-consuming and causes burnout. We need cost-efficient automatic summarization that produces concise, clinically useful notes from doctor–patient dialogues without expensive full-model fine-tuning.
Main Contribution
Designed a prompt-tuning (soft-prompt) pipeline for clinical dialogue summarization using GatorTronGPT.
Compared initialization strategies (LSTM vs MLP) and soft-prompt lengths; LSTM with 128 virtual tokens performed best.
Benchmarked GatorTronGPT-5B and -20B against a fine-tuned T5-Large on MTS-DIALOG; GatorTronGPT-20B achieved the best scores.
Measured compute and parameter efficiency: prompt tuning updated far fewer parameters and ran faster than T5 fine-tuning.
Evaluated few-shot behavior: performance rises with more samples; 200 samples give reasonable results but do not reach full-data performance.
Key Findings
Prompt-tuned GatorTronGPT-20B outperformed fine-tuned T5-Large on automatic metrics (ROUGE, BERTScore, BLEU).
Prompt tuning updates far fewer parameters and runs faster than full fine-tuning.
A soft prompt length of 128 virtual tokens gave the best average performance for both model sizes.
Few-shot prompt tuning improves with more examples but does not match full-data results.
Prompt tuning stays parameter-efficient: only 70M–302M parameters are trained while the base LLM remains frozen.
Results
Rouge-1 (test)
BERTScore (test)
Trainable parameters (prompt/fine-tune)
Training duration
Few-shot Rouge-1
Who Should Care
What To Try In 7 Days
Run soft-prompt tuning on an available clinical LLM; start with 128 virtual tokens and LSTM init.
Compare trainable-parameter budget and wall-clock time vs fine-tuning a small encoder-decoder model.
Validate summaries with a few dozen clinician checks and BERTScore/ROUGE to detect obvious gaps or hallucinations.
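For the validation step, a simplified ROUGE-1 can be sketched in a few lines. This is not the official ROUGE scorer: it assumes whitespace tokenization and lowercasing, with no stemming, and the function name `rouge1` is illustrative. It is mainly useful for seeing why clinician checks are still needed alongside n-gram metrics.

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 (simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# A candidate that contradicts the reference still shares 3 of 4 words with it.
scores = rouge1("patient reports chest pain", "patient denies chest pain")
print(scores)
```

Here the candidate reverses the clinical meaning yet still scores an F1 of 0.75, which is the kind of gap a clinician review is meant to catch.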
Optimization Features
Token Efficiency
- 128 virtual tokens worked best for this task
Training Optimization
- Prompt tuning (update soft prompts, freeze base weights)
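The mechanics of "update soft prompts, freeze base weights" can be sketched with plain Python lists. This is a toy illustration, not the authors' pipeline: the hidden size, the random initialization, the placeholder gradients and all function names are assumptions, and the resulting parameter count is not meant to reproduce the 70M–302M figures reported in the summary.

```python
import random

HIDDEN_SIZE = 8           # toy width; the real model width is not given in this summary
NUM_VIRTUAL_TOKENS = 128  # the prompt length the summary reports as best

def init_soft_prompt(num_tokens, hidden_size, seed=0):
    """Create trainable virtual-token vectors (random init, purely illustrative)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
            for _ in range(num_tokens)]

def build_model_input(soft_prompt, token_embeddings):
    """Prepend the soft prompt to the embeddings of the real input tokens."""
    return soft_prompt + token_embeddings

def sgd_step(soft_prompt, grads, lr=0.1):
    """Update only the soft prompt; base-model weights are never touched."""
    return [[w - lr * g for w, g in zip(vec, gvec)]
            for vec, gvec in zip(soft_prompt, grads)]

soft_prompt = init_soft_prompt(NUM_VIRTUAL_TOKENS, HIDDEN_SIZE)
dialogue_embeddings = [[0.0] * HIDDEN_SIZE for _ in range(5)]  # stand-in for 5 embedded tokens
model_input = build_model_input(soft_prompt, dialogue_embeddings)

# Placeholder gradients; in practice they come from backpropagating through the frozen LLM.
grads = [[1.0] * HIDDEN_SIZE for _ in range(NUM_VIRTUAL_TOKENS)]
updated_prompt = sgd_step(soft_prompt, grads)

trainable = NUM_VIRTUAL_TOKENS * HIDDEN_SIZE
print(len(model_input), trainable)
```

The input sequence grows by the number of virtual tokens, and the only values that change during training are the soft-prompt vectors.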
Reproducibility
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Evaluation relies on automatic metrics (ROUGE/BERTScore) that may not match clinician judgement.
- Prompt tuning matched or beat fine-tuning only for large generative LLMs; smaller LLMs lag.
- Results measured on a single benchmark (MTS-DIALOG) with specialty imbalance.
- No human evaluation or clinical outcome validation reported.
When Not To Use
- When you need verified, human-reviewed clinical summaries; the reported results rest on automatic metrics only.
- If you lack access to a large pre-trained clinical LLM (20B+ scale) required for parity.
- When a reproducible open-source code or model release is strictly required (not provided).
Failure Modes
- Hallucinated facts or omitted critical clinical details not caught by n-gram metrics.
- Performance drop in specialties underrepresented in MTS-DIALOG.
- Sensitivity to virtual-token length and prompt initialization, which can cause unstable results.
Core Entities
Models
- GatorTronGPT-5B
- GatorTronGPT-20B
- T5-Large
Metrics
- ROUGE-1
- ROUGE-2
- ROUGE-L
- BLEU
- BERTScore
Datasets
- MTS-DIALOG

