Prompt-tuning GatorTronGPT-20B gives efficient, higher-scoring clinical dialogue summaries

March 19, 2024 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

1

Authors

Mengxian Lyu, Cheng Peng, Xiaohan Li, Patrick Balian, Jiang Bian, Yonghui Wu

Why It Matters For Business

Prompt tuning lets teams deploy clinical summarization with far less compute and faster turnaround than full fine-tuning. With a large domain-specific LLM, it can also score higher on automatic quality metrics.

Summary TLDR

The authors use prompt tuning (trainable soft prompts) to steer clinical LLMs (GatorTronGPT-5B and -20B) to summarize doctor–patient dialogues into clinical notes. On the MTS-DIALOG benchmark, prompt-tuned GatorTronGPT-20B outperforms a fine-tuned T5-Large across ROUGE, BERTScore and BLEU (e.g., ROUGE-1: 0.3628 vs 0.3425). Prompt tuning trains only a small set of parameters (70M–302M) while keeping the LLM frozen, which cuts training time (2h10m–4h23m vs 9h34m for T5-Large fine-tuning) and compute. A soft-prompt length of 128 virtual tokens worked best. Few-shot tuning improves with sample size but still trails full-data performance.
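To make the "frozen model, trainable prompt" idea concrete, here is a minimal sketch in plain Python. It is a toy illustration only, not the authors' pipeline: the vocabulary, `HIDDEN`, `NUM_VIRTUAL`, and `build_model_input` are hypothetical names and sizes chosen for readability, and no training step is shown.

```python
import random

HIDDEN = 8        # ASSUMPTION: toy hidden size, not the real model width
NUM_VIRTUAL = 4   # ASSUMPTION: toy prompt length (the paper reports 128 worked best)

random.seed(0)

# Frozen part: a toy embedding table standing in for the LLM's own weights.
frozen_embeddings = {
    tok: [random.uniform(-1, 1) for _ in range(HIDDEN)]
    for tok in ["doctor", "patient", "cough", "fever"]
}

# Trainable part: one vector per virtual token.
# In prompt tuning, only these vectors would receive gradient updates.
soft_prompt = [[0.0] * HIDDEN for _ in range(NUM_VIRTUAL)]


def build_model_input(tokens):
    """Prepend the soft-prompt vectors to the frozen embeddings of the real tokens."""
    return soft_prompt + [frozen_embeddings[t] for t in tokens]


model_input = build_model_input(["patient", "cough"])
trainable_values = NUM_VIRTUAL * HIDDEN
frozen_values = len(frozen_embeddings) * HIDDEN

print(f"sequence length seen by the model: {len(model_input)}")
print(f"trainable values: {trainable_values}, frozen values: {frozen_values}")
```

The point of the sketch is the split: the base weights are never modified, so a single frozen model could in principle serve several tasks, each with its own small soft prompt.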

Problem Statement

Clinical documentation is time-consuming and causes burnout. We need cost-efficient automatic summarization that produces concise, clinically useful notes from doctor–patient dialogues without expensive full-model fine-tuning.

Main Contribution

Designed a prompt-tuning (soft-prompt) pipeline for clinical dialogue summarization using GatorTronGPT.

Compared initialization strategies (LSTM vs MLP) and soft-prompt lengths; LSTM initialization with 128 virtual tokens worked best.

Benchmarked GatorTronGPT-5B and -20B against a fine-tuned T5-Large on MTS-DIALOG; GatorTronGPT-20B achieved best scores.

Measured compute and parameter-efficiency: prompt tuning updated far fewer parameters and ran faster than T5 fine-tuning.

Evaluated few-shot behavior: performance rises with more samples; 200 samples give reasonable scores but do not reach full-data parity.

Key Findings

Prompt-tuned GatorTronGPT-20B outperformed fine-tuned T5-Large on automatic metrics.

Numbers: ROUGE-1 0.3628 vs 0.3425; BERTScore 0.7309 vs 0.6765 (Table 4)

Prompt tuning updates far fewer parameters and runs faster than full fine-tuning.

Numbers: GatorTronGPT-20B prompt tuning: 302M params, 4h23m vs T5-Large fine-tuning: 770M params, 9h34m (Table 4)

A soft-prompt length of 128 tokens gave the best average performance for both model sizes.

Numbers: Best overall scores at virtual token size = 128 for both 5B and 20B (Table 3)

Few-shot prompt tuning improves with more examples but does not match full-data results.

Numbers: ROUGE-1 rises from 0.139 (5 samples) to 0.3164 (200 samples) vs full-data 0.3628 (Table 6)

On the smaller model, prompt tuning needs even fewer trainable parameters and less training time.

Numbers: GatorTronGPT-5B prompt tuning updated 70M params and ran ~2h10m (Table 4)
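The efficiency figures above can be turned into ratios with a few lines of arithmetic. The sketch below uses only the reported parameter counts and training times; the total model sizes (20,000M and 5,000M) are an assumption read off the model names, not exact counts from the paper.

```python
def minutes(hours, mins):
    """Convert an hours/minutes duration to total minutes."""
    return hours * 60 + mins


# Reported figures (Table 4): (trainable parameters in millions, training minutes).
runs = {
    "GatorTronGPT-20B prompt tuning": (302, minutes(4, 23)),
    "GatorTronGPT-5B prompt tuning": (70, minutes(2, 10)),
    "T5-Large fine-tuning": (770, minutes(9, 34)),
}

# ASSUMPTION: nominal total sizes taken from the model names, in millions.
nominal_total_m = {
    "GatorTronGPT-20B prompt tuning": 20_000,
    "GatorTronGPT-5B prompt tuning": 5_000,
}

baseline_minutes = runs["T5-Large fine-tuning"][1]

# Wall-clock speedup relative to T5-Large fine-tuning.
speedup = {name: baseline_minutes / t for name, (_, t) in runs.items()}

# Trainable parameters expressed as a fraction of the nominal frozen model size.
trainable_ratio = {
    name: runs[name][0] / total for name, total in nominal_total_m.items()
}

for name in nominal_total_m:
    print(
        f"{name}: {speedup[name]:.2f}x faster than T5-Large fine-tuning, "
        f"trainable params = {trainable_ratio[name]:.2%} of nominal model size"
    )
```

Under these assumptions the trained parameters amount to roughly 1.5% of each GatorTronGPT model's nominal size, and training is roughly 2.2x (20B) to 4.4x (5B) faster than the T5-Large fine-tuning run. The runs may not share identical hardware settings, so treat the speedups as indicative.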

Results

ROUGE-1 (test)

Value: 0.3628

Baseline: T5-Large 0.3425

BERTScore (test)

Value: 0.7309

Baseline: T5-Large 0.6765

Trainable parameters (prompt/fine-tune)

Value: GatorTronGPT-20B: 302M; GatorTronGPT-5B: 70M; T5-Large: 770M

Training duration

Value: GatorTronGPT-20B: 4h23m; GatorTronGPT-5B: 2h10m; T5-Large: 9h34m

Few-shot ROUGE-1 (200 samples)

Value: 0.3164

Baseline: full-data 0.3628
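For readers who prefer relative numbers, the reported scores can be converted into percentage gains. This is a small worked calculation on the values listed in this section; `relative_gain` is an illustrative helper, not something from the paper.

```python
def relative_gain(value, baseline):
    """Relative improvement of value over baseline, as a fraction."""
    return (value - baseline) / baseline


# Prompt-tuned GatorTronGPT-20B vs fine-tuned T5-Large (reported test scores).
rouge1_gain = relative_gain(0.3628, 0.3425)
bertscore_gain = relative_gain(0.7309, 0.6765)

# Share of the full-data ROUGE-1 score reached with 200 training samples.
few_shot_retained = 0.3164 / 0.3628

print(f"ROUGE-1 relative gain:   {rouge1_gain:.1%}")
print(f"BERTScore relative gain: {bertscore_gain:.1%}")
print(f"Few-shot (200) retains:  {few_shot_retained:.1%} of full-data ROUGE-1")
```

The gains work out to about 5.9% on ROUGE-1 and 8.0% on BERTScore, and the 200-sample run reaches about 87% of the full-data ROUGE-1 score.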

What To Try In 7 Days

Run soft-prompt tuning on an available clinical LLM; start with 128 virtual tokens and LSTM initialization.

Compare trainable-parameter budget and wall-clock time vs fine-tuning a small encoder-decoder model.

Validate summaries with a few dozen clinician checks and BERTScore/ROUGE to detect obvious gaps or hallucinations.
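For the validation step, a quick unigram-overlap check can flag summaries worth sending to a clinician first. The sketch below is a simplified ROUGE-1 F1 (lowercased whitespace tokens, no stemming), so its scores will not match an official ROUGE toolkit exactly. The function names and the 0.3 threshold are hypothetical choices for illustration.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-token minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def flag_for_review(pairs, threshold=0.3):
    """Return indices of (candidate, reference) pairs scoring below the threshold.

    ASSUMPTION: the 0.3 threshold is arbitrary; tune it on your own data.
    """
    return [i for i, (cand, ref) in enumerate(pairs) if rouge1_f1(cand, ref) < threshold]


pairs = [
    ("patient reports dry cough", "patient reports a dry cough for two weeks"),
    ("follow up in one month", "patient denies chest pain"),
]
print([round(rouge1_f1(c, r), 3) for c, r in pairs])
print(flag_for_review(pairs))
```

A low overlap score does not prove a summary is wrong, and a high one does not rule out hallucinated facts, so this check works as a triage aid alongside clinician review, not a replacement for it.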

Optimization Features

Token Efficiency

  • Found 128 virtual tokens best for this task

Training Optimization

  • Prompt tuning (update soft prompts, freeze base weights)

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluation relies on automatic metrics (ROUGE/BERTScore) that may not match clinician judgement.
  • Prompt tuning matched or beat fine-tuning only for large generative LLMs; smaller LLMs lag.
  • Results measured on a single benchmark (MTS-DIALOG) with specialty imbalance.
  • No human evaluation or clinical outcome validation reported.

When Not To Use

  • When you need verified, human-reviewed clinical summaries and cannot rely on automated checks alone.
  • If you lack access to a large pre-trained clinical LLM (20B+ scale) required for parity.
  • When you require strictly reproducible open-source code or a model release (not provided).

Failure Modes

  • Hallucinated facts or omitted critical clinical details not caught by n-gram metrics.
  • Performance drop in specialties underrepresented in MTS-DIALOG.
  • Sensitivity to virtual-token length and prompt initialization, which can cause unstable results.

Core Entities

Models

  • GatorTronGPT-5B
  • GatorTronGPT-20B
  • T5-Large

Metrics

  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • BLEU
  • BERTScore

Datasets

  • MTS-DIALOG