Soft prompts on frozen billion‑scale LLMs are parameter‑efficient and improve cross‑site and few‑shot clinical extraction.

October 10, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.45

Cost Impact Score

0.65

Citation Count

1

Authors

Cheng Peng, Xi Yang, Kaleb E Smith, Zehao Yu, Aokun Chen, Jiang Bian, Yonghui Wu

Why It Matters For Business

Tuning only soft prompts on a frozen billion‑scale clinical LLM cuts compute and deployment costs while keeping or improving cross‑site and few‑shot extraction accuracy.

Summary TLDR

The authors build a soft prompt (learnable continuous prompt) MRC model and compare four training strategies (no prompt, hard prompt unfrozen, soft prompt unfrozen, soft prompt frozen) across seven encoder LLMs and two n2c2 clinical datasets. Soft prompting usually beats hard prompts and traditional fine‑tuning. Freezing the LLM and only tuning soft prompts is parameter‑efficient and improves transfer and few‑shot generalization, but this only holds when the base model is large (billions of parameters). Small frozen models lose several F1 points.

Problem Statement

Clinical entity and relation extraction still depends on costly model tuning and manual prompt design. The paper asks whether learnable soft prompts plus frozen or unfrozen LLMs can reduce tuning cost while keeping or improving extraction accuracy and cross‑site/few‑shot performance.

Main Contribution

Introduced a soft‑prompt based machine‑reading‑comprehension (MRC) architecture for clinical concept and relation extraction.

Systematically compared four strategies: no prompt (fine‑tune), hard prompt (unfrozen), soft prompt (unfrozen), soft prompt (frozen).

Benchmarked seven encoder LLMs (345M to 8.9B params) on two n2c2 datasets (2018 drug‑ADE and 2022 SDoH) with strict micro F1.

Measured transfer (cross‑institution) and few‑shot behavior and did prompt‑length ablation.

Showed frozen prompt‑tuning is parameter efficient (2.5–6% params updated) and competitive only at billion‑scale.
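
The parameter-efficiency claim comes down to a simple ratio: parameters that receive gradient updates divided by all parameters. A minimal sketch of that arithmetic follows. The function name and the parameter counts are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
def trainable_fraction(trainable_params: int, frozen_params: int) -> float:
    """Return the share of all parameters that receive gradient updates."""
    total = trainable_params + frozen_params
    if total <= 0:
        raise ValueError("model must have at least one parameter")
    return trainable_params / total


# Hypothetical counts (NOT from the paper): a frozen 4B-parameter encoder
# plus 200M trainable prompt-side parameters.
frac = trainable_fraction(trainable_params=200_000_000, frozen_params=4_000_000_000)
print(f"{frac:.2%} of parameters are updated")
```

Plugging in the real counts from your own model shows quickly whether a setup falls in the low single-digit range the paper reports.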

Key Findings

Soft prompting with an unfrozen GatorTron-3.9B gave best concept extraction on drug‑ADE.

Numbers: strict F1 = 0.9118

Frozen billion‑scale models reach near‑parity with unfrozen tuning for concept extraction.

Numbers: F1 ≈ 0.9085–0.9093 for GatorTron‑3.9B/8.9B when frozen

Soft prompting improves few‑shot and cross‑institution generalization, especially when the base model is large.

Numbers: cross‑site concept F1 (MIMIC→UW) up to 0.8297; few‑shot F1 with 100 samples ~0.816 (concepts) and ~0.715 (relations)

Freezing small LLMs significantly hurts performance versus unfrozen tuning.

Numbers: performance drop of about 3.8%–5.9% for ≤345M models

Soft prompts handle nested/overlapped entities better than standard fine‑tuning in some cases.

Numbers: for example, soft prompting (GatorTron‑345M) identified 39% of overlapped and 82% of nested entities, versus 28% and 9% for fine‑tuning

Results

Concept extraction (drug-ADE)

Value: F1 = 0.9118 (GatorTron-3.9B, soft prompt unfrozen)

Baseline: F1 = 0.8883 (no prompt, unfrozen)

Concept extraction (SDoH)

Value: F1 = 0.8610 (GatorTron-8.9B, soft prompt unfrozen)

Baseline: F1 = 0.8388 (no prompt, unfrozen)

End‑to‑end relation extraction (drug-ADE)

Value: F1 = 0.8332 (GatorTron-345M, soft prompt unfrozen)

Baseline: F1 = 0.8192 (no prompt, unfrozen)

End‑to‑end relation extraction (SDoH)

Value: F1 = 0.7488 (GatorTron-345M, soft prompt unfrozen)

Baseline: F1 = 0.6395 (no prompt, unfrozen)

What To Try In 7 Days

Run a 2‑arm test: soft prompt frozen vs standard fine‑tune on your best available clinical model.

If you have access to a model with ≥3B parameters, train only soft prompts and measure strict F1 and the cross‑site drop.

Experiment with prompt lengths (try 32 and 64 tokens); performance varies by 1–2%.
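
One way to keep the two-arm test honest is to record in-domain and cross-site strict F1 for each arm and compare the drop between them. The sketch below assumes you already have those scores. The `ArmResult` class, the `pick_more_robust` helper, and all numbers are hypothetical placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class ArmResult:
    name: str
    in_domain_f1: float   # strict F1 on the training institution's test set
    cross_site_f1: float  # strict F1 on a different institution's test set

    @property
    def cross_site_drop(self) -> float:
        """Absolute F1 lost when moving to the other site."""
        return self.in_domain_f1 - self.cross_site_f1


def pick_more_robust(a: ArmResult, b: ArmResult) -> ArmResult:
    """Return the arm that loses less F1 across sites."""
    return min((a, b), key=lambda arm: arm.cross_site_drop)


# Placeholder scores for illustration only.
soft_frozen = ArmResult("soft prompt, frozen", in_domain_f1=0.90, cross_site_f1=0.82)
fine_tuned = ArmResult("fine-tune, unfrozen", in_domain_f1=0.91, cross_site_f1=0.78)

best = pick_more_robust(soft_frozen, fine_tuned)
print(f"{best.name}: drop = {best.cross_site_drop:.3f}")
```

Comparing drops rather than raw scores separates "more accurate in-domain" from "more robust across sites", which are the two claims the paper makes separately.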

Optimization Features

Token Efficiency

  • Soft prompts are vector tokens added to embeddings (no extra tokenization)
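
To make "vector tokens added to embeddings" concrete, here is a minimal, framework-free sketch using plain Python lists. It assumes the soft prompt is prepended to the token embeddings, which is one common placement and may differ from where the paper puts prompt vectors in its MRC input. The function names, sizes, and initialization are illustrative assumptions.

```python
import random
from typing import List

Matrix = List[List[float]]  # rows = positions, columns = hidden dimensions


def init_soft_prompt(prompt_len: int, hidden_size: int, seed: int = 0) -> Matrix:
    """Create prompt_len trainable vectors with small random values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
            for _ in range(prompt_len)]


def prepend_soft_prompt(soft_prompt: Matrix, token_embeddings: Matrix) -> Matrix:
    """Concatenate prompt vectors in front of the token embeddings.

    No tokenizer is involved: the prompt lives only in embedding space.
    """
    if soft_prompt and token_embeddings:
        if len(soft_prompt[0]) != len(token_embeddings[0]):
            raise ValueError("prompt and token embeddings need the same hidden size")
    return soft_prompt + token_embeddings


# Five stand-in token embeddings of width 8 (zeros, for illustration only).
token_embeddings = [[0.0] * 8 for _ in range(5)]
soft_prompt = init_soft_prompt(prompt_len=4, hidden_size=8)
encoder_input = prepend_soft_prompt(soft_prompt, token_embeddings)
print(len(encoder_input), "positions fed to the encoder")
```

In a real setup the prompt vectors would be the trainable tensor and the encoder weights would have gradients disabled. The sequence grows by the prompt length, so that length counts against the model's maximum input size.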

Infra Optimization

  • Experiments ran on 8× A100‑80G GPUs

Model Optimization

  • Use frozen LLM + soft prompts to avoid weight updates
  • Prompt length tuning affects performance (32–64 preferred)

System Optimization

  • Frozen model deployment enables one model for multiple tasks

Training Optimization

  • Update only 2.5–6% of parameters when prompt‑tuning (as reported in the paper)
  • Five‑fold cross‑validation for hyperparameter selection
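
The five-fold split can be sketched without any ML library. This is a generic k-fold index generator, not the authors' code. The function name, the seed, and the item count are assumptions for illustration.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_items: int, k: int = 5, seed: int = 13
                   ) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_items:
        raise ValueError("need 2 <= k <= n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]     # k near-equal, disjoint folds
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 23 hypothetical training notes split five ways.
splits = list(k_fold_indices(n_items=23, k=5))
for fold_no, (train, validation) in enumerate(splits, start=1):
    print(f"fold {fold_no}: {len(train)} train / {len(validation)} validation")
```

Each hyperparameter setting (for example, a prompt length) would be trained once per fold and scored on that fold's validation indices, then averaged across folds.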

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only encoder (BERT‑style) LLMs were examined; decoder/generative models not tested.
  • Experiments use clinical GatorTron family and may not generalize to all LLMs.
  • Frozen prompt‑tuning needs billion‑scale models to match unfrozen accuracy.
  • Computational cost remains high for training large LLMs (A100 GPUs used).

When Not To Use

  • Do not freeze the model and tune only the prompts if your base model is small (≤345M): performance drops several F1 points.
  • Avoid relying on frozen soft prompts when you can afford full fine‑tuning and need the last margin of in‑domain accuracy.

Failure Modes

  • Frozen small models underperform compared to unfrozen tuning (3.8–5.9% F1 drop).
  • Unfrozen tuning can overfit to institution data and hurt cross‑site performance.
  • Soft prompt length sensitivity: too short or too long can lose 1–2% F1.

Core Entities

Models

  • BERT
  • BERT-MIMIC
  • RoBERTa
  • RoBERTa-MIMIC
  • GatorTron-345M
  • GatorTron-3.9B
  • GatorTron-8.9B

Metrics

  • strict micro‑averaged F1-score
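
For readers unfamiliar with the metric: "strict" means a predicted entity counts only if both boundaries and the type match the gold annotation exactly, and "micro-averaged" means true positives, predictions, and gold items are pooled across all types before computing F1. A minimal sketch follows. The span tuple layout and the example annotations are illustrative assumptions, not the official n2c2 scorer.

```python
from typing import Iterable, Tuple

Span = Tuple[str, int, int, str]  # (document id, start offset, end offset, label)


def strict_micro_f1(gold: Iterable[Span], pred: Iterable[Span]) -> float:
    """Micro-averaged F1 where only exact (doc, start, end, label) matches count."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Illustrative annotations: the second prediction is off by one character,
# so under strict matching it counts as both a miss and a false positive.
gold = [("note1", 0, 7, "Drug"), ("note1", 12, 20, "ADE"), ("note2", 3, 9, "Drug")]
pred = [("note1", 0, 7, "Drug"), ("note1", 12, 19, "ADE"), ("note2", 3, 9, "Drug")]

score = strict_micro_f1(gold, pred)
print(f"strict micro F1 = {score:.4f}")
```

Because a near-miss is penalized twice, strict F1 is lower than lenient (overlap-based) F1 on the same predictions. Keep this in mind when comparing against numbers reported under a different matching rule.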

Datasets

  • 2018 n2c2 drug-ADE (MIMIC)
  • 2022 n2c2 SDoH (MIMIC, UW)

Benchmarks

  • n2c2 2018 medication-ADE
  • n2c2 2022 SDoH