Clinical LLM trained on hospital notes shows large generalization gaps across hospitals, ages, and comorbidity levels

February 14, 2024 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.3

Cost Impact Score

0.6

Citation Count

8

Authors

Salman Rahman, Lavender Yao Jiang, Saadia Gabriel, Yindalon Aphinyanaphongs, Eric Karl Oermann, Rumi Chunara

Why It Matters For Business

A clinical LLM that does not generalize across hospitals or patient groups risks wrong predictions, worse care, and financial penalties for readmissions. A small amount of local fine-tuning often yields the largest improvement at underperforming sites.

Summary TLDR

The authors evaluate ClinicLLM, a BERT-base clinical language model pretrained on notes from one hospital system and fine-tuned to predict 30-day all-cause readmission. On the temporal test split, baseline AUC was 73.60%. AUC drops sharply at small hospitals (Hospital 4: 51.20%), for older patients (above 60: 64.75%), and for patients with high comorbidity (Level 4: 58.08%). Local hospital-specific fine-tuning is the most reliable fix, with a proportional AUC gain of up to 11.74% at the worst hospital (51.20% to 57.21%, about 6 points absolute). Instance-based augmentation helps on random-split tests but hurts temporal (deployment-like) performance. Cluster-based fine-tuning generally did not help.

Problem Statement

Clinical LLMs often perform well in development data but may fail when used in different hospitals or on different patient groups. This paper asks which factors cause those generalization gaps and which fine-tuning strategies actually improve real-world (temporal) performance.

Main Contribution

Systematic evaluation of ClinicLLM on 30-day readmission across four hospitals and multiple patient groups.

Analysis of drivers of poor generalization: sample size, patient age, comorbidity, and note length.

Comparison of three fine-tuning strategies: local (hospital-specific), instance-based augmentation, and cluster-based; local fine-tuning worked best on temporal tests.

Key Findings

Temporal baseline performance (global fine-tune) AUC = 73.60%.

Numbers: AUC = 73.60% (temporal test)

Large hospital-level gaps: Hospital 3 AUC = 69.90%, Hospital 4 AUC = 51.20%.

Numbers: Hospital 3 69.90%, Hospital 4 51.20%

Older patients and high comorbidity groups have much lower discrimination.

Numbers: Age >60 AUC 64.75%; Comorbidity Level 4 AUC 58.08%

Local hospital-specific fine-tuning improved temporal AUC for low-data sites.

Numbers: Hospital 4 AUC +11.74% proportional (51.20% to 57.21%, about 6 points absolute); Hospital 3 +2.39%

Instance-based augmentation helps random-split tests but reduces temporal performance.

Numbers: Random-split AUC gain ~+5.5%; temporal proportional AUC change −2.86% (Hospital 3), −2.38% (Hospital 4)

Cluster-based fine-tuning mostly decreased temporal performance.

Numbers: Proportional AUC change −6.18% (Hospital 1), −8.93% (Hospital 2), +0.53% (Hospital 3)

Results

Temporal baseline AUC (global fine-tune)

Value: 73.60%

Random-split baseline AUC (global fine-tune)

Value: 76.90%

Hospital 4 AUC (global fine-tune)

Value: 51.20%

Baseline: Global temporal AUC 73.60%

Age >60 AUC

Value: 64.75%

Baseline: Global temporal AUC 73.60%

Comorbidity Level 4 AUC

Value: 58.08%

Baseline: Global temporal AUC 73.60%

Local fine-tuning improvement (Hospital 4)

Value: AUC rises to 57.21%

Baseline: Global Hospital 4 temporal AUC 51.20%

Instance-augmented temporal change (Hospital 3)

Value: Proportional AUC change −2.86%

Baseline: Hospital 3 global fine-tune

Cluster-based fine-tuning (Hospital 2)

Value: AUC 66.52%

Baseline: Global Hospital 2 AUC 73.04%
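The results mix absolute AUC values with proportional (relative) changes, which are easy to confuse. The short sketch below recomputes both kinds of change from the AUC pairs reported above. The function names are illustrative and not taken from the paper.

```python
def absolute_change(baseline_auc: float, new_auc: float) -> float:
    """Difference in AUC, in percentage points."""
    return new_auc - baseline_auc


def proportional_change(baseline_auc: float, new_auc: float) -> float:
    """Change relative to the baseline AUC, in percent."""
    return (new_auc - baseline_auc) / baseline_auc * 100.0


# AUC pairs (baseline, new), in percent, as reported in the results above.
pairs = {
    "Hospital 4, local fine-tune": (51.20, 57.21),
    "Hospital 2, cluster-based": (73.04, 66.52),
}

for name, (base, new) in pairs.items():
    print(f"{name}: {absolute_change(base, new):+.2f} points, "
          f"{proportional_change(base, new):+.2f}% proportional")
# Hospital 4, local fine-tune: +6.01 points, +11.74% proportional
# Hospital 2, cluster-based: -6.52 points, -8.93% proportional
```

Both proportional figures match the reported numbers, which suggests the percentage changes in this summary are relative to the baseline rather than percentage-point differences.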

Who Should Care

What To Try In 7 Days

Compute temporal AUC and ECE per hospital and key subgroups (age, comorbidity, insurance).

If a site underperforms, run a short local fine-tune on ~3k–50k local notes and re-evaluate temporally.

Inspect note lengths and comorbidity distribution to prioritize groups needing extra validation or data collection.
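The first step above can be sketched in plain Python. This is a minimal illustration, not the authors' evaluation code: the row fields (`hospital`, `label`, `score`) and the toy data are hypothetical, AUC is computed with a simple pairwise comparison that is only practical for small samples, and ECE uses ten equal-width bins as an assumed default.

```python
from collections import defaultdict


def auc(labels, scores):
    """Probability that a positive outscores a negative (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when a group contains a single class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ece(labels, scores, n_bins=10):
    """Expected calibration error over equal-width bins; scores must lie in [0, 1]."""
    bins = defaultdict(list)
    for y, s in zip(labels, scores):
        bins[min(int(s * n_bins), n_bins - 1)].append((y, s))
    total = len(labels)
    error = 0.0
    for items in bins.values():
        observed = sum(y for y, _ in items) / len(items)
        predicted = sum(s for _, s in items) / len(items)
        error += len(items) / total * abs(observed - predicted)
    return error


def metrics_by_group(rows, group_key):
    """Compute sample size, AUC, and ECE separately for each value of group_key."""
    grouped = defaultdict(lambda: ([], []))
    for row in rows:
        labels, scores = grouped[row[group_key]]
        labels.append(row["label"])
        scores.append(row["score"])
    return {
        group: {"n": len(labels), "auc": auc(labels, scores), "ece": ece(labels, scores)}
        for group, (labels, scores) in grouped.items()
    }


# Toy predictions for two hypothetical sites (not data from the paper).
rows = [
    {"hospital": "A", "label": 1, "score": 0.9},
    {"hospital": "A", "label": 0, "score": 0.2},
    {"hospital": "A", "label": 1, "score": 0.7},
    {"hospital": "A", "label": 0, "score": 0.4},
    {"hospital": "B", "label": 1, "score": 0.3},
    {"hospital": "B", "label": 0, "score": 0.6},
    {"hospital": "B", "label": 1, "score": 0.8},
    {"hospital": "B", "label": 0, "score": 0.5},
]

for site, m in metrics_by_group(rows, "hospital").items():
    print(site, m)
```

The same function can be reused with `group_key` set to an age band or comorbidity level. For production-sized test sets, a library implementation of AUC would be a more practical choice than the pairwise loop.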

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Single health system (four hospitals) — results may not hold in other systems.
  • Only History & Physical notes were used; other note types were excluded.
  • Model is BERT-base with 512-token limit, so long-note information is truncated.
  • No in-hospital pretraining or adversarial training was performed.
  • Hospital names redacted; replication requires institutional access to EHR data.
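To gauge how much the 512-token limit might matter for a given site, one rough check is to count how many notes exceed it. The sketch below is an approximation under a stated assumption: it counts whitespace-separated words, which usually undercounts WordPiece tokens, so the real truncation rate is likely higher. The function name and the sample notes are hypothetical.

```python
def truncation_report(notes, max_tokens=512):
    """Estimate how many notes exceed max_tokens and how much text is cut off.

    Uses whitespace-separated words as a crude stand-in for model tokens.
    """
    lengths = [len(note.split()) for note in notes]
    over_limit = [n for n in lengths if n > max_tokens]
    total_tokens = sum(lengths)
    dropped_tokens = sum(n - max_tokens for n in over_limit)
    return {
        "n_notes": len(lengths),
        "share_truncated": len(over_limit) / len(lengths) if lengths else 0.0,
        "share_tokens_dropped": dropped_tokens / total_tokens if total_tokens else 0.0,
    }


# Two synthetic notes: one short, one twice the limit.
sample_notes = ["word " * 100, "word " * 1024]
print(truncation_report(sample_notes))
```

Running the report per hospital or per comorbidity level shows whether the groups with the weakest AUC are also the ones losing the most text to truncation.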

When Not To Use

  • If your deployment covers very different hospitals or outpatient clinics without local data.
  • When clinical notes are much longer than 512 tokens and truncation loses key info.
  • If you cannot run temporal (future) validation to detect drift.
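The difference between a temporal and a random split can be made concrete with a small sketch. The `note_date` field, the cutoff date, and the toy rows are hypothetical; the source does not state the cutoff used in the paper.

```python
import random
from datetime import date


def temporal_split(rows, cutoff):
    """Train on notes dated before the cutoff, test on notes from the cutoff onward."""
    train = [r for r in rows if r["note_date"] < cutoff]
    test = [r for r in rows if r["note_date"] >= cutoff]
    return train, test


def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle, then hold out a fraction; train and test share the same time period."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy rows: one note per year from 2013 to 2021 (not data from the paper).
rows = [{"note_id": i, "note_date": date(2013 + i, 6, 1)} for i in range(9)]

train, test = temporal_split(rows, cutoff=date(2020, 1, 1))
print(len(train), len(test))  # 7 2
```

A random split lets the model see notes from the same period it is tested on, which is one plausible reason random-split AUC (76.90%) sits above temporal AUC (73.60%) in the results.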

Failure Modes

  • Poor performance at small hospitals or low-sample sites.
  • Worse discrimination for older patients and those with many comorbidities.
  • Instance-augmented data can overfit to training timeframe and harm temporal generalization.
  • Cluster-based fine-tuning may reduce true-positive detection in some hospitals.

Core Entities

Models

  • ClinicLLM (BERT-base, 109M, masked LM pretrained on hospital notes)
  • BERT-base (architecture reference)

Metrics

  • AUC
  • AUPR
  • ECE
  • Perplexity

Datasets

  • Pretraining: ~7,247,694 clinical notes (~4.1B words) from four hospitals
  • Fine-tuning: H&P notes, 222,824 notes, 170,191 patients (Dec 2012–Dec 2021)

Benchmarks

  • 30-day all-cause readmission prediction (note-level)