Fine-tune quantized LLMs on tokenized EHR histories to beat Med‑BERT and other baselines on diagnosis and readmission prediction

September 20, 2023 · 6 min

Overview

Production Readiness

0.5

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

9

Authors

Ofir Ben Shoham, Nadav Rappoport

Links

Abstract / PDF

Why It Matters For Business

CPLLM shows you can repurpose public LLMs for EHR forecasting with no domain pretraining, achieving modest but consistent gains; this can speed deployment and lower data‑preparation costs.

Summary TLDR

This paper introduces CPLLM, a workflow that fine-tunes pre-trained LLMs (Llama2-13B and BioMedLM-2.7B) on structured EHR data turned into text prompts. Using quantized PEFT (QLoRA/LoRA) and a small number of added medical tokens, CPLLM outperformed baselines (Med‑BERT, RETAIN, logistic regression, PyHealth models) on three diagnosis tasks and readmission prediction (MIMIC‑IV and eICU‑CRD). The method needs no clinical pretraining, supports longer sequences (up to 4k tokens with Llama2), and is practical to run on a single GPU for fine-tuning.

Problem Statement

Can off‑the‑shelf LLMs be fine‑tuned (with quantization and prompt-style inputs) to predict future diagnoses and short-term hospital readmission from structured EHR sequences, without extra domain pretraining or visit-level timing?

Main Contribution

CPLLM: a prompt-based fine-tuning pipeline that encodes EHR diagnosis/procedure/drug sequences as text and trains LLMs for prediction.

Demonstration that quantized PEFT (QLoRA/LoRA) enables single‑GPU fine-tuning of large models (Llama2 and BioMedLM) for clinical prediction without clinical pretraining.

Empirical gains over Med‑BERT, RETAIN, logistic regression, and PyHealth baselines on diagnosis and readmission tasks; plus an ablation showing added domain tokens usually help.
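To make the prompt-based encoding concrete, here is a minimal sketch of turning one patient's diagnosis history into a text prompt. The template wording, the `build_prompt` helper, and the example descriptions are illustrative assumptions, not the prompt used in the paper.

```python
from typing import List


def build_prompt(diagnoses: List[str], target: str) -> str:
    """Render a chronological list of diagnosis descriptions as one text prompt.

    Hypothetical template: the wording is an assumption for illustration.
    """
    if not diagnoses:
        raise ValueError("patient history is empty")
    history = ", ".join(d.strip() for d in diagnoses)
    return (
        f"The following is a list of diagnoses for a patient: {history}. "
        f"Will the patient be diagnosed with {target}?"
    )


# Example history (made-up descriptions, oldest first).
example = build_prompt(
    ["Essential hypertension", "Type 2 diabetes mellitus"],
    target="chronic kidney disease",
)
print(example)
```

The resulting string would then be tokenized and paired with a binary label for fine-tuning; procedure or drug descriptions could be rendered the same way.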

Key Findings

CPLLM-Llama2 outperforms baselines on adult respiratory failure prediction by PR-AUC.

Numbers: PR-AUC 35.962% vs 35.050% (LogReg), +0.912% abs

CPLLM-Llama2 yields the largest PR-AUC gains for acute renal failure.

Numbers: PR-AUC 45.442% vs 43.603% (RETAIN), +1.839% abs (+4.22% relative)

CPLLM achieves top readmission prediction results on MIMIC‑IV and eICU‑CRD.

Numbers: MIMIC PR-AUC 68.986% vs 67.523% (ConCare), +1.46% abs; eICU PR-AUC 94.115%

Adding domain tokens usually improves performance.

Numbers: Unspecified renal failure: +0.499% PR-AUC (Llama2) and +1.631% PR-AUC (BioMedLM)

CPLLM fine-tuning can run on a single commodity GPU using quantized PEFT.

Numbers: Llama2 fine-tune ~1 day on RTX6000; BioMedLM ~2 hours (13B vs 2.7B)

Results

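PR-AUC (adult respiratory failure)

Value: 35.962%

Baseline: Logistic Regression 35.050%

PR-AUC (acute renal failure)

Value: 45.442%

Baseline: RETAIN 43.603%

PR-AUC (readmission, MIMIC‑IV)

Value: 68.986%

Baseline: ConCare 67.523%

PR-AUC (readmission, eICU‑CRD)

Value: 94.115%

Baseline: Deepr 93.814%

PR-AUC change from added tokenizer tokens (unspecified renal failure, Llama2)

Value: +0.499%

Baseline: without added tokens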
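Because the gaps are small, it helps to keep absolute gains (percentage points) separate from relative gains (percent of the baseline). The sketch below recomputes both from the PR-AUC values reported in this section; the `gains` helper and the dictionary labels are illustrative names, not from the paper.

```python
from typing import Tuple


def gains(model_pct: float, baseline_pct: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = model_pct - baseline_pct
    relative = 100.0 * absolute / baseline_pct
    return round(absolute, 3), round(relative, 2)


# PR-AUC values (model, baseline) as reported in the results.
reported = {
    "adult respiratory failure vs LogReg": (35.962, 35.050),
    "acute renal failure vs RETAIN": (45.442, 43.603),
    "MIMIC-IV readmission vs ConCare": (68.986, 67.523),
}

for task, (model, baseline) in reported.items():
    abs_gain, rel_gain = gains(model, baseline)
    print(f"{task}: +{abs_gain} points absolute, +{rel_gain}% relative")
```

For acute renal failure, this gives roughly 1.84 points absolute and 4.22% relative, so the two framings should not be mixed when comparing tasks.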

Who Should Care

What To Try In 7 Days

Fine‑tune a 2–13B LLM on a small EHR slice using QLoRA/PEFT and your prompt template to validate PR‑AUC.

Add missing medical description tokens to the tokenizer and re-run to measure tokenization gains.

Compare CPLLM PR‑AUC to a logistic regression and one sequence baseline (RETAIN/Med‑BERT) on a key prediction task.
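For the comparison step, a dependency-free sketch of average precision (a common step-wise estimate of PR-AUC) is shown below. The labels and scores are made-up toy values, and the paper's exact PR-AUC computation may differ; in practice a library metric would normally be used.

```python
from typing import Sequence


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Average precision for binary labels (0/1).

    Assumes no tied scores; ties would need threshold-level grouping,
    as library implementations do.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(y_score, y_true), key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


# Toy labels and scores (made up), standing in for a held-out test split.
y = [1, 0, 1, 0, 0, 1]
llm_scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7]
baseline_scores = [0.6, 0.7, 0.9, 0.3, 0.2, 0.5]

print("LLM:", average_precision(y, llm_scores))
print("Baseline:", average_precision(y, baseline_scores))
```

Running the same function on every model's scores over the same test split keeps the comparison consistent; repeating it across seeds gives a sense of the variance.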

Optimization Features

Token Efficiency

  • Added domain tokens to tokenizer
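A rough way to find candidate tokens is to check which words in the medical descriptions are absent from the tokenizer vocabulary. The sketch below uses a toy vocabulary and whole-word matching, both simplifying assumptions: real LLM tokenizers are subword-based, and the paper's selection procedure is not detailed here.

```python
from typing import Iterable, List, Set


def find_missing_tokens(descriptions: Iterable[str], vocab: Set[str]) -> List[str]:
    """Return lower-cased words absent from `vocab`, in first-seen order."""
    missing: List[str] = []
    seen: Set[str] = set()
    for text in descriptions:
        for word in text.lower().replace(",", " ").split():
            if word not in vocab and word not in seen:
                seen.add(word)
                missing.append(word)
    return missing


# Toy vocabulary and descriptions (both made up for illustration).
toy_vocab = {"acute", "failure", "chronic", "disease", "kidney"}
new_tokens = find_missing_tokens(
    ["Acute renal failure", "Chronic kidney disease, nephropathy"], toy_vocab
)
print(new_tokens)

# With a Hugging Face tokenizer and model, the usual follow-up would be
# tokenizer.add_tokens(new_tokens) and
# model.resize_token_embeddings(len(tokenizer)).
```

Newly added tokens start with untrained embeddings, so their benefit only shows up after fine-tuning, which is why re-running the evaluation with and without them is the useful check.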

Infra Optimization

  • Single‑GPU fine‑tuning demonstrated (RTX6000)

Model Optimization

  • LoRA

Training Optimization

  • LoRA
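A back-of-the-envelope calculation shows why LoRA on a quantized model fits on one GPU. The sketch below uses Llama2-13B's hidden size (5120) and layer count (40); the rank of 8 and the choice to adapt only the query and value projections are assumptions for illustration, not settings confirmed by the paper, and the memory figure ignores quantization constants, activations, and optimizer state.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one weight matrix: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def quantized_weight_gib(n_params: float, bits: int = 4) -> float:
    """Approximate memory (GiB) for frozen base weights at `bits` per parameter."""
    return n_params * bits / 8 / 2**30


# Llama2-13B dimensions; rank and adapted matrices are assumed values.
hidden, layers, rank = 5120, 40, 8
per_layer = 2 * lora_trainable_params(hidden, hidden, rank)  # query + value
total_trainable = layers * per_layer

print(f"trainable LoRA parameters: {total_trainable:,}")
print(f"4-bit base weights: ~{quantized_weight_gib(13e9):.2f} GiB")
```

Under these assumptions only a few million parameters receive gradients while the 13B base weights stay frozen in 4-bit form, which is what keeps the memory footprint within a single card.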

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Modest absolute PR‑AUC gains on some tasks; not all improvements are large.
  • Evaluation uses three runs; some variance (wide CIs) is reported, especially for the smaller BioMedLM in some settings.
  • Higher compute and inference cost versus lightweight baselines; Llama2 fine-tuning took ~1 day on RTX6000.

When Not To Use

  • When GPU/compute budget is very limited and simpler models already meet performance needs.
  • When extremely long patient sequences are the focus; datasets here lacked very long histories.
  • When explainability needs require transparent, simple models rather than LLMs.

Failure Modes

  • Overfitting or instability when the dataset is small or class imbalance is extreme (confidence intervals widen).
  • Prompt sensitivity: model requires prompt engineering and tokenizer changes that may not generalize.
  • Higher latency and cost during inference compared to compact clinical models.

Core Entities

Models

  • Llama2-13B
  • BioMedLM-2.7B (PubMedGPT)
  • Med-BERT
  • RETAIN
  • Logistic Regression
  • ConCare
  • Deepr
  • GRASP

Metrics

  • PR-AUC
  • ROC-AUC

Datasets

  • MIMIC-IV v2.0
  • eICU-CRD

Benchmarks

  • PyHealth