Fine-tune quantized LLMs on tokenized EHR histories to beat Med‑BERT and other baselines on diagnosis and readmission prediction

September 20, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

Results show consistent PR‑AUC and ROC‑AUC improvements on two public datasets, but gains are modest, experiments are limited to three runs, and compute costs for large LLMs are nontrivial.

Citations: 9

Evidence Strength: 0.70

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 50%

Authors

Ofir Ben Shoham, Nadav Rappoport

Why It Matters For Business

CPLLM shows you can repurpose public LLMs for EHR forecasting with no domain pretraining, achieving modest but consistent gains; this can speed deployment and lower data‑preparation costs.

Who Should Care

Summary TLDR

This paper introduces CPLLM, a workflow that fine-tunes pre-trained LLMs (Llama2-13B and BioMedLM-2.7B) on structured EHR data rendered as text prompts. Using quantized PEFT (QLoRA/LoRA) and a small number of added medical tokens, CPLLM outperformed baselines (Med‑BERT, RETAIN, logistic regression, PyHealth models) on three diagnosis tasks and on readmission prediction (MIMIC‑IV and eICU‑CRD). The method needs no clinical pretraining, supports longer sequences (up to 4k tokens with Llama2), and can be fine-tuned on a single GPU.
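
To give a rough sense of why quantization matters for single-GPU fine-tuning, the sketch below estimates weight storage for the two model sizes named above at 16-bit and 4-bit precision. It counts weights only: activations, adapter weights, optimizer state, and quantization metadata are ignored, and the helper name `weight_memory_gib` is an illustrative assumption rather than anything from the paper.

```python
# Back-of-the-envelope weight memory for the two model sizes in the summary.
# Weights only: activations, LoRA adapters, optimizer state and quantization
# metadata are deliberately left out, so real usage will be higher.

GIB = 1024 ** 3  # bytes per GiB


def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Memory needed to store n_params weights at the given precision."""
    return n_params * bits_per_weight / 8 / GIB


# Parameter counts taken from the model names (13B and 2.7B).
MODELS = {"Llama2-13B": 13e9, "BioMedLM-2.7B": 2.7e9}

for name, n_params in MODELS.items():
    fp16 = weight_memory_gib(n_params, 16)
    int4 = weight_memory_gib(n_params, 4)
    print(f"{name}: 16-bit ~ {fp16:.1f} GiB, 4-bit ~ {int4:.1f} GiB")
```

Going from 16-bit to 4-bit weights cuts weight storage by a factor of four, which is the main reason a 13B model becomes plausible on one GPU.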

Problem Statement

Can off‑the‑shelf LLMs be fine‑tuned (with quantization and prompt-style inputs) to predict future diagnoses and short-term hospital readmission from structured EHR sequences, without extra domain pretraining or visit-level timing?

Main Contribution

CPLLM: a prompt-based fine-tuning pipeline that encodes EHR diagnosis/procedure/drug sequences as text and trains LLMs for prediction.

Demonstration that quantized PEFT (QLoRA/LoRA) enables single‑GPU fine-tuning of large models (Llama2 and BioMedLM) for clinical prediction without clinical pretraining.
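
A minimal sketch of the encoding idea, assuming each medical code has already been mapped to a text description. The `PatientHistory` schema, the `build_prompt` helper, and the template wording are hypothetical illustrations, not the prompt format used in the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatientHistory:
    """Chronologically ordered code descriptions for one patient (hypothetical schema)."""
    diagnoses: List[str] = field(default_factory=list)
    procedures: List[str] = field(default_factory=list)
    drugs: List[str] = field(default_factory=list)


def build_prompt(history: PatientHistory, target: str) -> str:
    """Render a structured history as one text prompt.

    The template wording is an illustrative assumption, not the paper's.
    """
    sections = [
        ("Diagnoses", history.diagnoses),
        ("Procedures", history.procedures),
        ("Drugs", history.drugs),
    ]
    lines = [f"Task: predict whether the patient will be diagnosed with {target}."]
    for title, items in sections:
        if items:  # skip modalities with no recorded events
            lines.append(f"{title}: " + ", ".join(items))
    return "\n".join(lines)


example = PatientHistory(
    diagnoses=["Essential hypertension", "Type 2 diabetes mellitus"],
    drugs=["Metformin"],
)
prompt = build_prompt(example, target="acute renal failure")
print(prompt)
```

The resulting string would then be tokenized and passed to the fine-tuned LLM together with a binary label.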

Key Findings

CPLLM-Llama2 outperforms baselines on adult respiratory failure prediction by PR-AUC.

Numbers: PR-AUC 35.962% vs 35.050% (LogReg), +0.912% abs

Practical Use: If you fine-tune Llama2 with CPLLM prompts on similar EHR data, expect small but measurable PR-AUC gains over simple baselines on this task.

Evidence Ref: Table 2; section 3.2.1

CPLLM-Llama2 yields the largest PR-AUC gains for acute renal failure.

Numbers: PR-AUC 45.442% vs 43.603% (RETAIN), +1.839% abs (+4.22% relative)

Practical Use: For some diagnosis tasks (here acute renal failure), prompt-finetuned LLMs can deliver meaningful improvements and are worth testing against sequence models like RETAIN.

Evidence Ref: Table 2; section 3.2.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
PR-AUC | 35.962% | Logistic Regression 35.050% | +0.912% abs | eICU-CRD respiratory failure | Table 2 rows for respiratory failure | Table 2
PR-AUC | 45.442% | RETAIN 43.603% | +1.839% abs (+4.22% relative) | MIMIC-IV unspecified renal failure | Table 2 rows for unspecified renal failure | Table 2
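
The deltas can be rechecked directly from the reported values. The short sketch below separates the absolute difference (in percentage points) from the gain relative to the baseline. For the renal failure row the absolute difference is 1.839 points, and 4.22% is the relative gain over RETAIN. The function name `deltas` is an illustrative choice.

```python
from typing import Tuple


def deltas(model: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute difference in points, relative gain in percent)."""
    absolute = model - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# PR-AUC values (in percent) as reported in the rows above.
rows = {
    "respiratory failure vs LogReg": (35.962, 35.050),
    "renal failure vs RETAIN": (45.442, 43.603),
}

for name, (model, baseline) in rows.items():
    absolute, relative = deltas(model, baseline)
    print(f"{name}: +{absolute:.3f} points abs, +{relative:.2f}% relative")
```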

What To Try In 7 Days

Fine‑tune a 2–13B LLM on a small EHR slice using QLoRA/PEFT and your prompt template to validate PR‑AUC.

Add missing medical description tokens to the tokenizer and re-run to measure tokenization gains.

Compare CPLLM PR‑AUC to a logistic regression and one sequence baseline (RETAIN/Med‑BERT) on a key prediction task.
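
For the comparison step, a small dependency-free sketch of average precision, one common estimator of PR-AUC, is shown below. Whether the paper computed PR-AUC exactly this way is not stated in this summary. The labels and scores are made-up toy values, and tied scores are not given special treatment.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@k taken at the rank of each positive example.

    Examples are ranked by descending score. Ties are broken by input
    order, which differs from threshold-based library implementations.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("average precision is undefined without positive labels")
    return precision_sum / hits


# Toy data (hypothetical): 3 positives among 8 patients.
labels = [1, 0, 0, 1, 0, 0, 0, 1]
baseline_scores = [0.9, 0.8, 0.4, 0.7, 0.1, 0.3, 0.5, 0.6]
candidate_scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.3, 0.5, 0.7]

ap_baseline = average_precision(labels, baseline_scores)
ap_candidate = average_precision(labels, candidate_scores)
print(f"baseline AP={ap_baseline:.3f}, candidate AP={ap_candidate:.3f}")
```

On real data, both models should be scored on the same held-out split, and the comparison repeated across seeds before drawing conclusions.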

Optimization Features

Token Efficiency: Added domain tokens to tokenizer
Infra Optimization: Single‑GPU fine‑tuning demonstrated (RTX 6000)
Model Optimization: LoRA
Training Optimization: LoRA
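
To illustrate why adding domain tokens can shorten inputs, here is a toy greedy longest-match tokenizer. It is a deliberately simplified stand-in, not the BPE or SentencePiece tokenizer of any real model, and the vocabulary and example text are hypothetical.

```python
from typing import List, Set


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, fall back to a single character.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


base_vocab = {"acute", " ", "renal", "failure", "neph", "ro", "pathy"}
text = "acute nephropathy"

before = greedy_tokenize(text, base_vocab)
after = greedy_tokenize(text, base_vocab | {"nephropathy"})  # add one domain token

print(len(before), before)
print(len(after), after)
```

With Hugging Face Transformers, the analogous step would typically be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, since new tokens need embedding rows that are then learned during fine-tuning.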

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Modest absolute PR‑AUC gains on some tasks; not all improvements are large.

Evaluation uses three runs; some variance (wide CIs) reported, especially for smaller BioMedLM in some settings.
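
A short worked example of why three runs tend to give wide intervals, assuming a Student-t interval on the mean. The interval method is an assumption here, since this summary does not say how the paper's CIs were computed, and the three PR-AUC values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple

# Two-sided 95% Student-t critical value for 2 degrees of freedom (n = 3 runs).
T_CRIT_95_DF2 = 4.303


def ci95_three_runs(values: Sequence[float]) -> Tuple[float, float]:
    """95% t-interval for the mean of exactly three runs."""
    if len(values) != 3:
        raise ValueError("this helper hard-codes the critical value for n = 3")
    centre = mean(values)
    half_width = T_CRIT_95_DF2 * stdev(values) / sqrt(3)
    return centre - half_width, centre + half_width


runs = [35.2, 36.0, 36.7]  # hypothetical PR-AUC values (percent) from three seeds
low, high = ci95_three_runs(runs)
print(f"mean={mean(runs):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With only two degrees of freedom the critical value is 4.303, more than twice the 1.96 used for large samples, so even moderate run-to-run spread produces an interval several points wide.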

When Not To Use

When GPU/compute budget is very limited and simpler models already meet performance needs.

When extremely long patient sequences are the focus; datasets here lacked very long histories.

Failure Modes

Overfitting or instability when the dataset is small or class imbalance is extreme (confidence intervals widen).

Prompt sensitivity: model requires prompt engineering and tokenizer changes that may not generalize.

Core Entities

Models

Llama2-13B, BioMedLM-2.7B (PubMedGPT), Med-BERT, RETAIN, Logistic Regression, ConCare, Deepr, GRASP

Metrics

PR-AUC, ROC-AUC

Datasets

MIMIC-IV v2.0, eICU-CRD

Benchmarks

PyHealth