Fine-tune quantized LLMs on tokenized EHR histories to beat Med‑BERT and other baselines on diagnosis and readmission prediction

Overview

Decision Snapshot: Needs Validation

Results show consistent PR‑AUC and ROC‑AUC improvements on two public datasets, but gains are modest, experiments are limited to three runs, and compute costs for large LLMs are nontrivial.

Citations: 9

Evidence Strength: 0.70

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 50%

Authors

Ofir Ben Shoham, Nadav Rappoport

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CPLLM shows you can repurpose public LLMs for EHR forecasting with no domain pretraining, achieving modest but consistent gains; this can speed deployment and lower data‑preparation costs.

Who Should Care

ML Engineer, Product Manager, Founder, Data Scientist

Summary TLDR

This paper introduces CPLLM, a workflow that fine-tunes pre-trained LLMs (Llama2-13B and BioMedLM-2.7B) on structured EHR data turned into text prompts. Using quantized PEFT (QLoRA/LoRA) and a small number of added medical tokens, CPLLM outperformed baselines (Med‑BERT, RETAIN, logistic regression, PyHealth models) on three diagnosis tasks and readmission prediction (MIMIC‑IV and eICU‑CRD). The method needs no clinical pretraining, supports longer sequences (up to 4k tokens with Llama2), and is practical to run on a single GPU for fine-tuning.
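To make "structured EHR data turned into text prompts" concrete, here is a minimal sketch of one way to do it. The prompt wording, the function name `build_prompt`, and the example descriptions are illustrative assumptions, not the template used in the paper.

```python
def build_prompt(history, target):
    """Render a patient's ordered diagnosis descriptions as a text prompt.

    The wording below is a hypothetical template, not the paper's.
    """
    if not history:
        raise ValueError("history must contain at least one description")
    # Normalize whitespace and keep the chronological order of the input.
    items = [" ".join(d.split()) for d in history]
    listing = "\n".join(f"{i}. {d}" for i, d in enumerate(items, start=1))
    return (
        "The patient was diagnosed with the following conditions, in order:\n"
        f"{listing}\n"
        f"Will the patient be diagnosed with {target} next? "
        "Answer yes or no."
    )


# Hypothetical two-item history, for illustration only.
example = build_prompt(
    ["Essential hypertension", "Type 2  diabetes mellitus"],
    target="chronic kidney disease",
)
print(example)
```

A template like this keeps the input as free text, which is why no clinical pretraining vocabulary is required up front; the label for fine-tuning would be the binary outcome paired with each prompt.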

Problem Statement

Can off‑the‑shelf LLMs be fine‑tuned (with quantization and prompt-style inputs) to predict future diagnoses and short-term hospital readmission from structured EHR sequences, without extra domain pretraining or visit-level timing?

Main Contribution

CPLLM: a prompt-based fine-tuning pipeline that encodes EHR diagnosis/procedure/drug sequences as text and trains LLMs for prediction.

Demonstration that quantized PEFT (QLoRA/LoRA) enables single‑GPU fine-tuning of large models (Llama2 and BioMedLM) for clinical prediction without clinical pretraining.
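A back-of-the-envelope calculation suggests why quantization makes single-GPU fine-tuning plausible. The sketch below estimates memory for the weights alone, taking the nominal parameter counts from the model names as an assumption. It ignores activations, LoRA adapter weights, optimizer state, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Nominal parameter counts inferred from the model names (an assumption).
for name, n_params in [("Llama2-13B", 13e9), ("BioMedLM-2.7B", 2.7e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    int4 = weight_memory_gb(n_params, 4)
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{int4:.1f} GB at 4-bit")
```

Under these assumptions, a 13B-parameter model drops from about 26 GB of weights at 16-bit precision to about 6.5 GB at 4-bit, leaving headroom on a single card for the small set of trainable adapter parameters.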

Key Findings

CPLLM-Llama2 outperforms baselines on adult respiratory failure prediction by PR-AUC.

Numbers: PR-AUC 35.962% vs 35.050% (LogReg), +0.912% abs

Practical Use: If you fine-tune Llama2 with CPLLM prompts on similar EHR data, expect small but measurable PR-AUC gains over simple baselines on this task.

Evidence Ref: Table 2; section 3.2.1

CPLLM-Llama2 yields the largest PR-AUC gains for acute renal failure.

Numbers: PR-AUC 45.442% vs 43.603% (RETAIN), +1.839% abs (+4.22% relative)

Practical Use: For some diagnosis tasks (here acute renal failure), prompt-finetuned LLMs can deliver meaningful improvements and are worth testing against sequence models like RETAIN.

Evidence Ref: Table 2; section 3.2.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
PR-AUC	35.962%	Logistic Regression 35.050%	+0.912% abs	eICU-CRD respiratory failure	Table 2 rows for respiratory failure	Table 2
PR-AUC	45.442%	RETAIN 43.603%	+1.839% abs	MIMIC-IV unspecified renal failure	Table 2 rows for unspecified renal failure	Table 2
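Absolute and relative deltas are easy to confuse when reading tables like this one. The sketch below recomputes both from the Value and Baseline columns; the helper name `deltas` and the row labels are illustrative.

```python
def deltas(model_score, baseline_score):
    """Return (absolute, relative) improvement; scores are percentages."""
    absolute = model_score - baseline_score
    relative = 100.0 * absolute / baseline_score
    return absolute, relative


# Value and Baseline columns from the Results table.
rows = {
    "respiratory failure vs Logistic Regression": (35.962, 35.050),
    "renal failure vs RETAIN": (45.442, 43.603),
}
for name, (model, baseline) in rows.items():
    abs_pts, rel_pct = deltas(model, baseline)
    print(f"{name}: {abs_pts:+.3f} points absolute, {rel_pct:+.2f}% relative")
```

For the renal failure row this gives a gap of roughly 1.84 percentage points in absolute terms, which corresponds to about 4.2% relative to the RETAIN baseline.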

What To Try In 7 Days

Fine‑tune a 2–13B LLM on a small EHR slice using QLoRA/PEFT and your prompt template to validate PR‑AUC.

Add missing medical description tokens to the tokenizer and re-run to measure tokenization gains.

Compare CPLLM PR‑AUC to a logistic regression and one sequence baseline (RETAIN/Med‑BERT) on a key prediction task.
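For the comparison step, PR-AUC can be computed without extra dependencies. The sketch below uses the step-wise average-precision definition (the sum of precision at each threshold weighted by the recall increase), which is the one scikit-learn's `average_precision_score` follows. Whether the paper computes PR-AUC exactly this way is not stated here, and the labels and scores are made-up illustration data.

```python
def average_precision(labels, scores):
    """Step-wise PR-AUC (average precision) for binary labels in {0, 1}."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Sort by descending score; tied scores are treated as one threshold.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ap, tp, seen, prev_recall = 0.0, 0, 0, 0.0
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            tp += labels[order[j]]
            seen += 1
            j += 1
        recall = tp / total_pos
        precision = tp / seen
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap


# Hypothetical labels and model scores, for illustration only.
labels = [1, 0, 1, 0]
candidate_scores = [0.9, 0.8, 0.7, 0.1]
baseline_scores = [0.4, 0.8, 0.7, 0.1]
print("candidate:", average_precision(labels, candidate_scores))
print("baseline: ", average_precision(labels, baseline_scores))
```

Scoring every model on the same held-out split with the same function keeps the comparison between the fine-tuned LLM, logistic regression, and a sequence baseline apples to apples.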

Optimization Features

Token Efficiency

Added domain tokens to tokenizer
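One simple way to find candidate domain tokens is to list words in the medical descriptions that the tokenizer vocabulary lacks. The sketch below is a simplification under stated assumptions: the vocabulary and descriptions are toy data, the function name `missing_terms` is hypothetical, and real subword tokenizers rarely have truly "missing" words, so the selection rule used in the paper may differ.

```python
import re


def missing_terms(descriptions, vocab):
    """Return words from the descriptions that are absent from the vocabulary."""
    known = {token.lower() for token in vocab}
    missing = []
    for text in descriptions:
        for word in re.findall(r"[a-z]+", text.lower()):
            # Keep first-seen order and skip duplicates.
            if word not in known and word not in missing:
                missing.append(word)
    return missing


# Hypothetical toy vocabulary and descriptions, for illustration only.
vocab = {"acute", "failure", "chronic", "disease", "kidney"}
descriptions = ["Acute nephritis", "Chronic kidney disease", "Hyperkalemia"]
new_tokens = missing_terms(descriptions, vocab)
print(new_tokens)

# With Hugging Face transformers, a list like this could then be registered via
# tokenizer.add_tokens(new_tokens), followed by
# model.resize_token_embeddings(len(tokenizer)).
```

Newly added tokens start with untrained embeddings, so any gain from this step only shows up after fine-tuning.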

Infra Optimization

Single‑GPU fine‑tuning demonstrated (RTX6000)

Model Optimization

LoRA

Training Optimization

LoRA

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/nadavlab/CPLLM

Data URLs

https://physionet.org/content/mimiciv/2.0/

https://physionet.org/content/eicu-crd/2.0/

Risks & Boundaries

Limitations

Modest absolute PR‑AUC gains on some tasks; not all improvements are large.

Evaluation uses three runs; some variance (wide CIs) reported, especially for smaller BioMedLM in some settings.
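With only three runs, confidence intervals are wide almost by construction. The sketch below computes a mean and a 95% Student's t interval for exactly three values; the PR-AUC numbers are hypothetical, and it is an assumption that a t interval is an appropriate choice here (the paper's interval method is not described in this summary).

```python
import math
import statistics


def mean_ci95_three_runs(values):
    """Mean and 95% t-interval half-width for exactly three runs."""
    if len(values) != 3:
        raise ValueError("this sketch hard-codes the t value for n = 3")
    t_crit = 4.303  # two-sided 95% critical value, Student's t, 2 degrees of freedom
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(3)
    return mean, half_width


# Hypothetical PR-AUC values (in %) from three seeds, not taken from the paper.
mean, half = mean_ci95_three_runs([35.4, 36.0, 36.5])
print(f"{mean:.2f} +/- {half:.2f}")
```

Because the critical value for two degrees of freedom is above 4, even a run-to-run standard deviation of about half a point produces an interval wider than the sub-point gains reported on some tasks.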

When Not To Use

When GPU/compute budget is very limited and simpler models already meet performance needs.

When extremely long patient sequences are the focus; datasets here lacked very long histories.

Failure Modes

Overfitting or instability when dataset is small or class imbalance extreme (confidence intervals widen).

Prompt sensitivity: model requires prompt engineering and tokenizer changes that may not generalize.

Core Entities

Models

Llama2-13B, BioMedLM-2.7B (PubMedGPT), Med-BERT, RETAIN, Logistic Regression, ConCare, deeper, GRASP

Metrics

PR-AUC, ROC-AUC

Datasets

MIMIC-IV v2.0, eICU-CRD

Benchmarks

PyHealth
