Overview
Production Readiness
0.5
Novelty Score
0.5
Cost Impact Score
0.4
Citation Count
9
Why It Matters For Business
CPLLM shows you can repurpose public LLMs for EHR forecasting with no domain pretraining, achieving modest but consistent gains; this can speed deployment and lower data‑preparation costs.
Summary TLDR
This paper introduces CPLLM, a workflow that fine-tunes pre-trained LLMs (Llama2-13B and BioMedLM-2.7B) on structured EHR data turned into text prompts. Using quantized PEFT (QLoRA/LoRA) and a small number of added medical tokens, CPLLM outperformed baselines (Med‑BERT, RETAIN, logistic regression, PyHealth models) on three diagnosis tasks and readmission prediction (MIMIC‑IV and eICU‑CRD). The method needs no clinical pretraining, supports longer sequences (up to 4k tokens with Llama2), and is practical to run on a single GPU for fine-tuning.
Problem Statement
Can off‑the‑shelf LLMs be fine‑tuned (with quantization and prompt-style inputs) to predict future diagnoses and short-term hospital readmission from structured EHR sequences, without extra domain pretraining or visit-level timing?
Main Contribution
CPLLM: a prompt-based fine-tuning pipeline that encodes EHR diagnosis/procedure/drug sequences as text and trains LLMs for prediction.
Demonstration that quantized PEFT (QLoRA/LoRA) enables single‑GPU fine-tuning of large models (Llama2 and BioMedLM) for clinical prediction without clinical pretraining.
Empirical gains over Med‑BERT, RETAIN, logistic regression, and PyHealth baselines on diagnosis and readmission tasks; plus an ablation showing added domain tokens usually help.
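To make the prompt-based encoding above concrete, here is a minimal sketch of turning a patient's code descriptions into one text prompt. The function name `build_prompt`, the template wording, and the sample history are all hypothetical; the paper's actual template is not reproduced in this summary.

```python
from typing import Sequence


def build_prompt(descriptions: Sequence[str], target: str) -> str:
    """Render a patient's code descriptions as a single text prompt.

    The template wording is an assumption made for illustration only.
    """
    # Drop empty entries and join the remaining descriptions in visit order.
    history = ", ".join(d.strip() for d in descriptions if d.strip())
    return (
        f"The patient was diagnosed with the following: {history}. "
        f"Will the patient be diagnosed with {target}? Answer:"
    )


# Hypothetical patient history, not drawn from MIMIC-IV or eICU-CRD.
prompt = build_prompt(
    ["Essential hypertension", "Type 2 diabetes mellitus", ""],
    target="adult respiratory failure",
)
print(prompt)
```

A text prompt like this is what the fine-tuned LLM would consume in place of the raw code sequence.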
Key Findings
CPLLM-Llama2 outperforms baselines on adult respiratory failure prediction, as measured by PR-AUC.
CPLLM-Llama2 yields the largest PR-AUC gains for acute renal failure.
CPLLM achieves top readmission prediction results on MIMIC‑IV and eICU‑CRD.
Adding domain tokens usually improves performance.
CPLLM fine-tuning can run on a single commodity GPU using quantized PEFT (QLoRA/LoRA).
Results
PR-AUC
PR-AUC change from added tokenizer tokens
Who Should Care
What To Try In 7 Days
Fine‑tune a 2–13B LLM on a small EHR slice using QLoRA/PEFT and your prompt template to validate PR‑AUC.
Add missing medical description tokens to the tokenizer and re-run to measure tokenization gains.
Compare CPLLM PR‑AUC to a logistic regression and one sequence baseline (RETAIN/Med‑BERT) on a key prediction task.
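For the comparison step, one common way to summarize a precision-recall curve is average precision. The sketch below is a pure-Python version under stated assumptions: the labels and scores are made up, ties in scores are not handled specially, and the paper may compute PR-AUC differently (for example by trapezoidal integration).

```python
from typing import Sequence


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (average precision)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank examples from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this positive hit
    return total / n_pos


# Hypothetical labels and model scores for four patients.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(round(ap, 4))
```

Running the same function on the LLM's scores and on each baseline's scores gives directly comparable numbers, provided the same test split is used.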
Optimization Features
Token Efficiency
- Added domain tokens to tokenizer
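As a rough illustration of how candidate tokens might be found, the sketch below lists words from medical descriptions that are absent from a vocabulary. The helper `missing_tokens` and the toy word-level vocabulary are hypothetical; real LLM tokenizers use subword vocabularies, so this is a simplification rather than the paper's procedure.

```python
import re
from typing import Iterable, List, Set


def missing_tokens(descriptions: Iterable[str], vocab: Set[str]) -> List[str]:
    """Return lowercase words absent from vocab, in first-seen order."""
    seen: Set[str] = set()
    missing: List[str] = []
    for text in descriptions:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in vocab and word not in seen:
                seen.add(word)
                missing.append(word)
    return missing


# Toy vocabulary standing in for a real tokenizer's vocabulary (assumption).
toy_vocab = {"acute", "failure", "chronic", "disease"}
new_words = missing_tokens(["Acute renal failure", "Chronic kidney disease"], toy_vocab)
print(new_words)
```

With a Hugging Face tokenizer, a list like this could then be passed to `tokenizer.add_tokens(...)`, followed by `model.resize_token_embeddings(len(tokenizer))` so the embedding matrix matches the enlarged vocabulary.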
Infra Optimization
- Single‑GPU fine‑tuning demonstrated (RTX6000)
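A back-of-the-envelope calculation suggests why quantization makes single-GPU fine-tuning plausible. The sketch assumes roughly 13 billion parameters for Llama2-13B and 4-bit weights (the precision QLoRA typically uses); it counts weight storage only and ignores activations, adapter gradients, optimizer state, and quantization constants.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 13e9  # approximate parameter count, an assumption for illustration

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # half-precision weights
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantized weights
print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

Under these assumptions the weights shrink from about 26 GB to about 6.5 GB, which leaves headroom on a single card for activations and the small set of trainable adapter parameters.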
Model Optimization
- Quantization (QLoRA)
Training Optimization
- LoRA
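To give a sense of scale, LoRA adds two low-rank matrices per adapted weight, so a weight of shape d_out x d_in gains rank * (d_in + d_out) trainable parameters. The numbers below are assumptions for illustration (hidden size 5120 and 40 layers for a 13B Llama-style model, rank 8, two adapted projection matrices per layer); the summary does not state the paper's actual LoRA settings.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int, n_matrices: int) -> int:
    """Trainable parameters added by LoRA across n_matrices adapted weights."""
    # Each adapter is A (rank x d_in) plus B (d_out x rank).
    return rank * (d_in + d_out) * n_matrices


HIDDEN = 5120          # assumed hidden size
LAYERS = 40            # assumed number of transformer layers
ADAPTED_PER_LAYER = 2  # assumed: two projection matrices adapted per layer
RANK = 8               # assumed LoRA rank

trainable = lora_trainable_params(HIDDEN, HIDDEN, RANK, LAYERS * ADAPTED_PER_LAYER)
print(trainable, f"{trainable / 13e9:.4%} of ~13B parameters")
```

Under these assumptions only a few million parameters are trained, a small fraction of a percent of the full model.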
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Modest absolute PR‑AUC gains on some tasks; not all improvements are large.
- Evaluation uses only three runs, and wide confidence intervals are reported in some settings, especially for the smaller BioMedLM.
- Higher compute and inference cost than lightweight baselines; Llama2 fine-tuning took about 1 day on an RTX6000.
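With only three runs, interval width is driven by a large t critical value. The sketch below computes a mean and a two-sided 95% half-width using the Student t distribution; the PR-AUC values are hypothetical, and whether the paper computed its intervals this way is an assumption.

```python
import math
import statistics
from typing import Sequence, Tuple

# Two-sided 95% t critical values, keyed by number of runs (df = runs - 1).
T_CRIT_95 = {2: 12.706, 3: 4.303, 4: 3.182, 5: 2.776}


def mean_ci95(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, half-width of the 95% confidence interval)."""
    n = len(values)
    mean = statistics.mean(values)
    half_width = T_CRIT_95[n] * statistics.stdev(values) / math.sqrt(n)
    return mean, half_width


# Hypothetical PR-AUC values from three fine-tuning runs.
mean, half_width = mean_ci95([0.40, 0.42, 0.44])
print(f"{mean:.3f} +/- {half_width:.3f}")
```

Even a run-to-run spread of 0.02 yields a half-width of about 0.05 here, which is why small gains over a baseline can be hard to separate from noise at three runs.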
When Not To Use
- When GPU/compute budget is very limited and simpler models already meet performance needs.
- When extremely long patient sequences are the focus; datasets here lacked very long histories.
- When explainability needs require transparent, simple models rather than LLMs.
Failure Modes
- Overfitting or instability when the dataset is small or class imbalance is extreme (confidence intervals widen).
- Prompt sensitivity: the model requires prompt engineering and tokenizer changes that may not generalize.
- Higher latency and cost during inference compared to compact clinical models.
Core Entities
Models
- Llama2-13B
- BioMedLM-2.7B (PubMedGPT)
- Med-BERT
- RETAIN
- Logistic Regression
- ConCare
- Deepr
- GRASP
Metrics
- PR-AUC
- ROC-AUC
Datasets
- MIMIC-IV v2.0
- eICU-CRD
Benchmarks
- PyHealth

