Overview
Results show consistent PR‑AUC and ROC‑AUC improvements on two public datasets, but gains are modest, experiments are limited to three runs, and compute costs for large LLMs are nontrivial.
Citations: 9
Evidence Strength: 0.70
Confidence: 0.82
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 50%
Novelty: 50%
Why It Matters For Business
CPLLM shows you can repurpose public LLMs for EHR forecasting with no domain pretraining, achieving modest but consistent gains; this can speed deployment and lower data‑preparation costs.
Summary TLDR
This paper introduces CPLLM, a workflow that fine-tunes pre-trained LLMs (Llama2-13B and BioMedLM-2.7B) on structured EHR data converted into text prompts. Using quantized PEFT (QLoRA/LoRA) and a small number of added medical tokens, CPLLM outperformed baselines (Med‑BERT, RETAIN, logistic regression, PyHealth models) on three diagnosis tasks and readmission prediction (MIMIC‑IV and eICU‑CRD). The method needs no clinical pretraining, supports longer sequences (up to 4k tokens with Llama2), and can be fine-tuned on a single GPU.
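The step of turning structured EHR codes into a text prompt can be sketched as follows. This is a minimal illustration, not the paper's template: the function name `build_prompt`, the prompt wording, and the example diagnoses are all assumptions made for this sketch.

```python
from typing import List


def build_prompt(descriptions: List[str], target: str) -> str:
    """Join an ordered list of code descriptions into one yes/no prompt.

    The wording below is a hypothetical template, not the one used in the paper.
    """
    # Drop empty entries and keep the original visit order.
    history = ", ".join(d.strip() for d in descriptions if d.strip())
    return (
        f"The patient was diagnosed with the following conditions, in order: {history}.\n"
        f"Will the patient be diagnosed with {target}? Answer yes or no."
    )


# Hypothetical patient history; the target task name comes from the summary above.
example = build_prompt(
    ["Essential hypertension", "Type 2 diabetes mellitus"],
    "adult respiratory failure",
)
print(example)
```

Using textual descriptions rather than raw codes is what lets a general-purpose tokenizer handle the input; any real template would need to be validated against the task.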
Problem Statement
Can off‑the‑shelf LLMs be fine‑tuned (with quantization and prompt-style inputs) to predict future diagnoses and short-term hospital readmission from structured EHR sequences, without extra domain pretraining or visit-level timing?
Main Contribution
CPLLM: a prompt-based fine-tuning pipeline that encodes EHR diagnosis/procedure/drug sequences as text and trains LLMs for prediction.
Demonstration that quantized PEFT (QLoRA/LoRA) enables single‑GPU fine-tuning of large models (Llama2 and BioMedLM) for clinical prediction without clinical pretraining.
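A rough back-of-the-envelope calculation shows why quantization matters for single-GPU fine-tuning. The sketch below assumes the nominal parameter counts implied by the model names (13B and 2.7B) and counts only weight storage; activations, LoRA adapter weights, optimizer state, and quantization metadata are ignored, so real usage is higher. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Nominal parameter counts taken from the model names; an assumption of this sketch.
models = [("Llama2-13B", 13e9), ("BioMedLM-2.7B", 2.7e9)]

for name, n_params in models:
    fp16 = weight_memory_gb(n_params, 16)  # half-precision weights
    int4 = weight_memory_gb(n_params, 4)   # 4-bit quantized weights
    print(f"{name}: fp16 ~{fp16:.1f} GB, 4-bit ~{int4:.1f} GB")
```

Under these assumptions, 4-bit storage cuts the weight footprint to a quarter of the half-precision size, which is the main reason a 13B model can fit alongside small trainable adapters on one device.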
Key Findings
CPLLM-Llama2 outperforms baselines on adult respiratory failure prediction by PR-AUC.
CPLLM-Llama2 yields the largest PR-AUC gains for acute renal failure.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| PR-AUC | 35.962% | Logistic Regression 35.050% | +0.912% abs | eICU-CRD respiratory failure | Table 2 rows for respiratory failure | Table 2 |
| PR-AUC | 45.442% | RETAIN 43.603% | +1.839% abs (+4.22% rel) | MIMIC-IV unspecified renal failure | Table 2 rows for unspecified renal failure | Table 2 |
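
The deltas can be recomputed directly from the Value and Baseline columns. The short check below uses only the numbers in the table; the helper name `deltas` is hypothetical. It separates the absolute difference (in percentage points) from the relative change, and the 4.22 figure for renal failure corresponds to the relative change.

```python
from typing import Tuple


def deltas(value: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute difference in points, relative change in percent)."""
    abs_delta = value - baseline
    rel_delta = 100.0 * abs_delta / baseline
    return abs_delta, rel_delta


# (value, baseline) PR-AUC pairs copied from the table above.
rows = {
    "eICU-CRD respiratory failure": (35.962, 35.050),
    "MIMIC-IV unspecified renal failure": (45.442, 43.603),
}

for task, (value, baseline) in rows.items():
    abs_delta, rel_delta = deltas(value, baseline)
    print(f"{task}: {abs_delta:+.3f} points absolute, {rel_delta:+.2f}% relative")
```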
What To Try In 7 Days
Fine‑tune a 2–13B LLM on a small EHR slice using QLoRA/PEFT and your prompt template to validate PR‑AUC.
Add missing medical description tokens to the tokenizer and re-run to measure tokenization gains.
Compare CPLLM PR‑AUC to a logistic regression and one sequence baseline (RETAIN/Med‑BERT) on a key prediction task.
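For the comparison step, one common estimator of PR‑AUC is average precision. The sketch below is a dependency-free version for quick sanity checks, with hypothetical labels and scores; whether the paper used this exact estimator is not stated here, and a library implementation may treat tied scores differently.

```python
from typing import Sequence


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@k over the ranks k that hold a positive example.

    Ties in `scores` are broken by input order (Python's sort is stable),
    which can differ from implementations that group tied thresholds.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos


# Hypothetical labels and scores for two models on the same four patients.
labels = [1, 0, 1, 0]
llm_scores = [0.9, 0.8, 0.7, 0.1]
logreg_scores = [0.6, 0.7, 0.2, 0.1]

print("LLM AP:", average_precision(labels, llm_scores))
print("LogReg AP:", average_precision(labels, logreg_scores))
```

Scoring both models on an identical held-out split keeps the comparison fair; with strong class imbalance, PR‑AUC is more informative than ROC‑AUC.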
Risks & Boundaries
Limitations
Modest absolute PR‑AUC gains on some tasks; not all improvements are large.
Evaluation uses only three runs, and some results show high variance (wide CIs), especially for the smaller BioMedLM in some settings.
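To see why three runs produce wide intervals, consider a Student-t confidence interval with two degrees of freedom. The sketch below is illustrative only: the run values are hypothetical, the helper name `ci95_three_runs` is made up for this example, and the paper's own interval method may differ.

```python
import statistics
from typing import Sequence, Tuple

# Two-sided 95% Student-t critical value for 2 degrees of freedom (n = 3 runs).
T_CRIT_95_DF2 = 4.303


def ci95_three_runs(values: Sequence[float]) -> Tuple[float, float]:
    """95% t-interval for the mean of exactly three runs."""
    if len(values) != 3:
        raise ValueError("this sketch hard-codes the critical value for three runs")
    mean = statistics.mean(values)
    # Standard error uses the sample standard deviation (n - 1 denominator).
    half_width = T_CRIT_95_DF2 * statistics.stdev(values) / len(values) ** 0.5
    return mean - half_width, mean + half_width


# Hypothetical PR-AUC values (in percent) from three seeds.
low, high = ci95_three_runs([35.0, 36.0, 37.0])
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```

With only three runs the critical value is more than twice the large-sample 1.96, so even a one-point spread between seeds yields an interval several points wide.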
When Not To Use
When GPU/compute budget is very limited and simpler models already meet performance needs.
When extremely long patient sequences are the focus; datasets here lacked very long histories.
Failure Modes
Overfitting or instability when the dataset is small or class imbalance is extreme (confidence intervals widen).
Prompt sensitivity: the model depends on prompt engineering and tokenizer changes that may not generalize.

