Overview
Production Readiness
0.2
Novelty Score
0.65
Cost Impact Score
0.7
Citation Count
5
Why It Matters For Business
You can train and host a capable clinical LLM without private patient notes, lowering legal barriers and API costs while keeping models runnable inside hospitals or on-prem.
Summary TLDR
The authors build Asclepius, a pair of clinical LLMs (7B and 13B parameters) trained only on 158k synthetic discharge summaries derived from public case reports. The synthetic notes were written to mimic real clinical notes and show similar language statistics, as measured by perplexity. On multiple real-note benchmarks, GPT-4 and clinician evaluations find that Asclepius-13B performs close to GPT-3.5-turbo and comparably to a variant trained on 57k real notes. All code, models, and synthetic data are released for research; the models are not yet recommended for clinical deployment.
Problem Statement
Real clinical notes are private and hard to share, which blocks building and releasing task-capable clinical LLMs. Can publicly available case reports be converted into synthetic clinical notes and used to train a high-quality, shareable clinical LLM without using private patient data?
Main Contribution
Created 158k synthetic discharge summaries from PMC case reports and produced 158,114 instruction–answer pairs for clinical NLP tasks.
Released Asclepius-7B and Asclepius-13B: clinical LLMs trained on synthetic notes with domain-adaptive pretraining then instruction fine-tuning.
Evaluated models on real clinical notes using automated GPT-4 scoring and clinician ratings, showing that synthetic-trained models match the performance of real-note-trained variants on the tested benchmarks.
Open-sourced weights, code, and synthetic data for research and reuse.
Key Findings
Synthetic notes reach realistic language statistics after conversion.
Models trained on synthetic notes perform close to real-note-trained models, as judged by human experts.
Asclepius-13B is competitive with larger API models on tested tasks.
Hallucination rates are similar between synthetic- and real-note trained models.
Dataset and model scale matter: more synthetic notes improved performance.
Results
Synthetic note perplexity (vs real)
Clinician average quality score (4=best)
GPT-4 automated score (4-point scale) on MIMIC-III
Hallucination proxy (unacceptable responses count)
Automated–human alignment
Who Should Care
What To Try In 7 Days
Download the released synthetic dataset and run quick fine-tuning on a 7B LLaMA checkpoint to test your local pipeline.
Run a small clinician review on 50 model outputs to measure hallucination and task fit before deeper investment.
Compare prompt-based performance vs your current API usage to estimate potential cost and latency savings.
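The small clinician review suggested above can be tallied with a few lines of Python. This is a minimal sketch, not the authors' evaluation code: the function name `summarize_review`, the sample ratings, and the choice to treat a score of 1 as "unacceptable" are all assumptions made for illustration. Only the 4-point scale (4 = best) comes from the summary.

```python
from statistics import mean


def summarize_review(scores, unacceptable_max=1):
    """Summarize clinician ratings on a 4-point scale (4 = best).

    `unacceptable_max` is an assumed cutoff: ratings at or below it are
    counted as unacceptable and used as a rough hallucination proxy.
    """
    if not scores:
        raise ValueError("no ratings to summarize")
    if any(s not in (1, 2, 3, 4) for s in scores):
        raise ValueError("ratings must be integers from 1 to 4")
    unacceptable = sum(1 for s in scores if s <= unacceptable_max)
    return {
        "n": len(scores),
        "mean_score": mean(scores),
        "unacceptable_count": unacceptable,
        "unacceptable_rate": unacceptable / len(scores),
    }


# Hypothetical ratings for ten model outputs (a real review would use ~50).
ratings = [4, 3, 4, 1, 2, 4, 3, 3, 4, 1]
summary = summarize_review(ratings)
print(summary)
```

With only 50 rated outputs the unacceptable rate is a coarse estimate, so it is probably best read as a go/no-go signal rather than a precise measurement.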
Optimization Features
Token Efficiency
- Extended context length to 2048 tokens to fit discharge summaries
Infra Optimization
- Training done on 8x A100 80G or 8x A6000 48G GPUs
Model Optimization
- Domain-adaptive pretraining on synthetic notes
System Optimization
- Used efficient training techniques to offset longer contexts
Training Optimization
- One pretraining epoch + three instruction fine-tuning epochs
- Learning rate 2e-5, global batch size 128
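To see how these settings might map onto an 8-GPU run, the sketch below works through the batch-size arithmetic. The learning rate, global batch size, epoch counts, context length, GPU count, and the 158,114 instruction pairs come from this summary; the per-device batch size of 4, the dictionary keys, and the helper name `grad_accum_steps` are hypothetical. The step count also assumes every instruction pair is used once per epoch and the final partial batch is kept.

```python
import math

# Values reported in the summary.
config = {
    "learning_rate": 2e-5,
    "global_batch_size": 128,
    "pretrain_epochs": 1,
    "finetune_epochs": 3,
    "max_seq_len": 2048,
    "num_gpus": 8,
}

# Assumption: not stated in the summary; pick whatever fits GPU memory.
PER_DEVICE_BATCH = 4
NUM_INSTRUCTION_PAIRS = 158_114


def grad_accum_steps(global_batch, num_gpus, per_device_batch):
    """Gradient accumulation steps needed to reach the global batch size."""
    per_step = num_gpus * per_device_batch
    if global_batch % per_step != 0:
        raise ValueError("global batch must be a multiple of num_gpus * per_device_batch")
    return global_batch // per_step


accum = grad_accum_steps(
    config["global_batch_size"], config["num_gpus"], PER_DEVICE_BATCH
)
steps_per_epoch = math.ceil(NUM_INSTRUCTION_PAIRS / config["global_batch_size"])
total_finetune_steps = steps_per_epoch * config["finetune_epochs"]

print(accum, steps_per_epoch, total_finetune_steps)
```

On GPUs with less memory (for example the 48G cards), lowering the per-device batch and raising the accumulation steps keeps the global batch size at 128.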
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Focused only on discharge summaries; other note types not evaluated.
- Model supports single-turn instruction following only; no multi-turn dialogue tested.
- Initial synthetic data generation used GPT outputs; licensing/terms may restrict some downstream uses.
- Hallucination behavior not exhaustively analyzed; some unsafe responses exist.
- Clinician evaluation sample size is modest (n=100), limiting statistical power.
When Not To Use
- Do not deploy for live patient care without thorough local validation and safeguards.
- Avoid using for multi-turn clinical decision support; model supports only single-turn tasks.
- Do not assume performance on other note types (radiology, nursing, progress notes).
Failure Modes
- Hallucinated entities or facts not present in source notes.
- Incorrect clinical interpretations leading to unsafe recommendations.
- False assertions that an item does not exist when it does.
- Ambiguous answers that could be misread by clinicians.
Core Entities
Models
- Asclepius-7B
- Asclepius-13B
- Asclepius-R-13B
- LLaMA
- GPT-3.5-turbo
- GPT-4
- Alpaca
- Vicuna
- MedAlpaca
- ChatDoctor
- Clinical-Camel
Metrics
- Perplexity
- 4-point clinician quality score
- Krippendorff's alpha
- p-value (paired t-test)
- Pearson/Kendall/Spearman correlations
- Unacceptable-response counts (hallucination proxy)
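Two of these metrics are simple enough to sketch with the standard library alone. The example below computes perplexity from per-token log-probabilities and a Spearman rank correlation between two score lists, for instance automated and clinician ratings of the same outputs. The function names and sample inputs are illustrative assumptions and do not reproduce the paper's evaluation scripts; in practice a statistics library would likely be used instead.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical inputs: a model assigning probability 0.25 to every token,
# and automated vs. clinician scores for five outputs.
ppl = perplexity([math.log(0.25)] * 4)
auto_scores = [4, 3, 2, 4, 1]
human_scores = [4, 3, 3, 4, 2]
rho = spearman(auto_scores, human_scores)
print(ppl, rho)
```

A lower perplexity of real notes under a model trained on synthetic notes (or vice versa) indicates closer language statistics, and a rank correlation near 1 indicates that the automated scorer orders outputs much as clinicians do.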
Datasets
- PMC-Patients
- MIMIC-III
- MIMIC-IV
- i2b2
- CASI
- MTSamples
- DiSCQ
Benchmarks
- DiSCQ (clinician questions)
- CASI (Abbreviation Expansion, Coreference)
- MIMIC-derived discharge-summary instruction tasks

