Overview
Promising for research and on-prem experimentation, but not yet safe for direct clinical use because hallucination risk and single-turn limits remain.
Citations: 5
Evidence Strength: 0.75
Confidence: 0.90
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 20%
Novelty: 65%
Why It Matters For Business
You can train and host a capable clinical LLM without private patient notes, which lowers legal barriers and API costs and keeps the model runnable on-prem, including inside hospitals.
Summary TLDR
The authors build Asclepius, a pair of clinical LLMs (7B and 13B parameters) trained only on 158k synthetic discharge summaries derived from public case reports. The synthetic notes were written to mimic real clinical notes and show similar language statistics, as measured by perplexity. On multiple real-note benchmarks, GPT-4 and clinician evaluations find that Asclepius-13B performs close to GPT-3.5-turbo and comparably to a variant trained on 57k real notes. All code, models, and synthetic data are released for research; the models are not yet recommended for clinical deployment.
Problem Statement
Real clinical notes are private and hard to share, which blocks building and releasing task-capable clinical LLMs. Can publicly available case reports be converted into synthetic clinical notes and used to train a high-quality, shareable clinical LLM without using private patient data?
Main Contribution
Created 158k synthetic discharge summaries from PMC case reports and produced 158,114 instruction–answer pairs for clinical NLP tasks.
Released Asclepius-7B and Asclepius-13B: clinical LLMs trained on synthetic notes with domain-adaptive pretraining followed by instruction fine-tuning.
Key Findings
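After conversion, synthetic notes reach a perplexity in the range of real hospital notes (4.816, versus 2.186 for MIMIC-III and 5.178 for i2b2).
In clinician evaluation, the model trained on synthetic notes scores close to the model trained on real notes (3.03 vs 3.15 out of 4; paired t-test p=0.18).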
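The perplexity comparison behind the first finding can be illustrated with a minimal sketch. It assumes you already have per-token natural-log probabilities from some scoring language model; this summary does not say which scoring model was used, and the function name `perplexity` and the sample values below are hypothetical, not taken from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical per-token log-probabilities (illustrative only).
note_like_text = [-0.5, -1.0, -0.8, -0.7]      # text the scorer finds familiar
unfamiliar_text = [-4.0, -4.5, -3.9, -4.6]     # text the scorer finds surprising

print(perplexity(note_like_text))
print(perplexity(unfamiliar_text))
```

Lower perplexity means the scoring model finds the text less surprising, which is why a drop after conversion is read as the synthetic notes moving closer to the style of real notes.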
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Synthetic note perplexity (vs real) | Synthetic notes 4.816; PMC-Patients raw 71.719; MIMIC-III 2.186; i2b2 5.178 | Perplexity of real hospital notes (MIMIC-III, i2b2) | Perplexity dropped from 71.719 (raw PMC-Patients) to 4.816 after conversion | 200-sample comparisons described in Section 2.1 | Perplexity table in Section 2.1 | Section 2.1 |
| Clinician average quality score (4=best) | Asclepius-13B 3.03; Asclepius-R-13B 3.15 | Asclepius-R-13B (real-note trained) | Difference 0.12; paired t-test p=0.18 (no significant difference at n=100) | 100 DiSCQ questions | Professional evaluation in Section 4.3 | Section 4.3 |
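The paired comparison in the clinician row can be sketched as follows. This is a rough illustration, not the authors' analysis code: `paired_t` is a hypothetical helper, the scores are made up, and the p-value uses a normal approximation to the t distribution, which is only reasonable for larger samples such as the n=100 reported above.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_t(scores_a, scores_b):
    """Return (t statistic, approximate two-sided p) for paired scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    # Assumption: normal approximation instead of the exact t distribution.
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p_approx


# Hypothetical 1-4 quality scores for the same questions, two models.
real_note_model = [3, 4, 2, 4]
synthetic_note_model = [2, 4, 3, 3]

t_stat, p_value = paired_t(real_note_model, synthetic_note_model)
print(t_stat, p_value)
```

Each question is scored for both models, so the test works on per-question differences rather than on the two score lists independently. For small samples, an exact t-distribution p-value from a statistics library would be the safer choice.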
What To Try In 7 Days
Download the released synthetic dataset and run quick fine-tuning on a 7B LLaMA checkpoint to test your local pipeline.
Run a small clinician review on 50 model outputs to measure hallucination and task fit before deeper investment.
Compare prompt-based performance vs your current API usage to estimate potential cost and latency savings.
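For the small clinician review, one way to report the result is a hallucination rate with an uncertainty range, since 50 outputs is a small sample. The sketch below uses a Wilson score interval; the helper name `wilson_interval` and the counts are hypothetical, and the 95% level (z = 1.96) is an assumption rather than anything specified in the source.

```python
import math


def wilson_interval(flagged, total, z=1.96):
    """Wilson score interval for a proportion (default z gives about 95%)."""
    if total <= 0 or not 0 <= flagged <= total:
        raise ValueError("need 0 <= flagged <= total and total > 0")
    p = flagged / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical review outcome: clinicians flag 5 of 50 outputs as hallucinated.
low, high = wilson_interval(flagged=5, total=50)
print(f"observed rate {5 / 50:.2f}, plausible range {low:.3f} to {high:.3f}")
```

A wide interval is a signal that the review is too small to support a deployment decision and should be extended before deeper investment.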
Risks & Boundaries
Limitations
Focused only on discharge summaries; other note types not evaluated.
Model supports single-turn instruction following only; no multi-turn dialogue tested.
When Not To Use
Do not deploy for live patient care without thorough local validation and safeguards.
Avoid using for multi-turn clinical decision support; model supports only single-turn tasks.
Failure Modes
Hallucinated entities or facts not present in source notes.
Incorrect clinical interpretations leading to unsafe recommendations.
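As a crude illustration of the first failure mode, the sketch below flags numbers in a model answer (doses, lab values) that never appear in the source note. It is a hypothetical screening heuristic, not a method from the paper: `unsupported_numbers` and the example texts are invented, and the check cannot catch hallucinated names, diagnoses, or paraphrased values, so it does not replace clinician review.

```python
import re

# Matches integers and simple decimals such as "25" or "4.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(note, answer):
    """Return numbers that appear in the answer but nowhere in the note."""
    note_numbers = set(NUMBER.findall(note))
    return [n for n in NUMBER.findall(answer) if n not in note_numbers]


# Hypothetical note and model answer (illustrative only).
note = "Patient received metoprolol 25 mg twice daily. Sodium was 138."
answer = "The patient was on metoprolol 50 mg; sodium was 138."

print(unsupported_numbers(note, answer))
```

Any flagged value is a prompt for a human to check the answer against the note, not proof of an error.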

