Train a usable clinical LLM from 158k synthetic discharge summaries and share it publicly

Overview

Decision Snapshot: Needs Validation

Promising for research and on-prem experimentation, but not yet safe for direct clinical use because hallucination risk and single-turn limits remain.

Citations: 5

Evidence Strength: 0.75

Confidence: 0.90

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 20%

Novelty: 65%

Authors

Sunjun Kweon, Junu Kim, Jiyoun Kim, Sujeong Im, Eunbyeol Cho, Seongsu Bae, Jungwoo Oh, Gyubok Lee, Jong Hak Moon, Seng Chan You, Seungjin Baek, Chang Hoon Han, Yoon Bin Jung, Yohan Jo, Edward Choi

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can train and host a capable clinical LLM without private patient notes, lowering legal barriers and API costs while keeping models runnable inside hospitals or on-prem.

Who Should Care

CTO, ML Engineer, Data Scientist, Product Manager, Founder

Summary TLDR

The authors build Asclepius, a pair of clinical LLMs (7B and 13B parameters) trained only on 158k synthetic discharge summaries derived from public case reports. The synthetic notes are written to mimic real clinical notes, and their perplexity falls within the range measured on real hospital notes. On multiple real-note benchmarks, GPT-4 and clinician evaluations find that Asclepius-13B performs close to GPT-3.5-turbo and comparably to a variant trained on 57k real notes. All code, models, and synthetic data are released for research; the models are not yet recommended for clinical deployment.

Problem Statement

Real clinical notes are private and hard to share, which blocks building and releasing task-capable clinical LLMs. Can publicly available case reports be converted into synthetic clinical notes and used to train a high-quality, shareable clinical LLM without using private patient data?

Main Contribution

Created 158k synthetic discharge summaries from PMC case reports and produced 158,114 instruction–answer pairs for clinical NLP tasks.

Released Asclepius-7B and Asclepius-13B: clinical LLMs trained on synthetic notes, using domain-adaptive pretraining followed by instruction fine-tuning.

Key Findings

Synthetic notes reach realistic language statistics after conversion.

Numbers: Perplexity: synthetic 4.816 vs real hospital range 2.186–5.178

Practical Use: You can transform public case reports into discharge-style notes that look statistically like real notes; use them to pretrain or adapt LMs without private data.

Evidence Ref: Section 2.1, perplexity table
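The perplexity figures above can be read through the standard definition: the exponential of the negative mean log-probability a language model assigns to each token. The following is a minimal sketch of that formula only. The summary does not say which scoring model or tokenizer produced the reported numbers, and the log-probabilities below are hypothetical placeholders, not values from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token log-probabilities (illustrative only).
fluent_note = [math.log(0.5)] * 4      # scoring model gives p = 0.5 to every token
surprising_note = [math.log(0.1)] * 4  # scoring model gives p = 0.1 to every token

print(perplexity(fluent_note))
print(perplexity(surprising_note))
```

Lower values mean the scoring model finds the text less surprising, which is why a drop from the raw case-report value toward the real-note range is read as the synthetic notes becoming more note-like in style.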

Models trained on synthetic notes are rated close to real-note-trained models by clinicians.

Numbers: Clinician average scores: Asclepius-13B 3.03 vs Asclepius-R-13B 3.15 (p=0.18)

Practical Use: For many clinical NLP tasks, a synthetic-data-trained model can be a practical alternative to one trained on protected patient notes, provided it is validated locally first.

Evidence Ref: Section 4.3 professional evaluation
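The clinician comparison above relies on a paired t-test, where each question receives a score for both models and the test is run on the per-question differences. Below is a rough sketch of that calculation using only the standard library. The scores are hypothetical, the function name `paired_t` is illustrative, and the p-value uses a normal approximation to the t distribution (reasonable near n = 100, too optimistic for small samples), so it is not expected to reproduce the paper's exact figure.

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Return (mean difference, t statistic, approximate two-sided p-value)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd / math.sqrt(n))
    # Assumption: normal approximation instead of the exact t distribution.
    p = 2 * (1 - statistics.NormalDist().cdf(abs(t)))
    return mean_diff, t, p


# Hypothetical 4-point scores for the same eight questions (not paper data).
synthetic_model = [3, 4, 2, 3, 4, 3, 2, 4]
real_note_model = [3, 4, 3, 3, 4, 4, 2, 4]

mean_diff, t, p = paired_t(synthetic_model, real_note_model)
print(f"mean difference {mean_diff:+.2f}, t = {t:.3f}, approx p = {p:.3f}")
```

A non-significant p-value, as reported in the source, means the test did not detect a difference at that sample size; it does not by itself show the two models are equivalent.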

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Synthetic note perplexity (vs real)	Synthetic notes 4.816; PMC-Patients raw 71.719; MIMIC-III 2.186; i2b2 5.178	Perplexity of real hospital notes (MIMIC-III 2.186 to i2b2 5.178)	Perplexity dropped from 71.719 (raw case reports) to 4.816 after conversion	200-sample comparisons described in Section 2.1	Perplexity table in Section 2.1	Section 2.1
Clinician average quality score (4=best)	Asclepius-13B 3.03; Asclepius-R-13B 3.15	Asclepius-R-13B (real-note trained)	Difference 0.12; paired t-test p=0.18 (no significant difference at n=100)	100 DiSCQ questions	Professional evaluation in Section 4.3	Section 4.3

What To Try In 7 Days

Download the released synthetic dataset and run quick fine-tuning on a 7B LLaMA checkpoint to test your local pipeline.

Run a small clinician review on 50 model outputs to measure hallucination and task fit before deeper investment.

Compare prompt-based performance vs your current API usage to estimate potential cost and latency savings.
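The clinician review step in the list above can be made reproducible with a seeded sample and a confidence interval on the flagged rate. This is a minimal sketch under stated assumptions: the output identifiers, the count of flagged responses, and the helper names `sample_for_review` and `wilson_interval` are all hypothetical, and the Wilson score interval is one common choice for small-sample proportions rather than anything prescribed by the source.

```python
import math
import random


def sample_for_review(outputs, k=50, seed=0):
    """Draw a reproducible random sample of model outputs for clinician review."""
    rng = random.Random(seed)
    return rng.sample(outputs, min(k, len(outputs)))


def wilson_interval(failures, n, z=1.96):
    """Approximate 95% Wilson score interval for a failure proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical: 200 stored outputs; reviewers flag 6 of the 50 sampled.
outputs = [f"output-{i}" for i in range(200)]
batch = sample_for_review(outputs)
low, high = wilson_interval(failures=6, n=len(batch))
print(f"flagged 6/{len(batch)}, 95% interval {low:.2f} to {high:.2f}")
```

With only 50 reviewed outputs the interval is wide, so a result like this is better treated as a go/no-go signal for a larger review than as a measured hallucination rate.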

Optimization Features

Token Efficiency

Extended context length to 2048 tokens to fit discharge summaries

Infra Optimization

Training done on 8x A100 80G or 8x A6000 48G GPUs

Model Optimization

Domain-adaptive pretraining on synthetic notes

System Optimization

Used efficient training techniques to offset longer contexts

Training Optimization

One pretraining epoch + three instruction fine-tuning epochs

Learning rate 2e-5, global batch size 128
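To reuse these settings on different hardware, it helps to see how the global batch size decomposes into GPUs, per-GPU micro-batch, and gradient accumulation. The sketch below records the values listed in this summary and does the arithmetic. The per-GPU micro-batch of 4 is an assumption (the source does not state it), `TrainConfig` is a hypothetical name, and the step count assumes all 158,114 instruction pairs are used in each fine-tuning epoch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values listed in this summary.
    learning_rate: float = 2e-5
    global_batch_size: int = 128
    pretrain_epochs: int = 1
    finetune_epochs: int = 3
    max_seq_len: int = 2048
    num_gpus: int = 8
    # Assumption: the per-GPU micro-batch is not stated in the source.
    per_device_batch_size: int = 4

    def grad_accum_steps(self) -> int:
        """Accumulation steps so GPUs x micro-batch x steps = global batch."""
        per_step = self.num_gpus * self.per_device_batch_size
        if self.global_batch_size % per_step:
            raise ValueError("global batch must be divisible by GPUs x micro-batch")
        return self.global_batch_size // per_step

    def finetune_optimizer_steps(self, num_examples: int) -> int:
        """Total optimizer updates across all fine-tuning epochs."""
        steps_per_epoch = -(-num_examples // self.global_batch_size)  # ceiling division
        return steps_per_epoch * self.finetune_epochs


cfg = TrainConfig()
print(cfg.grad_accum_steps())                   # 128 / (8 * 4)
print(cfg.finetune_optimizer_steps(158_114))    # ceil(158114 / 128) * 3
```

On smaller GPUs, lowering `per_device_batch_size` and letting the accumulation steps rise keeps the effective batch size at 128.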

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/starmpcc/Asclepius

Data URLs

https://github.com/starmpcc/Asclepius (synthetic notes and instruction pairs)

Risks & Boundaries

Limitations

Focused only on discharge summaries; other note types not evaluated.

Model supports single-turn instruction following only; no multi-turn dialogue tested.

When Not To Use

Do not deploy for live patient care without thorough local validation and safeguards.

Avoid using for multi-turn clinical decision support; model supports only single-turn tasks.

Failure Modes

Hallucinated entities or facts not present in source notes.

Incorrect clinical interpretations leading to unsafe recommendations.

Core Entities

Models

Asclepius-7B, Asclepius-13B, Asclepius-R-13B, LLaMA, GPT-3.5-turbo, GPT-4, Alpaca, Vicuna, MedAlpaca, ChatDoctor, Clinical-Camel

Metrics

Perplexity; 4-point clinician quality score; Krippendorff's alpha; p-value (paired t-test); Pearson/Kendall/Spearman correlations; unacceptable-response counts (hallucination proxy)

Datasets

PMC-Patients, MIMIC-III, MIMIC-IV, i2b2, CASI, MTSamples, DiSCQ

Benchmarks

DiSCQ (clinician questions); CASI (Abbreviation Expansion, Coreference); MIMIC-derived discharge-summary instruction tasks
