Train a usable clinical LLM from 158k synthetic discharge summaries and share it publicly

September 1, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

Promising for research and on-prem experimentation, but not yet safe for direct clinical use because hallucination risk and single-turn limits remain.

Citations: 5

Evidence Strength: 0.75

Confidence: 0.90

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 20%

Novelty: 65%

Authors

Sunjun Kweon, Junu Kim, Jiyoun Kim, Sujeong Im, Eunbyeol Cho, Seongsu Bae, Jungwoo Oh, Gyubok Lee, Jong Hak Moon, Seng Chan You, Seungjin Baek, Chang Hoon Han, Yoon Bin Jung, Yohan Jo, Edward Choi

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can train and host a capable clinical LLM without private patient notes, lowering legal barriers and API costs while keeping models runnable inside hospitals or on-prem.

Who Should Care

Summary TLDR

The authors build Asclepius, a pair of clinical LLMs (7B and 13B parameters) trained only on 158k synthetic discharge summaries derived from public case reports. The synthetic notes were written to mimic real clinical notes, and their perplexity falls within the range measured on real hospital notes. On multiple real-note benchmarks, GPT-4 and clinician evaluations find that Asclepius-13B performs close to GPT-3.5-turbo and comparably to a variant trained on 57k real notes. All code, models, and synthetic data are released for research; the models are not yet recommended for clinical deployment.

Problem Statement

Real clinical notes are private and hard to share, which blocks building and releasing task-capable clinical LLMs. Can publicly available case reports be converted into synthetic clinical notes and used to train a high-quality, shareable clinical LLM without using private patient data?

Main Contribution

Created 158k synthetic discharge summaries from PMC case reports and produced 158,114 instruction–answer pairs for clinical NLP tasks.

Released Asclepius-7B and Asclepius-13B: clinical LLMs trained on synthetic notes with domain-adaptive pretraining then instruction fine-tuning.

Key Findings

Synthetic notes reach realistic language statistics after conversion.

Numbers: Perplexity: synthetic 4.816 vs real hospital range 2.186 to 5.178

Practical Use: You can transform public case reports into discharge-style notes that look statistically like real notes; use them to pretrain or adapt LMs without private data.

Evidence Ref: Section 2.1, perplexity table
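
To make the perplexity comparison concrete, here is a minimal sketch of how perplexity is derived from per-token log-probabilities: it is the exponential of the negative mean log-likelihood. The function name and the log-probability values are illustrative assumptions; the scoring model and tokenizer used in the paper are not reproduced here.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp(-mean(log p)) for natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical log-probabilities, not taken from the paper.
confident = [math.log(0.5)] * 4    # model assigns p=0.5 to every token
uncertain = [math.log(0.01)] * 4   # model assigns p=0.01 to every token

print(round(perplexity(confident), 3))   # 2.0
print(round(perplexity(uncertain), 3))   # 100.0
```

Lower perplexity means the scoring model finds the text more predictable, which is why the drop from the raw case reports to the converted notes is read as the text moving toward hospital-note style.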

In clinician evaluation, models trained on synthetic notes score close to models trained on real notes.

Numbers: Clinician average scores: Asclepius-13B 3.03 vs Asclepius-R-13B 3.15 (p=0.18)

Practical Use: For many clinical NLP tasks, a synthetic-data-trained model can be a practical alternative to one trained on protected patient notes, provided it is validated locally first.

Evidence Ref: Section 4.3 professional evaluation
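
The clinician comparison rests on a paired t-test over per-question scores. A minimal sketch of that statistic follows, assuming two aligned lists of scores for the same questions. The scores are made up, the function name is hypothetical, and the p-value uses a normal approximation to the t distribution (reasonable only for large samples, such as 100 questions) because the Python standard library has no t-distribution.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def paired_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, approximate two-sided p-value) for paired samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two aligned samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    # Normal approximation to the t distribution: an assumption for large n.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical 4-point scores for the same ten questions (not from the paper).
synthetic_scores = [3, 4, 2, 3, 4, 3, 2, 4, 3, 3]
real_scores = [3, 4, 3, 3, 4, 4, 2, 4, 3, 4]
t_stat, p_value = paired_t(synthetic_scores, real_scores)
print(f"t = {t_stat:.3f}, approximate p = {p_value:.3f}")
```

With a sample as small as the ten pairs shown, an exact t-distribution (for example from a statistics package) would be the safer choice; the approximation is kept here only to stay dependency-free.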

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Synthetic note perplexity (vs real) | Synthetic notes 4.816; PMC-Patients raw 71.719; MIMIC-III 2.186; i2b2 5.178 | Perplexity of real hospital notes (MIMIC-III to i2b2) | Synthetic dropped from 71.719 to 4.816 after conversion | 200-sample comparisons described in Section 2.1 | Perplexity table in Section 2.1 | Section 2.1 |
| Clinician average quality score (4 = best) | Asclepius-13B 3.03; Asclepius-R-13B 3.15 | Asclepius-R-13B (real-note trained) | Difference 0.12; paired t-test p=0.18 (no significant difference at n=100) | 100 DiSCQ questions | Professional evaluation in Section 4.3 | Section 4.3 |

What To Try In 7 Days

Download the released synthetic dataset and run quick fine-tuning on a 7B LLaMA checkpoint to test your local pipeline.

Run a small clinician review on 50 model outputs to measure hallucination and task fit before deeper investment.

Compare prompt-based performance vs your current API usage to estimate potential cost and latency savings.
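
The clinician-review step above can be sketched as drawing a reproducible sample of 50 outputs and tallying the share rated unacceptable. The record fields (`id`, `answer`), the choice of a rating of 1 as the "unacceptable" cutoff, and the function names are assumptions for illustration, not part of the released code.

```python
import random
from typing import Dict, List


def sample_for_review(outputs: List[Dict], k: int = 50, seed: int = 0) -> List[Dict]:
    """Draw a reproducible random sample of model outputs for clinician review."""
    rng = random.Random(seed)
    return rng.sample(outputs, min(k, len(outputs)))


def unacceptable_rate(ratings: List[int], threshold: int = 1) -> float:
    """Share of ratings at or below `threshold` on a 4-point scale (4 = best)."""
    if not ratings:
        raise ValueError("no ratings supplied")
    return sum(r <= threshold for r in ratings) / len(ratings)


# Hypothetical data: 200 generated answers awaiting review.
outputs = [{"id": i, "answer": f"model answer {i}"} for i in range(200)]
batch = sample_for_review(outputs)
print(len(batch))  # 50

# Hypothetical clinician ratings for the sampled batch.
ratings = [4] * 30 + [3] * 12 + [2] * 5 + [1] * 3
print(unacceptable_rate(ratings))  # 0.06
```

Fixing the seed lets a second reviewer rate exactly the same 50 items, which is what makes agreement between reviewers measurable later.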

Optimization Features

Token Efficiency
Extended context length to 2048 tokens to fit discharge summaries
Infra Optimization
Training done on 8x A100 80G or 8x A6000 48G GPUs
Model Optimization
Domain-adaptive pretraining on synthetic notes
System Optimization
Used efficient training techniques to offset longer contexts
Training Optimization
One pretraining epoch + three instruction fine-tuning epochs
Learning rate 2e-5, global batch size 128
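
A global batch size of 128 on 8 GPUs means the per-device batch size times the gradient-accumulation steps times the GPU count must equal 128. The sketch below works through that arithmetic; the per-device batch size of 4 is a hypothetical value, since the source does not state it, and the function name is likewise illustrative.

```python
def grad_accum_steps(global_batch: int, n_gpus: int, per_device_batch: int) -> int:
    """Accumulation steps so that n_gpus * per_device_batch * steps == global_batch."""
    per_step = n_gpus * per_device_batch
    if per_step <= 0 or global_batch % per_step != 0:
        raise ValueError(
            "global batch must be a positive multiple of n_gpus * per_device_batch"
        )
    return global_batch // per_step


# 128 and 8 come from the text; the per-device batch size of 4 is an assumption.
steps = grad_accum_steps(global_batch=128, n_gpus=8, per_device_batch=4)
print(steps)  # 4
```

On the 48G cards a smaller per-device batch would likely be needed than on the 80G cards; raising the accumulation steps keeps the global batch, and so the effective learning-rate schedule, unchanged.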

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Focused only on discharge summaries; other note types not evaluated.

Model supports single-turn instruction following only; no multi-turn dialogue tested.

When Not To Use

Do not deploy for live patient care without thorough local validation and safeguards.

Avoid using for multi-turn clinical decision support; model supports only single-turn tasks.

Failure Modes

Hallucinated entities or facts not present in source notes.

Incorrect clinical interpretations leading to unsafe recommendations.

Core Entities

Models

Asclepius-7B, Asclepius-13B, Asclepius-R-13B, LLaMA, GPT-3.5-turbo, GPT-4, Alpaca, Vicuna, MedAlpaca, ChatDoctor, Clinical-Camel

Metrics

Perplexity, 4-point clinician quality score, Krippendorff's alpha, p-value (paired t-test), Pearson/Kendall/Spearman correlations, Unacceptable-response counts (hallucination proxy)

Datasets

PMC-Patients, MIMIC-III, MIMIC-IV, i2b2, CASI, MTSamples, DiSCQ

Benchmarks

DiSCQ (clinician questions), CASI (Abbreviation Expansion, Coreference), MIMIC-derived discharge-summary instruction tasks