Train a usable clinical LLM from 158k synthetic discharge summaries and share it publicly

September 1, 2023 · 8 min

Overview

Production Readiness

0.2

Novelty Score

0.65

Cost Impact Score

0.7

Citation Count

5

Authors

Sunjun Kweon, Junu Kim, Jiyoun Kim, Sujeong Im, Eunbyeol Cho, Seongsu Bae, Jungwoo Oh, Gyubok Lee, Jong Hak Moon, Seng Chan You, Seungjin Baek, Chang Hoon Han, Yoon Bin Jung, Yohan Jo, Edward Choi

Links

Abstract / PDF

Why It Matters For Business

You can train and host a capable clinical LLM without private patient notes, lowering legal barriers and API costs while keeping models runnable inside hospitals or on-prem.

Summary TLDR

The authors build Asclepius, a pair of clinical LLMs (7B and 13B parameters) trained only on 158k synthetic discharge summaries derived from public case reports. The synthetic notes are written to mimic real clinical notes, and their perplexity falls within the range measured on real hospital notes. On multiple real-note benchmarks, GPT-4 and clinician evaluations rate Asclepius-13B close to GPT-3.5-turbo and comparable to a variant trained on 57k real notes. All code, models, and synthetic data are released for research; the models are not yet recommended for clinical deployment.

Problem Statement

Real clinical notes are private and hard to share, which blocks building and releasing task-capable clinical LLMs. Can publicly available case reports be converted into synthetic clinical notes and used to train a high-quality, shareable clinical LLM without using private patient data?

Main Contribution

Created 158k synthetic discharge summaries from PMC case reports and produced 158,114 instruction–answer pairs for clinical NLP tasks.

Released Asclepius-7B and Asclepius-13B: clinical LLMs trained on synthetic notes with domain-adaptive pretraining then instruction fine-tuning.

Evaluated the models on real clinical notes using automated GPT-4 scoring and clinician ratings, showing that synthetic-trained models perform comparably to real-note-trained variants on the tested benchmarks.

Open-sourced weights, code, and synthetic data for research and reuse.

Key Findings

Synthetic notes reach realistic language statistics after conversion.

Numbers: Perplexity: synthetic 4.816 vs real hospital range 2.186–5.178

Human experts rate models trained on synthetic notes close to models trained on real notes.

Numbers: Clinician average scores: Asclepius-13B 3.03 vs Asclepius-R-13B 3.15 (p=0.18)

Asclepius-13B is competitive with larger API models on tested tasks.

Numbers: GPT-4 eval: Asclepius-13B ~3.36 vs GPT-3.5-turbo 3.46 on MIMIC-III test (4-point scale)

Hallucination rates are similar between synthetic- and real-note trained models.

Numbers: Responses scored 'unacceptable' by all clinicians: Asclepius 12/100 vs Asclepius-R 10/100

Dataset and model scale matter: more synthetic notes improved performance.

Numbers: Asclepius-57k (trained on 57k notes) showed a slight drop compared to Asclepius-158k

Results

Synthetic note perplexity (vs real)

Value: Synthetic notes 4.816; PMC-Patients raw 71.719; MIMIC-III 2.186; i2b2 5.178

Baseline: Perplexity of real hospital notes (MIMIC-III to i2b2)
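
Perplexity is the exponential of the average negative log-likelihood per token, so a lower value means the text looks more like what the scoring model expects. That is why the synthetic notes (4.816) sit inside the real-note range while the raw case reports (71.719) do not. The sketch below shows only the arithmetic; the log-probabilities are made-up illustrations, and the summary does not say which scoring model or tokenizer produced the figures above.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean of per-token natural-log probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs for two short texts (not from the paper).
note_like_text = [-0.9, -1.2, -0.7, -1.0]
case_report_like_text = [-3.8, -4.5, -4.1, -4.6]

print(perplexity(note_like_text))         # lower: closer to the scoring model's style
print(perplexity(case_report_like_text))  # higher: further from that style
```

In practice the log-probabilities would come from a language model scoring each note, averaged over the whole corpus rather than a single short text.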

Clinician average quality score (4=best)

Value: Asclepius-13B 3.03; Asclepius-R-13B 3.15

Baseline: Asclepius-R-13B (real-note trained)

GPT-4 automated score (4-point scale) on MIMIC-III

Value: Asclepius-13B 3.36; GPT-3.5-turbo 3.46

Baseline: GPT-3.5-turbo

Hallucination proxy (unacceptable responses count)

Value: Asclepius-13B 12/100; Asclepius-R-13B 10/100

Baseline: Asclepius-R-13B
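
With only 100 rated responses per model, a gap of 12 vs 10 is well within sampling noise. One way to see this is to put a confidence interval around each proportion. The sketch below uses the Wilson score interval, which is our choice for illustration and not an analysis reported in the source; the function name is hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width

# Counts taken from the hallucination-proxy row above.
synthetic = wilson_interval(12, 100)
real = wilson_interval(10, 100)

print("Asclepius-13B  :", synthetic)
print("Asclepius-R-13B:", real)
# The two intervals overlap if each lower bound is below the other's upper bound.
print("intervals overlap:", synthetic[0] < real[1] and real[0] < synthetic[1])
```

The same function can be reused on a local clinician review to judge whether an observed unacceptable-response rate is distinguishable from these figures.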

Automated–human alignment

Value: Pearson 0.41; Kendall-Tau 0.36; Spearman 0.39

Baseline: Correlation between GPT-4 and clinicians
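
The three coefficients measure different things: Pearson captures linear agreement, Spearman is Pearson computed on ranks, and Kendall's tau counts concordant versus discordant pairs. A dependency-free sketch follows for checking automated scores against your own reviewers. The score lists are invented, and the tau-b tie correction is an assumption, since the summary does not say which Kendall variant was used.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: counted in neither tie term
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / math.sqrt(
        (pairs + ties_x_only) * (pairs + ties_y_only))

# Hypothetical 4-point scores for eight responses (not from the paper).
gpt4_scores = [4, 3, 4, 2, 3, 1, 4, 2]
clinician_scores = [3, 3, 4, 2, 2, 1, 3, 3]
print(pearson(gpt4_scores, clinician_scores))
print(spearman(gpt4_scores, clinician_scores))
print(kendall_tau_b(gpt4_scores, clinician_scores))
```

Values around 0.4, as reported above, indicate moderate agreement, so automated GPT-4 scores are a screening signal and not a replacement for clinician review.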

Who Should Care

What To Try In 7 Days

Download the released synthetic dataset and run quick fine-tuning on a 7B LLaMA checkpoint to test your local pipeline.

Run a small clinician review on 50 model outputs to measure hallucination and task fit before deeper investment.

Compare prompt-based performance vs your current API usage to estimate potential cost and latency savings.
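
For the cost comparison, a back-of-the-envelope calculation is usually enough to decide whether deeper benchmarking is worthwhile. The sketch below is only arithmetic: every price and volume in it is a placeholder to be replaced with your own figures, none of them come from the paper, and the function names are hypothetical.

```python
def monthly_api_cost(requests_per_month, tokens_per_request, usd_per_1k_tokens):
    """Pay-per-token cost for one month of traffic."""
    return requests_per_month * tokens_per_request / 1000 * usd_per_1k_tokens

def monthly_selfhost_cost(gpu_count, usd_per_gpu_hour, hours_per_month=730):
    """Flat cost of keeping GPUs running all month (hardware or rental only)."""
    return gpu_count * usd_per_gpu_hour * hours_per_month

def break_even_requests(tokens_per_request, usd_per_1k_tokens, selfhost_monthly_usd):
    """Monthly request volume at which self-hosting costs the same as the API."""
    usd_per_request = tokens_per_request / 1000 * usd_per_1k_tokens
    return selfhost_monthly_usd / usd_per_request

# Placeholder inputs: substitute your own traffic and pricing.
api = monthly_api_cost(requests_per_month=200_000,
                       tokens_per_request=2_000,
                       usd_per_1k_tokens=0.002)
hosted = monthly_selfhost_cost(gpu_count=1, usd_per_gpu_hour=1.0)

print(f"API: ${api:.2f}/month, self-hosted: ${hosted:.2f}/month")
print("break-even requests/month:", round(break_even_requests(2_000, 0.002, hosted)))
```

This ignores engineering time, monitoring, and throughput limits of a single GPU, so treat the break-even figure as a lower bound on the volume needed to justify self-hosting.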

Optimization Features

Token Efficiency

  • Extended context length to 2048 tokens to fit discharge summaries

Infra Optimization

  • Training done on 8x A100 80G or 8x A6000 48G GPUs

Model Optimization

  • Domain-adaptive pretraining on synthetic notes

System Optimization

  • Used efficient training techniques to offset longer contexts

Training Optimization

  • One pretraining epoch + three instruction fine-tuning epochs
  • Learning rate 2e-5, global batch size 128
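
To reproduce a global batch size of 128 on 8 GPUs, the per-device batch size and gradient accumulation steps must multiply out to 16 per GPU. The summary does not give that split, so the per-device value below is an assumption, as is treating each of the 158,114 instruction pairs as one training example with no sequence packing.

```python
import math

def grad_accum_steps(global_batch, num_gpus, per_device_batch):
    """Accumulation steps so that num_gpus * per_device_batch * steps == global_batch."""
    samples_per_step = num_gpus * per_device_batch
    if global_batch % samples_per_step != 0:
        raise ValueError("global batch must be a multiple of num_gpus * per_device_batch")
    return global_batch // samples_per_step

def optimizer_steps(num_examples, global_batch, epochs):
    """Total optimizer updates, keeping the last partial batch of each epoch."""
    return math.ceil(num_examples / global_batch) * epochs

# per_device_batch=4 is an assumed value chosen for illustration.
print(grad_accum_steps(global_batch=128, num_gpus=8, per_device_batch=4))    # 4
print(optimizer_steps(num_examples=158_114, global_batch=128, epochs=3))     # 3708
```

On GPUs with less memory, lowering the per-device batch and raising accumulation keeps the effective batch size, and therefore the learning-rate setting of 2e-5, comparable.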

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Focused only on discharge summaries; other note types not evaluated.
  • Model supports single-turn instruction following only; no multi-turn dialogue tested.
  • Initial synthetic data generation used GPT outputs; licensing/terms may restrict some downstream uses.
  • Hallucination behavior not exhaustively analyzed; some unsafe responses exist.
  • Clinician evaluation sample size is modest (n=100), limiting statistical power.

When Not To Use

  • Do not deploy for live patient care without thorough local validation and safeguards.
  • Avoid using for multi-turn clinical decision support; model supports only single-turn tasks.
  • Do not assume performance on other note types (radiology, nursing, progress notes).

Failure Modes

  • Hallucinated entities or facts not present in source notes.
  • Incorrect clinical interpretations leading to unsafe recommendations.
  • False assertions that an item does not exist when it does.
  • Ambiguous answers that could be misread by clinicians.

Core Entities

Models

  • Asclepius-7B
  • Asclepius-13B
  • Asclepius-R-13B
  • LLaMA
  • GPT-3.5-turbo
  • GPT-4
  • Alpaca
  • Vicuna
  • MedAlpaca
  • ChatDoctor
  • Clinical-Camel

Metrics

  • Perplexity
  • 4-point clinician quality score
  • Krippendorff's alpha
  • p-value (paired t-test)
  • Pearson/Kendall/Spearman correlations
  • Unacceptable-response counts (hallucination proxy)

Datasets

  • PMC-Patients
  • MIMIC-III
  • MIMIC-IV
  • i2b2
  • CASI
  • MTSamples
  • DiSCQ

Benchmarks

  • DiSCQ (clinician questions)
  • CASI (Abbreviation Expansion, Coreference)
  • MIMIC-derived discharge-summary instruction tasks