MEDITRON: open-source 7B and 70B medical LLMs trained on a 48B-token curated medical corpus

November 27, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

Solid demonstration that continued pretraining on curated medical text improves benchmark scores and truthfulness. Model is useful for research and prototyping but requires extensive alignment, safety testing, and clinical trials before production medical deployment.

Citations: 117

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 60%

Authors

Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit, Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, Antoine Bosselut


Why It Matters For Business

MEDITRON offers a strong, open-source medical LLM that rivals much larger closed models on standard benchmarks, enabling in-house finetuning, auditing, and deployment experiments while avoiding vendor lock-in—though it is not yet production-ready for clinical use.

Who Should Care

Summary TLDR

The authors release MEDITRON, open-source medical language models at 7B and 70B parameters. They continue-pretrain Llama-2 on a curated 46.8B–48.1B token corpus (PubMed papers, PubMed abstracts, and 46K clinical guidelines plus a 1% replay of general data). MEDITRON improves accuracy on four medical benchmarks versus matched open baselines (up to ~6% absolute gain reported) and is competitive with much larger closed models (beats GPT-3.5, near GPT-4 / Med-PaLM on some tasks). Models, corpus curation code, and training library are publicly released. Authors warn against clinical deployment without alignment and trials.

Problem Statement

Large, high-performing medical LLMs are mostly closed-source or only available at small scales. The field needs an open, large-scale medical LLM and a reproducible pretraining pipeline built from high-quality medical data.

Main Contribution

Open-source medical LLMs at two scales: MEDITRON-7B and MEDITRON-70B, weights released.

A curated continued-pretraining mixture (GAP-REPLAY) of ~46.8–48.1B tokens: PubMed full papers, PubMed abstracts, 46K clinical guidelines, plus a small general-domain replay.

Key Findings

MEDITRON obtains consistent accuracy gains on medical benchmarks over open baselines.

Numbers: Avg accuracy +6% vs best public baseline in class; +3% vs finetuned Llama-2 (reported)

Practical Use: If you need a strong open-source medical model, use the MEDITRON weights rather than base Llama-2 or smaller domain models to gain several points of accuracy on standard medical QA tasks.

Evidence Ref: Abstract; Section 6 (Main Results)

MEDITRON-70B reaches high single-task scores and benefits from reasoning prompts.

Numbers: MedQA-4-option 70.2% (SC-CoT); overall SC-CoT avg 72.0%

Practical Use: Use chain-of-thought + self-consistency decoding to squeeze extra accuracy from the 70B model for multiple-choice medical QA.

Evidence Ref: Figure 1; Table 5 (Self-consistency Chain-of-thought)
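
The chain-of-thought plus self-consistency recipe mentioned above can be sketched as a majority vote over several sampled completions. This is a minimal illustration, not the authors' evaluation code: the function names `extract_choice` and `self_consistency_vote`, the "answer is (X)" output pattern, and the A-D option range are assumptions made for the example. The sampling step itself (drawing several completions at non-zero temperature) is left to whatever inference stack is in use.

```python
import re
from collections import Counter
from typing import List, Optional

# Assumed output convention: each completion ends with "... the answer is (X)".
_ANSWER_RE = re.compile(r"answer is \(?([A-D])\b", flags=re.IGNORECASE)


def extract_choice(completion: str) -> Optional[str]:
    """Return the last answer letter stated in a completion, or None."""
    matches = _ANSWER_RE.findall(completion)
    return matches[-1].upper() if matches else None


def self_consistency_vote(completions: List[str]) -> Optional[str]:
    """Majority vote over the answers extracted from sampled completions.

    Completions with no parseable answer are ignored; returns None when
    nothing could be parsed.
    """
    votes = Counter(
        choice for choice in map(extract_choice, completions) if choice is not None
    )
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical sampled reasoning paths for a single question.
samples = [
    "Beta-blockers reduce heart rate, so the answer is (B).",
    "Considering contraindications, the answer is B",
    "The answer is definitely unclear here.",
    "Reasoning step by step, the answer is (A).",
    "So the answer is b.",
]
print(self_consistency_vote(samples))  # prints: B
```

The vote ignores unparseable completions rather than counting them as wrong, which is one possible design choice; a stricter evaluation might treat them as abstentions and report them separately.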

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 70.2% | Llama-2-70B (63.8% top-token in figure) | +6.4pp vs some baselines (varies by mode) | MedQA-4-option | Table 5; Figure 1 | Table 5 (Self-consistency Chain-of-thought)
Accuracy | 72.0% (avg with SC-CoT) | Llama-2-70B (69.2% with SC-CoT) | +2.8pp | MedQA, MedMCQA, PubMedQA, MMLU-Medical | Table 5 (Self-consistency Chain-of-thought average) | Table 5

What To Try In 7 Days

Download MEDITRON-7B or 70B weights and run inference on a held-out internal medical QA set.

Reproduce one finetune on MedMCQA or PubMedQA using the provided Megatron-LLM scripts.

Compare top-token, chain-of-thought, and self-consistency chain-of-thought (SC-CoT) decoding to measure easy accuracy gains on multiple-choice medical QA tasks.
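
One way to structure the held-out evaluation suggested above is a small harness that scores several decoding modes on the same items. The sketch below is illustrative only: the item schema, the `accuracy` and `compare_modes` helpers, and the stub predictors are all hypothetical, and a real run would replace each stub with a call into the model under the corresponding decoding mode.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical item schema:
#   {"question": str, "options": {"A": str, ...}, "answer": "B"}
Predictor = Callable[[dict], Optional[str]]


def accuracy(items: Iterable[dict], predict: Predictor) -> float:
    """Fraction of items whose predicted letter equals the gold answer."""
    items = list(items)
    if not items:
        raise ValueError("need at least one QA item")
    correct = sum(1 for item in items if predict(item) == item["answer"])
    return correct / len(items)


def compare_modes(
    items: Iterable[dict], predictors: Dict[str, Predictor]
) -> Dict[str, float]:
    """Score every decoding mode on the same held-out items."""
    items = list(items)
    return {name: accuracy(items, fn) for name, fn in predictors.items()}


# Toy data and stub predictors so the sketch runs without a model.
heldout = [
    {"question": "q1", "options": {}, "answer": "A"},
    {"question": "q2", "options": {}, "answer": "B"},
    {"question": "q3", "options": {}, "answer": "A"},
    {"question": "q4", "options": {}, "answer": "C"},
]
scores = compare_modes(
    heldout,
    {
        "always_A": lambda item: "A",            # stand-in for a weak baseline
        "oracle": lambda item: item["answer"],   # stand-in for a perfect model
    },
)
print(scores)  # prints: {'always_A': 0.5, 'oracle': 1.0}
```

Scoring every mode on an identical item list keeps the comparison paired, so differences between modes are not confounded by different question samples.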

Optimization Features

Token Efficiency
Context lengths: 2048 (7B) and 4096 (70B)
Byte-pair encoding tokenizer (32k vocab) inherited from Llama
Infra Optimization
RDMA over Converged Ethernet for inter-node comms
Micro-batch size 2, global batch 512 to match memory constraints
Model Optimization
Use of Llama-2 architecture and rotary embeddings
Grouped-query attention (GQA) from Llama-2
System Optimization
Cluster: 16 nodes × 8 A100 80GB GPUs, NVLink/NVSwitch
Empirically chosen TP = PP = 8 for best per-GPU throughput
Training Optimization
Megatron-LLM distributed training with TP/PP/DP (3D parallelism)
Activation recomputation and sequence parallelism
Cosine learning rate schedule with warmup
Inference Optimization
FlashAttention and FlashAttention-2 support for efficient decoding
Self-consistency sampling to improve reasoning answers
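
The cluster, parallelism, and batch figures listed above can be related with a little arithmetic. The sketch below assumes that all of those numbers describe the same run (presumably the 70B model, which the summary does not state explicitly) and that every sequence is packed to the full 4096-token context. The `ParallelLayout` class and its field names are hypothetical helpers for this example, not part of Megatron-LLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    nodes: int
    gpus_per_node: int
    tensor_parallel: int
    pipeline_parallel: int
    micro_batch: int
    global_batch: int

    @property
    def world_size(self) -> int:
        """Total number of GPUs in the job."""
        return self.nodes * self.gpus_per_node

    @property
    def data_parallel(self) -> int:
        """Model replicas: GPUs left over once TP x PP are fixed."""
        model_parallel = self.tensor_parallel * self.pipeline_parallel
        if self.world_size % model_parallel:
            raise ValueError("TP x PP must divide the GPU count")
        return self.world_size // model_parallel

    @property
    def micro_batches_per_step(self) -> int:
        """Micro-batches each replica processes per optimizer step."""
        per_pass = self.micro_batch * self.data_parallel
        if self.global_batch % per_pass:
            raise ValueError("global batch must be a multiple of micro_batch x DP")
        return self.global_batch // per_pass

    def tokens_per_step(self, context_length: int) -> int:
        """Tokens per optimizer step, assuming fully packed sequences."""
        return self.global_batch * context_length


# Values taken from the summary above; pairing them in one run is an assumption.
layout = ParallelLayout(
    nodes=16, gpus_per_node=8,
    tensor_parallel=8, pipeline_parallel=8,
    micro_batch=2, global_batch=512,
)
print(layout.world_size)              # prints: 128
print(layout.data_parallel)           # prints: 2
print(layout.micro_batches_per_step)  # prints: 128
print(layout.tokens_per_step(4096))   # prints: 2097152
```

Under these assumptions the job holds two model replicas, and each optimizer step consumes roughly 2.1M tokens.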


Risks & Boundaries

Limitations

Not ready for clinical deployment without alignment and trials; authors explicitly warn against clinical use.

Knowledge cutoff August 2023; may be outdated for recent medical guidance.

When Not To Use

Do not use for unsupervised clinical decision-making or to direct patient care.

Do not deploy in emergency care, triage, or any setting requiring guaranteed safety.

Failure Modes

Hallucinations or confidently incorrect medical advice.

Outdated recommendations due to knowledge cutoff.

Core Entities

Models

MEDITRON-7B, MEDITRON-70B, Llama-2-7B, Llama-2-70B, PMC-Llama-7B, Med42-70B, Clinical-Camel-70B, Mistral-7B, Zephyr-7B, GPT-3.5, GPT-4, Med-PaLM, Med-PaLM-2

Metrics

Accuracy, training loss, validation loss

Datasets

GAP-REPLAY, Clinical Guidelines (GUIDELINES), PubMed Papers (PMC), PubMed Abstracts, RedPajama (replay subset), MedQA, MedMCQA, PubMedQA, MMLU-Medical

Benchmarks

MedQA, MedMCQA, PubMedQA, MMLU-Medical, TruthfulQA