MEDITRON: open-source 7B and 70B medical LLMs trained on a 48B-token curated medical corpus

Overview

Decision Snapshot: Needs Validation

The paper gives a solid demonstration that continued pretraining on curated medical text improves benchmark scores and truthfulness. The model is useful for research and prototyping, but it requires extensive alignment, safety testing, and clinical trials before any production medical deployment.

Citations: 117

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 60%

Authors

Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit, Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, Antoine Bosselut

Why It Matters For Business

MEDITRON offers a strong, open-source medical LLM that rivals much larger closed models on standard benchmarks. This enables in-house finetuning, auditing, and deployment experiments while avoiding vendor lock-in. It is not yet production-ready for clinical use.

Who Should Care

ML Engineer, Data Scientist, CTO, Product Manager

Summary TLDR

The authors release MEDITRON, open-source medical language models at 7B and 70B parameters. They continue pretraining Llama-2 on a curated corpus of 46.8B–48.1B tokens (PubMed papers, PubMed abstracts, and 46K clinical guidelines, plus a 1% replay of general-domain data). MEDITRON improves accuracy on four medical benchmarks versus matched open baselines (up to ~6% absolute gain reported) and is competitive with much larger closed models: it beats GPT-3.5 and comes close to GPT-4 and Med-PaLM on some tasks. The models, corpus curation code, and training library are publicly released. The authors warn against clinical deployment without alignment and trials.

Problem Statement

Large, high-performing medical LLMs are mostly closed-source or only available at small scales. The field needs an open, large-scale medical LLM and a reproducible pretraining pipeline built from high-quality medical data.

Main Contribution

Open-source medical LLMs at two scales: MEDITRON-7B and MEDITRON-70B, weights released.

A curated continued-pretraining mixture (GAP-REPLAY) of ~46.8–48.1B tokens: PubMed full papers, PubMed abstracts, 46K clinical guidelines, plus a small general-domain replay.

Key Findings

MEDITRON obtains consistent accuracy gains on medical benchmarks over open baselines.

Numbers: Avg accuracy +6% vs best public baseline in class; +3% vs finetuned Llama-2 (reported)

Practical Use: If you need a strong open-source medical model, use MEDITRON weights rather than base Llama-2 or smaller domain models to gain several points of accuracy on standard medical QA tasks.

Evidence Ref: Abstract; Section 6 (Main Results)

MEDITRON-70B reaches high single-task scores and benefits from reasoning prompts.

Numbers: MedQA-4-option 70.2% (SC-CoT); overall SC-CoT avg 72.0%

Practical Use: Use chain-of-thought + self-consistency decoding to squeeze extra accuracy from the 70B model for multiple-choice medical QA.

Evidence Ref: Figure 1; Table 5 (Self-consistency Chain-of-thought)
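
The self-consistency step mentioned above can be sketched as a majority vote over several sampled chain-of-thought completions. This is a minimal illustration, not the authors' evaluation code: the "the answer is (X)" output format, the regular expression, and the function names are assumptions made for the example.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Assumed output convention: each completion ends with "the answer is (X)".
# The lookahead rejects words such as "about" that merely start with A-D.
ANSWER_RE = re.compile(r"answer is \(?([A-D])(?![A-Za-z])", re.IGNORECASE)


def extract_choice(completion: str) -> Optional[str]:
    """Return the last option letter stated in one completion, if any."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].upper() if matches else None


def self_consistency_vote(completions: Iterable[str]) -> Optional[str]:
    """Majority vote over the answers parsed from sampled completions.

    Completions with no parseable answer are ignored. On a tie, the
    letter that was seen first wins.
    """
    votes = Counter(
        choice for choice in map(extract_choice, completions) if choice is not None
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

In practice the completions would come from sampling the model several times at a non-zero temperature on the same prompt; the number of samples is a cost/accuracy trade-off to tune on a development set.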

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	70.2%	Llama-2-70B (63.8% top-token in figure)	+6.4pp vs some baselines (varies by mode)	MedQA-4-option	Table 5; Figure 1	Table 5 (Self-consistency Chain-of-thought)
Accuracy	72.0% (avg with SC-CoT)	Llama-2-70B (69.2% with SC-CoT)	+2.8pp	MedQA, MedMCQA, PubMedQA, MMLU-Medical	Table 5 (Self-consistency Chain-of-thought average)	Table 5

What To Try In 7 Days

Download MEDITRON-7B or 70B weights and run inference on a held-out internal medical QA set.

Reproduce one finetune on MedMCQA or PubMedQA using the provided Megatron-LLM scripts.

Evaluate chain-of-thought prompting with self-consistency decoding (SC-CoT) to measure easy accuracy gains on multiple-choice medical QA tasks.
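
For the held-out evaluation suggested in the steps above, a small harness like the following may be enough to get a first accuracy number. It is a sketch under stated assumptions: `MCQItem`, `format_prompt`, and the prompt layout are hypothetical, and `predict` stands in for whatever inference call wraps the downloaded weights.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class MCQItem:
    """One multiple-choice question (hypothetical schema for illustration)."""
    question: str
    options: Dict[str, str]  # option letter -> option text
    answer: str              # gold option letter, e.g. "B"


def format_prompt(item: MCQItem) -> str:
    """Render an item as a plain-text prompt; the layout is an assumption."""
    lines = [f"Question: {item.question}"]
    lines += [f"({letter}) {text}" for letter, text in sorted(item.options.items())]
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(items: Sequence[MCQItem], predict: Callable[[str], str]) -> float:
    """Fraction of items where the predicted letter equals the gold letter."""
    if not items:
        raise ValueError("evaluation set is empty")
    correct = sum(
        predict(format_prompt(item)).strip().upper() == item.answer
        for item in items
    )
    return correct / len(items)
```

Keeping `predict` as a plain callable makes it easy to compare single-answer decoding against a self-consistency wrapper on the same items.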

Optimization Features

Token Efficiency

Context lengths: 2048 (7B) and 4096 (70B)

Byte-pair encoding tokenizer (32k vocab) inherited from Llama

Infra Optimization

RDMA over Converged Ethernet for inter-node communication

Micro-batch size 2, global batch 512 to match memory constraints

Model Optimization

Use of Llama-2 architecture and rotary embeddings

Grouped-query attention (GQA) from Llama-2

System Optimization

Cluster: 16 nodes × 8 A100 80GB GPUs, NVLink/NVSwitch

Empirically chosen TP = PP = 8 for best per-GPU throughput
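
The cluster and parallelism figures listed here imply a data-parallel degree and a number of micro-batches per optimizer step. The arithmetic is sketched below, assuming all 128 GPUs serve a single training job and that the micro-batch size of 2 and global batch of 512 apply to the same run as TP = PP = 8; the summary does not state this explicitly, and the function name is hypothetical.

```python
def parallel_layout(nodes: int, gpus_per_node: int, tp: int, pp: int,
                    micro_batch: int, global_batch: int) -> dict:
    """Derive data-parallel degree and micro-batches per step for 3D parallelism."""
    world_size = nodes * gpus_per_node
    model_parallel = tp * pp  # GPUs needed to hold one model replica
    if world_size % model_parallel:
        raise ValueError("world size must be divisible by TP * PP")
    data_parallel = world_size // model_parallel

    samples_per_pass = micro_batch * data_parallel  # samples per micro-step
    if global_batch % samples_per_pass:
        raise ValueError("global batch must be divisible by micro_batch * DP")

    return {
        "world_size": world_size,
        "data_parallel": data_parallel,
        "micro_batches_per_step": global_batch // samples_per_pass,
    }


# Figures from the summary: 16 nodes x 8 GPUs, TP = PP = 8,
# micro-batch 2, global batch 512.
layout = parallel_layout(16, 8, tp=8, pp=8, micro_batch=2, global_batch=512)
print(layout)
# {'world_size': 128, 'data_parallel': 2, 'micro_batches_per_step': 128}
```

Under these assumptions each model replica spans 64 GPUs, leaving two data-parallel replicas, and each replica processes 128 micro-batches per optimizer step.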

Training Optimization

Megatron-LLM distributed training with TP/PP/DP (3D parallelism)

Activation recomputation and sequence parallelism

Cosine learning rate schedule with warmup
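
A cosine learning rate schedule with warmup, as listed above, can be sketched as follows. This is a generic formulation, not the Megatron-LLM implementation; linear warmup is assumed, and every numeric value in the usage lines is a hypothetical placeholder rather than a hyperparameter from the paper.

```python
import math


def cosine_lr(step: int, total_steps: int, warmup_steps: int,
              peak_lr: float, min_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values chosen only to show the shape of the schedule.
for s in (0, 99, 100, 550, 1000):
    print(s, cosine_lr(s, total_steps=1000, warmup_steps=100,
                       peak_lr=1e-4, min_lr=1e-5))
```

The schedule rises during warmup, peaks at the end of warmup, and decays smoothly to the floor value by the final step.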

Inference Optimization

FlashAttention and FlashAttention-2 support for efficient decoding

Self-consistency sampling to improve reasoning answers

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/epfLLM/megatron-LLM

https://github.com/epfLLM/meditron

https://huggingface.co/epfl-llm/

https://huggingface.co/datasets/epfl-llm/guidelines

Data URLs

https://huggingface.co/datasets/epfLLM/guidelines

S2ORC (PubMed via S2ORC, as described in the paper)

Risks & Boundaries

Limitations

Not ready for clinical deployment without alignment and trials; authors explicitly warn against clinical use.

Knowledge cutoff August 2023; may be outdated for recent medical guidance.

When Not To Use

Do not use for unsupervised clinical decision making or actionable patient care.

Do not deploy in emergency care, triage, or any setting requiring guaranteed safety.

Failure Modes

Hallucinations or confidently incorrect medical advice.

Outdated recommendations due to knowledge cutoff.

Core Entities

Models

MEDITRON-7B, MEDITRON-70B, Llama-2-7B, Llama-2-70B, PMC-Llama-7B, Med42-70B, Clinical-Camel-70B, Mistral-7B, Zephyr-7B, GPT-3.5, GPT-4, Med-PaLM, Med-PaLM-2

Metrics

Accuracy, training loss, validation loss

Datasets

GAP-REPLAY, Clinical Guidelines (GUIDELINES), PubMed Papers (PMC), PubMed Abstracts, RedPajama (replay subset), MedQA, MedMCQA, PubMedQA, MMLU-Medical

Benchmarks

MedQA, MedMCQA, PubMedQA, MMLU-Medical, TruthfulQA
