MEDITRON: open-source 7B and 70B medical LLMs trained on a 48B-token curated medical corpus

November 27, 2023 · 8 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

117

Authors

Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit, Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, Antoine Bosselut

Why It Matters For Business

MEDITRON is a strong open-source medical LLM that rivals much larger closed models on standard benchmarks. It enables in-house finetuning, auditing, and deployment experiments without vendor lock-in, but it is not yet production-ready for clinical use.

Summary TLDR

The authors release MEDITRON, a pair of open-source medical language models at 7B and 70B parameters. They continue pretraining Llama-2 on a curated corpus of 46.8B–48.1B tokens (PubMed papers, PubMed abstracts, and 46K clinical guidelines, plus a 1% replay of general-domain data). MEDITRON improves accuracy on four medical benchmarks over matched open baselines (up to ~6% absolute gain reported) and is competitive with much larger closed models: it beats GPT-3.5 and comes close to GPT-4 and Med-PaLM on some tasks. The models, corpus curation code, and training library are publicly released. The authors warn against clinical deployment without further alignment and clinical trials.

Problem Statement

Large, high-performing medical LLMs are mostly closed-source or only available at small scales. The field needs an open, large-scale medical LLM and a reproducible pretraining pipeline built from high-quality medical data.

Main Contribution

Open-source medical LLMs at two scales, MEDITRON-7B and MEDITRON-70B, with released weights.

A curated continued-pretraining mixture (GAP-REPLAY) of ~46.8–48.1B tokens: PubMed full papers, PubMed abstracts, 46K clinical guidelines, plus a small general-domain replay.
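The replay component can be sized with simple arithmetic. The sketch below assumes the 1% replay share is measured against the final mixture (medical data plus replay); the summary does not state the denominator, and the 47.5B domain-token figure is a hypothetical input, not a number from the paper.

```python
def replay_tokens(domain_tokens: float, replay_fraction: float) -> float:
    """Tokens of general-domain data needed so that replay makes up
    `replay_fraction` of the final mixture (domain + replay)."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    # Solve replay / (domain + replay) = f  ->  replay = domain * f / (1 - f)
    return domain_tokens * replay_fraction / (1.0 - replay_fraction)


# Hypothetical example: 47.5B medical tokens with a 1% replay share.
domain = 47.5e9
replay = replay_tokens(domain, 0.01)
total = domain + replay
print(f"replay: {replay / 1e9:.2f}B tokens, total: {total / 1e9:.2f}B tokens")
```

If the share were instead defined relative to the medical tokens alone, the replay size would simply be `domain_tokens * replay_fraction`.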

Megatron-LLM, an extension of Megatron-LM that adds support for Llama-2 training and large-scale distributed pretraining.

Extensive automatic evaluation on four medical benchmarks (MedQA, MedMCQA, PubMedQA, MMLU-Medical) showing consistent gains over open baselines and competitiveness with larger closed models.

A public release of corpus-scraping and preprocessing code and a subset of guideline texts.

Key Findings

MEDITRON obtains consistent accuracy gains on medical benchmarks over open baselines.

Numbers: Avg accuracy +6% vs best public baseline in class; +3% vs finetuned Llama-2 (reported)

MEDITRON-70B reaches high single-task scores and benefits from reasoning prompts.

Numbers: MedQA-4-option 70.2% (SC-CoT); overall SC-CoT avg 72.0%

A curated pretraining mixture that includes guidelines and replay improves downstream results.

Numbers: GAP+Replay mixture yields avg accuracy 58.9% (7B trial run, finetuned) vs PMC-only 54.6% (Table 8)

MEDITRON-70B is competitive with much larger closed models on some benchmarks.

Numbers: Outperforms GPT-3.5 on all evaluated benchmarks; within ~0.2–5% of top closed models on PubMedQA and within 5–10% on a 4

Results

Accuracy

Value: 70.2%

Baseline: Llama-2-70B (63.8% top-token in figure)

Accuracy

Value: 72.0% (avg with SC-CoT)

Baseline: Llama-2-70B (69.2% with SC-CoT)

Accuracy

Value: 63.3% average (5-shot)

Baseline: Llama-2-70B base (60.8% at iteration 0)

Truthfulness (TruthfulQA, medical categories), MEDITRON-70B

Value: 71.2% average

Baseline: Llama-2-70B, 54.8% average

Who Should Care

What To Try In 7 Days

Download MEDITRON-7B or 70B weights and run inference on a held-out internal medical QA set.

Reproduce one finetune on MedMCQA or PubMedQA using the provided Megatron-LLM scripts.

Evaluate chain-of-thought prompting with self-consistency decoding (SC-CoT) to measure how much accuracy it adds on multiple-choice medical QA tasks.
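Self-consistency amounts to sampling several chain-of-thought completions for the same question and taking a majority vote over the final answers. The sketch below covers only the voting step and assumes completions end with a phrase like "the answer is (B)"; the regex, the A–D option range, and both function names are illustrative assumptions, not the paper's evaluation code.

```python
import re
from collections import Counter
from typing import List, Optional

# Assumed answer format: "... the answer is (B)" or "... the answer is B".
_ANSWER_RE = re.compile(r"answer is \(?([A-D])\b", flags=re.IGNORECASE)


def extract_choice(completion: str) -> Optional[str]:
    """Return the last answer letter found in a completion, or None."""
    matches = _ANSWER_RE.findall(completion)
    return matches[-1].upper() if matches else None


def self_consistency_vote(completions: List[str]) -> Optional[str]:
    """Majority vote over the answers extracted from sampled completions."""
    votes = Counter(
        choice for choice in map(extract_choice, completions) if choice is not None
    )
    if not votes:
        return None  # no completion produced a parseable answer
    return votes.most_common(1)[0][0]


# Hypothetical sampled completions for one question.
samples = [
    "The presentation suggests iron deficiency, so the answer is (B).",
    "Ferritin is low. Therefore, the answer is B",
    "This could be thalassemia; the answer is (C).",
    "I am not sure.",
]
print(self_consistency_vote(samples))
```

Completions without a parseable answer are dropped rather than counted, so a question where every sample fails to parse returns `None` and should be scored as incorrect.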

Optimization Features

Token Efficiency

  • Context lengths: 2048 (7B) and 4096 (70B)
  • Byte-pair encoding tokenizer (32k vocab) inherited from Llama

Infra Optimization

  • RDMA over Converged Ethernet for inter-node comms
  • Micro-batch size 2, global batch 512 to match memory constraints

Model Optimization

  • Use of Llama-2 architecture and rotary embeddings
  • Grouped-query attention (GQA) from Llama-2

System Optimization

  • Cluster: 16 nodes × 8 A100 80GB GPUs, NVLink/NVSwitch
  • Empirically chosen TP = PP = 8 for best per-GPU throughput
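The parallelism figures listed in this section fit together arithmetically. The sketch below derives the data-parallel degree and the number of micro-batches per optimizer step, assuming the cluster size, TP = PP = 8, micro-batch size 2, and global batch 512 all describe the same training run; the function names are illustrative.

```python
def data_parallel_degree(nodes: int, gpus_per_node: int, tp: int, pp: int) -> int:
    """GPUs left for data parallelism once tensor and pipeline parallelism are fixed."""
    world_size = nodes * gpus_per_node
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by TP * PP")
    return world_size // model_parallel


def micro_batches_per_step(global_batch: int, micro_batch: int, dp: int) -> int:
    """Micro-batches each data-parallel replica processes per optimizer step."""
    per_pass = micro_batch * dp
    if global_batch % per_pass != 0:
        raise ValueError("global batch must be divisible by micro_batch * DP")
    return global_batch // per_pass


dp = data_parallel_degree(nodes=16, gpus_per_node=8, tp=8, pp=8)
steps = micro_batches_per_step(global_batch=512, micro_batch=2, dp=dp)
print(dp, steps)  # 2 128
```

With 128 GPUs and 64-way model parallelism, only two data-parallel replicas remain, so most of the global batch is reached by accumulating many micro-batches through the pipeline rather than by adding replicas.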

Training Optimization

  • Megatron-LLM distributed training with TP/PP/DP (3D parallelism)
  • Activation recomputation and sequence parallelism
  • Cosine learning rate schedule with warmup
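A cosine schedule with warmup can be written in a few lines. This is a generic sketch with linear warmup; the peak learning rate, warmup length, total steps, and minimum learning rate used here are placeholder values, not the settings reported by the authors.

```python
import math


def lr_at(step: int, total_steps: int, warmup_steps: int,
          peak_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate at `step`: linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear warmup; reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder hyperparameters for illustration only.
for s in (0, 99, 100, 550, 1000):
    print(s, lr_at(s, total_steps=1000, warmup_steps=100, peak_lr=1e-4, min_lr=1e-5))
```

Setting `min_lr` above zero keeps the final steps from training at a vanishing learning rate, which is a common choice for continued pretraining runs.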

Inference Optimization

  • FlashAttention and FlashAttention-2 support for efficient decoding
  • Self-consistency sampling to improve reasoning answers

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Not ready for clinical deployment without alignment and trials; authors explicitly warn against clinical use.
  • Knowledge cutoff August 2023; may be outdated for recent medical guidance.
  • Benchmarks are multiple-choice exams and may not reflect real-world clinical needs or safety.
  • Some guideline sources could not be redistributed; public release is partial.

When Not To Use

  • Do not use for unsupervised clinical decision making or actionable patient care.
  • Do not deploy in emergency care, triage, or any setting requiring guaranteed safety.
  • Avoid relying on outputs without human medical oversight and external validation.

Failure Modes

  • Hallucinations or confidently incorrect medical advice.
  • Outdated recommendations due to knowledge cutoff.
  • Biases and harmful stereotypes learned from training data.
  • Overconfidence on tasks outside evaluated benchmarks.

Core Entities

Models

  • MEDITRON-7B
  • MEDITRON-70B
  • Llama-2-7B
  • Llama-2-70B
  • PMC-Llama-7B
  • Med42-70B
  • Clinical-Camel-70B
  • Mistral-7B
  • Zephyr-7B
  • GPT-3.5
  • GPT-4
  • Med-PaLM
  • Med-PaLM-2

Metrics

  • Accuracy
  • training loss
  • validation loss

Datasets

  • GAP-REPLAY
  • Clinical Guidelines (GUIDELINES)
  • PubMed Papers (PMC)
  • PubMed Abstracts
  • RedPajama (replay subset)
  • MedQA
  • MedMCQA
  • PubMedQA
  • MMLU-Medical

Benchmarks

  • MedQA
  • MedMCQA
  • PubMedQA
  • MMLU-Medical
  • TruthfulQA