Overview
Production Readiness
0.3
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
117
Why It Matters For Business
MEDITRON is a strong open-source medical LLM that rivals much larger closed models on standard benchmarks. It lets teams run finetuning, auditing, and deployment experiments in-house and avoid vendor lock-in, though it is not yet production-ready for clinical use.
Summary TLDR
The authors release MEDITRON, a pair of open-source medical language models with 7B and 70B parameters. They continue pretraining Llama-2 on a curated corpus of roughly 46.8B–48.1B tokens (PubMed papers, PubMed abstracts, and 46K clinical guidelines, plus a 1% replay of general-domain data). MEDITRON improves accuracy on four medical benchmarks over matched open baselines (gains of up to ~6 percentage points reported) and is competitive with much larger closed models: it beats GPT-3.5 and comes close to GPT-4 and Med-PaLM on some tasks. The models, corpus curation code, and training library are publicly released. The authors warn against clinical deployment without alignment and trials.
Problem Statement
Large, high-performing medical LLMs are mostly closed-source or only available at small scales. The field needs an open, large-scale medical LLM and a reproducible pretraining pipeline built from high-quality medical data.
Main Contribution
Open-source medical LLMs at two scales: MEDITRON-7B and MEDITRON-70B, weights released.
A curated continued-pretraining mixture (GAP-REPLAY) of ~46.8–48.1B tokens: PubMed full papers, PubMed abstracts, 46K clinical guidelines, plus a small general-domain replay.
Extensions to Megatron-LM (Megatron-LLM) to support Llama-2 training and large-scale distributed pretraining.
Extensive automatic evaluation on four medical benchmarks (MedQA, MedMCQA, PubMedQA, MMLU-Medical) showing consistent gains over open baselines and competitiveness with larger closed models.
A public release of corpus-scraping and preprocessing code and a subset of guideline texts.
Key Findings
MEDITRON obtains consistent accuracy gains on medical benchmarks over open baselines.
MEDITRON-70B reaches its highest single-task scores when prompted to reason, using chain-of-thought and self-consistency decoding.
A curated pretraining mixture that includes guidelines and replay improves downstream results.
MEDITRON-70B is competitive with much larger closed models on some benchmarks.
Results
Accuracy
Truthfulness (TruthfulQA, medical categories) MEDITRON-70B
Who Should Care
What To Try In 7 Days
Download MEDITRON-7B or 70B weights and run inference on a held-out internal medical QA set.
Reproduce one finetune on MedMCQA or PubMedQA using the provided Megatron-LLM scripts.
Compare plain chain-of-thought prompting with self-consistency chain-of-thought (SC-CoT) decoding to measure how much accuracy each adds on multiple-choice medical QA tasks.
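For the held-out evaluation suggested above, a scoring helper could look like the sketch below. It is a minimal sketch, not the authors' evaluation harness: the function names `extract_choice` and `accuracy`, the A to E option range, and the "answer is (X)" output pattern are all assumptions for illustration.

```python
import re
from typing import List, Optional

# Hypothetical pattern: "answer is (B)", "Answer: C", and similar.
# The option letter must be uppercase and must not start a longer word.
_CHOICE = re.compile(r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-E])\)?(?![A-Za-z])")


def extract_choice(generation: str) -> Optional[str]:
    """Return the first option letter stated as the answer, or None."""
    match = _CHOICE.search(generation)
    return match.group(1) if match else None


def accuracy(generations: List[str], gold_letters: List[str]) -> float:
    """Fraction of generations whose extracted letter equals the gold letter."""
    if len(generations) != len(gold_letters):
        raise ValueError("generations and gold_letters must have equal length")
    if not gold_letters:
        return 0.0
    correct = sum(
        extract_choice(text) == gold
        for text, gold in zip(generations, gold_letters)
    )
    return correct / len(gold_letters)
```

Generations with no recognizable answer count as wrong, which keeps the score conservative when the model ignores the requested format.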
Optimization Features
Token Efficiency
- Context lengths: 2048 (7B) and 4096 (70B)
- Byte-pair encoding tokenizer (32k vocab) inherited from Llama
Infra Optimization
- RDMA over Converged Ethernet for inter-node comms
- Micro-batch size 2, global batch 512 to match memory constraints
Model Optimization
- Use of Llama-2 architecture and rotary embeddings
- Grouped-query attention (GQA) from Llama-2
System Optimization
- Cluster: 16 nodes × 8 A100 80GB GPUs, NVLink/NVSwitch
- Empirically chosen TP = PP = 8 for best per-GPU throughput
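The cluster and batch figures listed above can be combined into a quick layout check. The sketch below assumes the usual 3D-parallelism relation (GPU count = TP × PP × DP) and assumes that the TP = PP = 8 setting and the micro/global batch sizes belong to the same run, which this summary does not state explicitly. The function name `parallel_layout` is hypothetical.

```python
def parallel_layout(nodes, gpus_per_node, tp, pp, micro_batch, global_batch):
    """Derive data-parallel size and gradient-accumulation steps."""
    world_size = nodes * gpus_per_node
    model_parallel = tp * pp  # GPUs holding one model replica
    if world_size % model_parallel:
        raise ValueError("TP * PP must divide the total GPU count")
    dp = world_size // model_parallel  # number of model replicas

    samples_per_pass = micro_batch * dp  # samples per forward/backward pass
    if global_batch % samples_per_pass:
        raise ValueError("global batch must be a multiple of micro_batch * DP")
    accumulation_steps = global_batch // samples_per_pass

    return {
        "world_size": world_size,
        "dp": dp,
        "accumulation_steps": accumulation_steps,
    }


# Figures from the lists above: 16 nodes x 8 GPUs, TP = PP = 8,
# micro-batch size 2, global batch 512.
layout = parallel_layout(16, 8, tp=8, pp=8, micro_batch=2, global_batch=512)
```

Under these assumptions the 128 GPUs hold two model replicas, and each optimizer step accumulates 128 micro-batches per replica.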
Training Optimization
- Megatron-LLM distributed training with TP/PP/DP (3D parallelism)
- Activation recomputation and sequence parallelism
- Cosine learning rate schedule with warmup
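A cosine schedule with warmup, as named in the list above, is commonly written as a linear ramp followed by a half-cosine decay. The sketch below shows that common form only; the linear warmup shape, the function name `lr_at`, and every hyperparameter value are assumptions, not settings reported in this summary.

```python
import math


def lr_at(step, peak_lr, min_lr, warmup_steps, total_steps):
    """Learning rate at a 0-indexed optimizer step."""
    if step < warmup_steps:
        # Linear warmup: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Half-cosine decay from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values, chosen only to make the shape easy to inspect.
schedule = [lr_at(s, peak_lr=1.0, min_lr=0.1, warmup_steps=10, total_steps=110)
            for s in range(111)]
```

The rate rises during the first ten steps, stays at the peak when decay begins, and settles at `min_lr` once `total_steps` is reached.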
Inference Optimization
- FlashAttention and FlashAttention-2 support for efficient decoding
- Self-consistency sampling to improve reasoning answers
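Self-consistency sampling can be sketched as a majority vote over several sampled answers to the same question. This is a minimal sketch under stated assumptions: `sample_fn` is a hypothetical callable standing in for one sampled model generation reduced to an option letter, and the default of five samples is arbitrary rather than the paper's setting.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def majority_answer(sampled_answers: Iterable[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if there is none.

    Ties go to the answer that appeared first, because Counter.most_common
    keeps first-seen order among equal counts.
    """
    votes = Counter(a for a in sampled_answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def self_consistency(sample_fn: Callable[[str], Optional[str]],
                     prompt: str,
                     n_samples: int = 5) -> Optional[str]:
    """Call the (hypothetical) sampler n_samples times and vote."""
    return majority_answer(sample_fn(prompt) for _ in range(n_samples))
```

Samples that yield no parsable answer are dropped before voting, so a single malformed generation does not change the outcome.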
Reproducibility
Code Urls
Data Urls
- https://huggingface.co/datasets/epfLLM/guidelines
- S2ORC (source of the PubMed text, as described in the paper)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Not ready for clinical deployment without alignment and trials; authors explicitly warn against clinical use.
- Knowledge cutoff August 2023; may be outdated for recent medical guidance.
- Benchmarks are multiple-choice exams and may not reflect real-world clinical needs or safety.
- Some guideline sources could not be redistributed; public release is partial.
When Not To Use
- Do not use for clinical decision-making or patient care without clinician supervision.
- Do not deploy in emergency care, triage, or any setting requiring guaranteed safety.
- Avoid relying on outputs without human medical oversight and external validation.
Failure Modes
- Hallucinations or confidently incorrect medical advice.
- Outdated recommendations due to knowledge cutoff.
- Biases and harmful stereotypes learned from training data.
- Overconfidence on tasks outside evaluated benchmarks.
Core Entities
Models
- MEDITRON-7B
- MEDITRON-70B
- Llama-2-7B
- Llama-2-70B
- PMC-Llama-7B
- Med42-70B
- Clinical-Camel-70B
- Mistral-7B
- Zephyr-7B
- GPT-3.5
- GPT-4
- Med-PaLM
- Med-PaLM-2
Metrics
- Accuracy
- training loss
- validation loss
Datasets
- GAP-REPLAY
- Clinical Guidelines (GUIDELINES)
- PubMed Papers (PMC)
- PubMed Abstracts
- RedPajama (replay subset)
- MedQA
- MedMCQA
- PubMedQA
- MMLU-Medical
Benchmarks
- MedQA
- MedMCQA
- PubMedQA
- MMLU-Medical
- TruthfulQA

