Aloe: open 7B–8B medical LLMs using synthetic Chain-of-Thought, model merging and Direct Preference Optimization

Overview

Decision Snapshot: Needs Validation

The paper provides practical recipes (data curation, synthetic CoT, merging, DPO, prompt configs) and benchmark evidence, but results are limited to the 7–8B scale and a specialized test suite, so deploy with caution and add further safety validation.

Citations: 6

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

License: CC BY-NC 4.0 (for the released DPO model)

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Jordi Bayarri-Planas, Adrian Tormos, Daniel Hinjos, Pablo Bernabeu-Perez, Anna Arias-Duart, Pablo Agustin Martin-Torres, Lucia Urcelay-Ganzabal, Marta Gonzalez-Mallo, Sergio Alvarez-Napagao, Eduard Ayguadé-Parra, Ulises Cortés, Dario Garcia-Gasulla


Why It Matters For Business

Aloe shows practical, low‑cost ways to push open medical LLMs: generate CoT examples with a stronger model, merge fine‑tuned variants, and use retrieval‑style prompting to get 2–7 point accuracy gains without larger models or expensive pretraining.

Who Should Care

ML Engineer, Product Manager, CTO, Data Scientist, Founder

Summary TLDR

Aloe is a family of open medical LLMs (7B–8B) built by fine-tuning strong open base models (Mistral, LLaMA 3). The team combined large curated medical QA collections, synthetically generated Chain-of-Thought (CoT) answers, model merging, and Direct Preference Optimization (DPO) alignment with red teaming. On common medical benchmarks, Aloe (Llama3‑Aloe‑8B‑Alpha) is the strongest open model at its scale: it gains ~1–2 points absolute over Llama‑3‑8B in zero-shot, reaches up to 76.9% average with Medprompt, and shows a modest safety improvement after DPO (ASR 0.56→0.52). The code, data, and model artifacts for the DPO release are shared under CC BY-NC 4.0. Evidence: benchmark tables, ablations, and red‑teaming/

Problem Statement

Open medical LLMs lag behind closed models. Continued pretraining has limited payoff. Which combination of instruct tuning, synthetic CoT, model merging, alignment (DPO) and prompt engineering gives the best practical gains for competitive, open healthcare LLMs?

Main Contribution

Built Aloe, a family of open healthcare LLMs (Mistral‑ and LLaMA‑based) in the 7B–8B range and released an aligned DPO variant under CC BY‑NC 4.0.

Assembled and cleaned a large supervised fine‑tuning mix: ~348k curated medical QA pairs plus general QA and 210k+ synthetically CoT‑enhanced items, totaling ~500M tokens for SFT.

Key Findings

Aloe's aligned 8B variant outperforms Llama‑3‑8B‑Instruct across medical benchmarks at this size.

Numbers: Zero‑shot avg 70.25 vs 68.89 (Llama‑3‑8B), Table 3

Practical Use: If you need the best open 7–8B medical model today, try Llama3‑Aloe‑8B‑Alpha; expect ~1–2 absolute points gain on evaluated medical QA tasks.

Evidence Ref: Table 3 (§5.2)

Medprompt (KNN few‑shot + CoT + majority voting) substantially improves accuracy at inference.

Numbers: Medprompt peak avg 76.88% for Aloe (20 ensembles), Table 25; reported +7% vs baseline

Practical Use: Before training more, add Medprompt: use a CoT example DB and 5–20 ensembles to boost accuracy by several points with modest infra setup.

Evidence Ref: Table 25 (§5.1)
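
The KNN few-shot step of Medprompt can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 2-dimensional vectors are hand-written stand-ins for real embeddings, and the names `cosine`, `knn_examples`, `build_prompt`, and `EXAMPLE_DB` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_examples(query_vec, example_db, k=5):
    """Return the k stored examples whose embeddings are closest to the query."""
    ranked = sorted(example_db,
                    key=lambda ex: cosine(query_vec, ex["vec"]),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, shots):
    """Place the retrieved CoT examples ahead of the new question."""
    parts = [
        f"Question: {s['question']}\nReasoning: {s['cot']}\nAnswer: {s['answer']}"
        for s in shots
    ]
    parts.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(parts)

# Toy example DB; a real one would hold CoT answers and model-produced embeddings.
EXAMPLE_DB = [
    {"id": "a", "question": "toy question A", "cot": "toy reasoning A",
     "answer": "B", "vec": [1.0, 0.0]},
    {"id": "b", "question": "toy question B", "cot": "toy reasoning B",
     "answer": "D", "vec": [0.0, 1.0]},
    {"id": "c", "question": "toy question C", "cot": "toy reasoning C",
     "answer": "A", "vec": [0.7, 0.7]},
]

shots = knn_examples([1.0, 0.1], EXAMPLE_DB, k=2)
prompt = build_prompt("new toy question", shots)
```

With the toy vectors above, the query is closest to example `a`, then `c`, so those two become the few-shot block. In practice the brute-force sort would likely be replaced by a vector index once the example DB grows.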

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	70.25	Llama‑3‑8B‑Instruct 68.89	+1.36	weighted average across medical benchmarks (Table 3)	Table 3 zero‑shot block	Table 3
Accuracy	76.88	Meta Llama‑3‑8B pubmedbert 74.06 (same setting)	+2.82	Medprompt (20 ensembles) average across medical benchmarks	Table 25	Table 25 (§5.1)

What To Try In 7 Days

Build a small CoT example DB by prompting a stronger model on your task and reuse it for KNN few‑shot prompts.

Evaluate model merging (Mergekit) on two or three fine‑tuned checkpoints before buying larger models.

Run a focused red‑teaming pass and create a tiny DPO preference set for the worst‑case prompts you care about.
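
For the DPO step, a preference set is commonly stored as prompt/chosen/rejected triples, and the standard per-pair DPO objective is the negative log-sigmoid of a scaled log-probability margin between the policy and a frozen reference model. The sketch below assumes sequence log-probabilities are already available; the record fields, the `dpo_loss` helper, and `beta=0.1` are illustrative choices, not values taken from the paper.

```python
import math

# One hypothetical preference record for a red-teaming prompt.
record = {
    "prompt": "toy unsafe request",
    "chosen": "toy safe refusal with explanation",
    "rejected": "toy unsafe completion",
}

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    margin = (policy - reference) log-ratio of the chosen answer
             minus the same log-ratio for the rejected answer.
    loss   = -log(sigmoid(beta * margin)), computed as a stable softplus.
    """
    x = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# Policy identical to the reference: margin is 0, loss is log(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy now assigns more probability to the chosen answer: loss drops.
better = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

The loss starts at log(2) when the policy equals the reference and falls as the policy shifts probability toward the chosen response, which is the signal a small, targeted preference set provides.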

Optimization Features

Token Efficiency

use of 5 few‑shots recommended (balance of quality and context length)

tradeoff: 20 ensembles gain ~1% vs 5 but cost ~4x

Model Optimization

model merging (DARE‑TIES / Mergekit)

LoRA
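
The idea behind DARE‑TIES merging can be illustrated on flat parameter lists. This is a simplified sketch, not Mergekit's implementation: it omits per-model weights and tensor handling, and the function names and the `drop_p` parameter are hypothetical. DARE randomly drops entries of each fine-tuned delta and rescales the survivors; the TIES-style step then elects a sign per parameter and averages only the deltas that agree with it.

```python
import random

def dare(delta, drop_p, rng):
    """Drop each delta entry with probability drop_p; rescale the rest by 1/(1-drop_p)."""
    if not 0.0 <= drop_p < 1.0:
        raise ValueError("drop_p must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_p)
    return [0.0 if rng.random() < drop_p else d * scale for d in delta]

def ties_merge(deltas):
    """Per parameter: elect the sign of the summed deltas, then average agreeing entries."""
    merged = []
    for column in zip(*deltas):
        sign = 1.0 if sum(column) >= 0 else -1.0
        agreeing = [v for v in column if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged

def dare_ties(base, finetuned, drop_p=0.5, seed=0):
    """Merge several fine-tuned parameter lists back onto a shared base."""
    rng = random.Random(seed)
    deltas = [
        dare([f - b for f, b in zip(ft, base)], drop_p, rng)
        for ft in finetuned
    ]
    return [b + d for b, d in zip(base, ties_merge(deltas))]

# Toy check with dropping disabled so the result is deterministic.
base = [0.0, 0.0, 0.0]
finetuned = [[1.0, -1.0, 2.0], [3.0, 2.0, -4.0]]
merged = dare_ties(base, finetuned, drop_p=0.0)
```

In the toy case the first parameter averages both deltas, while the second and third keep only the delta matching the elected sign, which is how sign conflicts between checkpoints are resolved instead of cancelling out.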

System Optimization

use of SFR-Embedding-Mistral or PubmedBERT embeddings depending on example DB size

Training Optimization

templating to increase prompt variety

DEITA filtering to prune low‑quality pairs

synthetic CoT generation to enrich train splits

Inference Optimization

Medprompt: KNN few‑shot retrieval + CoT + ensemble majority voting

Self‑consistency CoT (ensembles)
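
The ensemble step shared by both techniques amounts to sampling several answers and keeping the most frequent one. The sketch below assumes a caller-supplied `sample_fn(prompt, run_index)` that returns one final answer per run; that callback, `fake_sampler`, and the tie-breaking rule are hypothetical choices made for illustration.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer

def ensemble_answer(sample_fn, prompt, n_ensembles=5):
    """Sample n_ensembles answers for one prompt and return (winner, all votes)."""
    votes = [sample_fn(prompt, i) for i in range(n_ensembles)]
    return majority_vote(votes), votes

def fake_sampler(prompt, run_index):
    """Stand-in for a model call; returns canned multiple-choice letters."""
    canned = ["B", "C", "B", "A", "B"]
    return canned[run_index % len(canned)]

final, votes = ensemble_answer(fake_sampler, "toy prompt", n_ensembles=5)
```

Each extra ensemble member is one more full generation, so cost scales linearly with `n_ensembles`, consistent with the 5 vs 20 tradeoff noted under Token Efficiency.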

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: CC BY-NC 4.0 (for the released DPO model)

Code URLs

prompting repository (authors stated to share; URL in paper references and appendix)

merge configurations (shared by authors)

Data URLs

training data and CoT examples (authors state they distribute training data for the released DPO model)

Risks & Boundaries

Limitations

Not intended for clinical use; authors explicitly prohibit clinical deployment.

DPO reduces but does not eliminate unsafe outputs; jailbreaks remain effective.
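
Attack Success Rate (ASR) is the fraction of red-teaming prompts that elicit an unsafe response. The sketch below shows the computation and the relative change implied by the reported 0.56→0.52 shift; the per-prompt flags are toy values, and the function names are hypothetical.

```python
def attack_success_rate(unsafe_flags):
    """Share of red-teaming prompts judged to have produced an unsafe output."""
    if not unsafe_flags:
        raise ValueError("need at least one red-teaming outcome")
    return sum(bool(flag) for flag in unsafe_flags) / len(unsafe_flags)

def relative_reduction(before, after):
    """Relative drop from one rate to another, as a fraction of the first."""
    return (before - after) / before

# Toy judgments: True means the attack succeeded (hypothetical data).
toy_flags = [True, False, True, False]
toy_asr = attack_success_rate(toy_flags)

# Reported ASR before and after DPO: an absolute drop of 0.04.
rel = relative_reduction(0.56, 0.52)
print(round(rel, 3))  # 0.071
```

A 0.04 absolute drop is roughly a 7% relative reduction, so more than half of the attacks in the evaluated set still succeed after alignment.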

When Not To Use

Do not use as a standalone diagnostic or treatment tool.

Avoid deploying in unsupervised patient‑facing systems without human oversight.

Failure Modes

Hallucinations in factual answers despite CoT augmentation.

Jailbreaks and instruction‑injection bypassing alignment.

Core Entities

Models

Mistral-7B, Llama 3 8B, Mixtral-8x7B, Mistral-Aloe-7B-v1, Llama3-Aloe-8B-v1, Llama3-Aloe-8B-Merged-v1, Llama3-Aloe-8B-Merged-DPO-RT-v1 (Alpha)

Metrics

Accuracy, Attack Success Rate (ASR), pct_stereotype (CrowsPairs), toxicity (Toxigen), ethics score (Hendrycks Ethics)

Datasets

MedQA, MedMCQA, PubMedQA, CareQA, MMLU (medical subsets), MedQuAD, MedInstruct, Medical Guidelines (EPFL) synthetic, asclepius_*, medmcqa_cot/pubmedqa_cot (synthetic CoT)

Benchmarks

MultiMedQA, MedMCQA, MedQA, PubMedQA, MMLU-Med, CareQA, CrowsPairs, Hendrycks Ethics, TruthfulQA, Toxigen/Toxigen Generation

Context Entities

Models

SFR-Embedding-Mistral, pubmedbert-base-embedding, MedCPT-Query-Encoder, UAE-Large-v1

Metrics

ensemble majority voting, self-consistency sampling

Datasets

HelpSteer, argilla_dpo-mix-7k, Anthropic Harmless (seed ideas), custom_redteaming_dataset (1,386 entries)