Aloe: open 7B–8B medical LLMs using synthetic Chain-of-Thought, model merging and Direct Preference Optimization

May 3, 2024 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

6

Authors

Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Jordi Bayarri-Planas, Adrian Tormos, Daniel Hinjos, Pablo Bernabeu-Perez, Anna Arias-Duart, Pablo Agustin Martin-Torres, Lucia Urcelay-Ganzabal, Marta Gonzalez-Mallo, Sergio Alvarez-Napagao, Eduard Ayguadé-Parra, Ulises Cortés, Dario Garcia-Gasulla

Links

Abstract / PDF

Why It Matters For Business

Aloe shows practical, low‑cost ways to push open medical LLMs: generate CoT examples with a stronger model, merge fine‑tuned variants, and use retrieval‑style prompting to get 2–7 point accuracy gains without larger models or expensive pretraining.

Summary TLDR

Aloe is a family of open medical LLMs (7B–8B) built by fine-tuning strong open base models (Mistral, LLaMA 3). The team combined large curated medical QA collections, synthetically generated Chain-of-Thought (CoT) answers, model merging, and Direct Preference Optimization (DPO) alignment with red teaming. On common medical benchmarks, Aloe (Llama3‑Aloe‑8B‑Alpha) is the strongest open model at its scale: it gains ~1–2 points absolute over Llama‑3‑8B in zero-shot, reaches up to 76.9% average with Medprompt, and shows modest safety improvements after DPO (ASR 0.56→0.52). The code, data, and model artifacts for the DPO release are shared under CC BY-NC 4.0. Evidence: benchmark tables, ablations, and red‑teaming/

Problem Statement

Open medical LLMs lag behind closed models. Continued pretraining has limited payoff. Which combination of instruct tuning, synthetic CoT, model merging, alignment (DPO) and prompt engineering gives the best practical gains for competitive, open healthcare LLMs?

Main Contribution

Built Aloe, a family of open healthcare LLMs (Mistral‑ and LLaMA‑based) in the 7B–8B range and released an aligned DPO variant under CC BY‑NC 4.0.

Assembled and cleaned a large supervised fine‑tuning mix: ~348k curated medical QA pairs plus general QA and 210k+ synthetically CoT‑enhanced items, totaling ~500M tokens for SFT.

Generated synthetic Chain‑of‑Thought answers using Mixtral‑8x7B to improve benchmark training splits and produce a CoT example database for retrieval‑style prompting.

Applied model merging (Mergekit/DARE‑TIES) to combine fine‑tuned model variants, yielding the largest single-model gain in ablations.

Performed two‑stage Direct Preference Optimization (DPO) alignment guided by a red‑teaming dataset and measured safety with Attack Success Rate (ASR) and standard bias/toxicity benchmarks.

Systematically evaluated advanced inference strategies (Self‑Consistency CoT and Medprompt: KNN few‑shot + CoT + ensemble voting) and published a prompting repository.
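The DPO alignment step listed above optimizes a pairwise preference objective. A minimal sketch of the standard per-pair DPO loss follows, assuming the summed sequence log-probabilities are already available from the policy and from a frozen reference model. The function name, the argument names, and beta=0.1 are illustrative assumptions, not values taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin))).

    Each argument is the summed log-probability of a full response.
    beta=0.1 is an assumed default, not a value reported in the summary.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch to avoid overflow.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# When the policy equals the reference, the loss is log(2).
neutral = dpo_loss(-3.0, -3.0, -3.0, -3.0)
# When the policy favors the chosen response more than the reference does,
# the loss falls below log(2).
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

The loss only depends on how much the policy shifts probability toward the chosen response relative to the reference, which is why a small preference set built from red-teaming prompts can still move behavior.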

Key Findings

Aloe's aligned 8B variant outperforms Llama‑3‑8B‑Instruct across medical benchmarks at this size.

Numbers: zero‑shot avg 70.25 vs 68.89 (Llama‑3‑8B); Table 3

Medprompt (KNN few‑shot + CoT + majority voting) substantially improves accuracy at inference.

Numbers: Medprompt peak avg 76.88% for Aloe (20 ensembles); Table 25; reported +7% vs baseline

Model merging gave the largest single gain in the ablation study.

Numbers: Llama3‑Aloe‑8B‑Merged avg 70.31 vs Llama3‑Aloe‑8B‑v1 65.16 (+5.15); Table 6

DPO alignment plus red teaming reduced unsafe responses modestly.

Numbers: ASR overall dropped 0.56 → 0.52 after DPO (lower is better); Fig.4/§5.3

Synthetic CoT generation improved training examples and retrieval prompts.

Numbers: generated CoT for MedQA/MedMCQA/PubMedQA and added 34,219 QA from guidelines; final SFT dataset ≈500M tokens; §3.2 & A.
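The Attack Success Rate cited in the findings above is a simple ratio: the share of adversarial prompts that elicit an unsafe response. The sketch below assumes each red-teaming prompt has already been judged unsafe or safe by some external judge. The helper names and the toy judgements are hypothetical; only the 0.56 and 0.52 figures come from the summary.

```python
def attack_success_rate(unsafe_flags):
    """Fraction of attack prompts judged to have produced an unsafe response."""
    flags = list(unsafe_flags)
    if not flags:
        raise ValueError("need at least one judged prompt")
    return sum(1 for flag in flags if flag) / len(flags)


def relative_reduction(before, after):
    """Relative drop in a lower-is-better metric."""
    return (before - after) / before


# Made-up judgements for four prompts, purely to show the calculation.
toy_flags = [True, False, True, True]
toy_asr = attack_success_rate(toy_flags)

# Reported overall ASR before and after DPO.
drop = relative_reduction(0.56, 0.52)
```

On the reported figures, the absolute drop of 0.04 corresponds to a relative reduction of roughly 7%, which is consistent with the summary calling the improvement modest.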

Results

Accuracy (zero‑shot average)

Value: 70.25

Baseline: Llama‑3‑8B‑Instruct 68.89

Accuracy (Medprompt average, 20 ensembles)

Value: 76.88

Baseline: Meta Llama‑3‑8B pubmedbert 74.06 (same setting)

Ablation: merging effect (average medical score)

Value: 70.31 (Llama3 merged)

Baseline: 65.16 (Llama3 v1)

Attack Success Rate (ASR) overall

Value: 0.52 (post‑DPO)

Baseline: 0.56 (pre‑DPO)


Who Should Care

What To Try In 7 Days

Build a small CoT example DB by prompting a stronger model on your task and reuse it for KNN few‑shot prompts.

Evaluate model merging (Mergekit) on two or three fine‑tuned checkpoints before buying larger models.

Run a focused red‑teaming pass and create a tiny DPO preference set for the worst‑case prompts you care about.
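The first suggestion above, reusing a CoT example database for KNN few-shot prompts, can be sketched as follows. This is a minimal illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, and the record fields (`question`, `cot`, `answer`) and prompt layout are hypothetical, not the format used by the authors.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a placeholder for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_examples(question, cot_db, k=5):
    """Return the k stored examples whose questions are most similar."""
    query = embed(question)
    ranked = sorted(cot_db,
                    key=lambda ex: cosine(query, embed(ex["question"])),
                    reverse=True)
    return ranked[:k]


def build_prompt(question, examples):
    """Concatenate retrieved CoT examples, then the new question."""
    parts = [
        f"Question: {ex['question']}\nReasoning: {ex['cot']}\nAnswer: {ex['answer']}"
        for ex in examples
    ]
    parts.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(parts)
```

In practice the toy `embed` would be swapped for a dense embedding model, and the stored vectors would be precomputed once rather than recomputed per query.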

Optimization Features

Token Efficiency

  • 5 few‑shot examples recommended (balances quality against context length)
  • tradeoff: 20 ensembles gain ~1% over 5 but cost ~4x
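The ensemble tradeoff above can be made concrete with a small calculation. The sketch assumes inference cost scales linearly with the number of ensemble members (one generation per member, no caching), which is a simplification; the function name is hypothetical.

```python
def ensemble_tradeoff(small_n, large_n, gain_pts):
    """Return (cost multiplier, accuracy points gained per extra model call)."""
    cost_ratio = large_n / small_n
    extra_calls = large_n - small_n
    return cost_ratio, gain_pts / extra_calls


# 5 vs 20 ensemble members, with the ~1 point gain quoted above.
ratio, per_call = ensemble_tradeoff(5, 20, 1.0)
```

Under that assumption, each of the 15 extra calls per question buys well under a tenth of a point, which is why 5 members is often the practical default.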

Model Optimization

  • model merging (DARE‑TIES / Mergekit)
  • LoRA
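To show what DARE‑TIES merging does conceptually, here is a minimal sketch on flat lists of floats. It assumes the usual description of the two ideas: DARE randomly drops entries of each fine-tuning delta and rescales the survivors, and TIES elects a sign per parameter and averages only the deltas that agree with it. Mergekit itself works on tensors with per-model weights and densities; the function names, the default density, and the tie-breaking toward positive are illustrative assumptions.

```python
import random


def dare(delta, density, rng):
    """Keep each entry with probability `density`; rescale kept entries by 1/density."""
    return [d / density if rng.random() < density else 0.0 for d in delta]


def ties_merge(deltas):
    """Elect a sign per parameter from the summed deltas, then average agreeing entries."""
    merged = []
    for column in zip(*deltas):
        sign = 1.0 if sum(column) >= 0 else -1.0  # ties resolved as positive (assumption)
        agreeing = [d for d in column if d * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


def dare_ties(base, finetuned_models, density=0.5, seed=0):
    """Merge several fine-tuned parameter vectors back onto a shared base."""
    rng = random.Random(seed)
    deltas = [
        dare([f - b for f, b in zip(model, base)], density, rng)
        for model in finetuned_models
    ]
    return [b + m for b, m in zip(base, ties_merge(deltas))]
```

With `density=1.0` nothing is dropped, so the result reduces to plain sign-elected averaging; lower densities sparsify each delta before the sign election, which is intended to reduce interference between the merged checkpoints.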

System Optimization

  • use of SFR-Embedding-Mistral or PubmedBERT embeddings depending on example DB size

Training Optimization

  • templating to increase prompt variety
  • DEITA filtering to prune low‑quality pairs
  • synthetic CoT generation to enrich train splits

Inference Optimization

  • Medprompt: KNN few‑shot retrieval + CoT + ensemble majority voting
  • Self‑consistency CoT (ensembles)
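The ensemble voting shared by both strategies above can be sketched as repeated sampling followed by a majority vote. In this sketch `ask_model` is a hypothetical callable that returns the index of the option the model picked. Shuffling the answer options per sample follows the original Medprompt recipe; the summary does not say whether Aloe's setup does the same, so treat that as an assumption.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def self_consistency(question, options, ask_model, n_samples=5, seed=0):
    """Query the model n_samples times on shuffled options and vote on option text."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_samples):
        shuffled = options[:]
        rng.shuffle(shuffled)
        # ask_model is a hypothetical stand-in for one sampled CoT generation
        # whose final answer has been parsed into an index into `shuffled`.
        picked_index = ask_model(question, shuffled)
        votes.append(shuffled[picked_index])
    return majority_vote(votes)
```

Voting on the option text rather than the option letter keeps votes comparable across shuffles, since the same letter can point to different options in different samples.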

Reproducibility

License

  • CC BY-NC 4.0 (for the released DPO model)

Code Urls

  • prompting repository (the authors state they share it; URL in the paper's references and appendix)
  • merge configurations (shared by authors)

Data Urls

  • training data and CoT examples (authors state they distribute training data for the released DPO model)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Not intended for clinical use; authors explicitly prohibit clinical deployment.
  • DPO reduces but does not eliminate unsafe outputs; jailbreaks remain effective.
  • Some safety/benchmark differences may stem from differing DPO dataset sizes (the authors note the first‑stage DPO set is small, ~11k).
  • Only the DPO variant is publicly released; other artifacts and the full training code may be only partially available.

When Not To Use

  • Do not use as a standalone diagnostic or treatment tool.
  • Avoid deploying in unsupervised patient‑facing systems without human oversight.
  • Avoid use-cases requiring guaranteed non‑hallucination or regulatory medical device compliance.

Failure Modes

  • Hallucinations in factual answers despite CoT augmentation.
  • Jailbreaks and instruction‑injection bypassing alignment.
  • Overly cautious refusals or sycophancy depending on prompt phrasing.
  • Residual bias and toxic outputs on some inputs (Toxigen rising for Aloe in Table 5).

Core Entities

Models

  • Mistral-7B
  • Llama 3 8B
  • Mixtral-8x7B
  • Mistral-Aloe-7B-v1
  • Llama3-Aloe-8B-v1
  • Llama3-Aloe-8B-Merged-v1
  • Llama3-Aloe-8B-Merged-DPO-RT-v1 (Alpha)

Metrics

  • Accuracy
  • Attack Success Rate (ASR)
  • pct_stereotype (CrowsPairs)
  • toxicity (Toxigen)
  • ethics score (Hendrycks Ethics)

Datasets

  • MedQA
  • MedMCQA
  • PubMedQA
  • CareQA
  • MMLU (medical subsets)
  • MedQuAD
  • MedInstruct
  • Medical Guidelines (EPFL) synthetic
  • asclepius_*
  • medmcqa_cot/pubmedqa_cot (synthetic CoT)

Benchmarks

  • MultiMedQA
  • MedMCQA
  • MedQA
  • PubMedQA
  • MMLU-Med
  • CareQA
  • CrowsPairs
  • Hendrycks Ethics
  • TruthfulQA
  • Toxigen/Toxigen Generation

Context Entities

Models

  • SFR-Embedding-Mistral
  • pubmedbert-base-embedding
  • MedCPT-Query-Encoder
  • UAE-Large-v1

Metrics

  • ensemble majority voting
  • self-consistency sampling

Datasets

  • HelpSteer
  • argilla_dpo-mix-7k
  • Anthropic Harmless (seed ideas)
  • custom_redteaming_dataset (1,386 entries)