AlpaCare: fine-tuning LLaMA with a 52k machine-generated medical instruction dataset to improve medical and general instruction following

October 23, 2023 | 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

15

Authors

Xinlu Zhang, Chenxin Tian, Xianjun Yang, Lichang Chen, Zekun Li, Linda Ruth Petzold

Why It Matters For Business

A small, diverse machine-generated medical instruction dataset can improve both medical answer quality and general instruction-following, offering a cost-effective way to build better clinical assistants while keeping development and data costs lower than large human-annotation efforts.

Summary TLDR

The authors build MedInstruct-52k, a 52,000-pair medical instruction-response dataset seeded from 167 clinician-crafted tasks, with GPT-4 generating new tasks and ChatGPT/GPT-3.5 generating the answers. They fine-tune LLaMA-family models on it to create AlpaCare. On free-form medical instruction tests and several medical benchmarks, AlpaCare outperforms open-source medical and general instruction-tuned models, and it also improves general-domain instruction following. Human clinicians preferred AlpaCare outputs for correctness and helpfulness. Data, code, and models are publicly released.

Problem Statement

Existing open medical LLMs use large but narrow datasets (benchmarks, dialogues, papers) that lack instruction diversity. That limits a model's ability to follow varied medical user intents. The paper asks: can a cost-effective, machine-generated but clinician-seeded instruction dataset improve medical instruction-following and generalization?

Main Contribution

MedInstruct-52k: a 52k medical instruction–response dataset generated by prompting GPT-4 with 167 clinician seed tasks and using ChatGPT/GPT-3.5 to produce answers.

AlpaCare: instruction-finetuned LLaMA-family models using MedInstruct-52k, released with code and data.

Empirical evaluation showing large improvements on free-form medical instruction tests, medical benchmarks, general instruction benchmarks, and human preference studies.

Key Findings

AlpaCare gives large absolute gains on free-form medical instruction evaluation compared to prior baselines.

Numbers: up to 38.1% absolute gain (paper claim)

AlpaCare improves average performance on general (non-medical) instruction benchmarks.

Numbers: 6.7% absolute average gain across several general benchmarks

Human clinicians preferred AlpaCare outputs over the best 13B baseline.

Numbers: 54% preferred for correctness; 69% preferred for helpfulness

MedInstruct-52k quality check by a clinician found nearly all sampled responses correct.

Numbers: 49/50 responses judged correct

Results

Free-form instruction evaluation (average win rate vs reference LLMs)

Value: ≈53.6% (iCliniq AVG judge score); ≈53.5% (MedInstruct AVG)

Baseline: best baselines ~24–39% (varies by model/test)
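As a rough illustration of how an average win rate against several reference LLMs can be computed from pairwise judge verdicts, here is a minimal sketch. The verdict labels, the function names, the toy data, and the choice to count a tie as half a win are all assumptions made for this example; the paper's exact scoring protocol is not described in this summary.

```python
from collections import Counter


def win_rate(verdicts):
    """Share of comparisons won; a tie counts as half a win (assumption)."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to score")
    return (counts["win"] + 0.5 * counts["tie"]) / total


def average_win_rate(verdicts_by_reference):
    """Mean of the per-reference win rates (unweighted across references)."""
    rates = [win_rate(v) for v in verdicts_by_reference.values()]
    return sum(rates) / len(rates)


# Hypothetical toy verdicts, not results from the paper.
toy_verdicts = {
    "reference_a": ["win", "win", "loss", "tie"],
    "reference_b": ["win", "loss", "loss", "loss"],
}
print(average_win_rate(toy_verdicts))  # 0.4375
```

Averaging per-reference rates (rather than pooling all verdicts) keeps a reference model with many test items from dominating the score.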

General-domain evaluation (AVG across AlpacaFarm, MMLU, BBH, TruthfulQA)

Value: 37.0% (AlpaCare AVG in Table 3)

Baseline: Alpaca 30.4% AVG

Human preference (pairwise clinician study)

Value: 54% preferred AlpaCare for correctness; 69% preferred for helpfulness

Baseline: PMC-13B (best 13B baseline)

MedInstruct-52k data quality check

Value: 49/50 responses judged correct by a clinician

API cost for dataset generation

Value: $900 (GPT-4 task generation) + $500 (GPT-3.5 response generation)
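To put the API cost in per-example terms, the arithmetic can be sketched as below. It takes the two dollar figures reported above and assumes the dataset contains exactly 52,000 pairs; the function names are illustrative only.

```python
GPT4_TASK_COST_USD = 900       # task generation, as reported above
GPT35_RESPONSE_COST_USD = 500  # response generation, as reported above
NUM_PAIRS = 52_000             # assumes "52k" means exactly 52,000 pairs


def total_cost_usd():
    """Total API spend for building the dataset."""
    return GPT4_TASK_COST_USD + GPT35_RESPONSE_COST_USD


def cost_per_pair_usd():
    """Average API spend per instruction-response pair."""
    return total_cost_usd() / NUM_PAIRS


print(f"total: ${total_cost_usd()}, per pair: ${cost_per_pair_usd():.4f}")
# total: $1400, per pair: $0.0269
```

That works out to roughly 2.7 cents per pair, excluding clinician time for writing the seed tasks and auditing samples.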

What To Try In 7 Days

Seed a small, clinician-curated task list (100–200 examples) and use GPT-4 to expand instructions.

Generate answers with a cheaper assistant (GPT-3.5) and run a rapid clinician audit on 30–50 samples.

Fine-tune an open LLaMA-family model on the generated pairs and compare outputs on a small held-out clinician test.
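The first two steps can be prototyped without committing to a particular API client. The sketch below only builds a task-expansion prompt from clinician seeds and draws a random audit sample. The prompt wording, function names, and record layout are hypothetical and are not the prompts used by the authors; sending the prompt to a model is left out on purpose.

```python
import random


def build_task_prompt(seed_tasks, num_examples=3, rng=None):
    """Show a few randomly chosen seed tasks and ask for one new task."""
    rng = rng or random.Random()
    examples = rng.sample(seed_tasks, k=min(num_examples, len(seed_tasks)))
    lines = ["Here are example medical tasks written by clinicians:"]
    for i, task in enumerate(examples, start=1):
        lines.append(f"{i}. {task}")
    # Hypothetical instruction text; adapt it to your own task format.
    lines.append("Write one new medical task that differs in topic and "
                 "format from the examples.")
    return "\n".join(lines)


def sample_for_audit(pairs, sample_size=50, rng=None):
    """Draw a random subset of generated pairs for clinician review."""
    rng = rng or random.Random()
    return rng.sample(pairs, k=min(sample_size, len(pairs)))


# Toy usage with placeholder seeds and placeholder generated pairs.
seeds = [f"Seed task {i}" for i in range(1, 11)]
print(build_task_prompt(seeds, rng=random.Random(0)))

generated = [{"instruction": f"task {i}", "output": "..."} for i in range(200)]
audit_set = sample_for_audit(generated, sample_size=50, rng=random.Random(0))
print(len(audit_set))  # 50
```

Sampling the audit set at random, rather than reviewing the first N outputs, avoids over-representing whichever seeds happened to be expanded first.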

Optimization Features

Training Optimization

  • Supervised instruction fine-tuning (cross-entropy) on 52k pairs
  • ROUGE-L-based deduplication filtering to increase textual diversity
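A ROUGE-L deduplication filter of this kind can be sketched as follows: compute a longest-common-subsequence F1 between a candidate instruction and every instruction already kept, and drop the candidate if any score reaches a threshold. The whitespace tokenization, the 0.7 threshold, and the function names are assumptions for illustration, not details confirmed by this summary.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over lowercased whitespace tokens (simplified tokenizer)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def dedup_instructions(instructions, threshold=0.7):
    """Keep an instruction only if it is dissimilar to all kept ones."""
    kept = []
    for text in instructions:
        if all(rouge_l_f1(text, k) < threshold for k in kept):
            kept.append(text)
    return kept


tasks = [
    "Summarize the discharge note for the patient",
    "Summarize the discharge note for this patient",  # near-duplicate
    "List common side effects of metformin",
]
print(dedup_instructions(tasks))
# ['Summarize the discharge note for the patient',
#  'List common side effects of metformin']
```

The filter is greedy and order-dependent: whichever near-duplicate appears first is the one that survives. It is also quadratic in the number of kept instructions, which is usually acceptable at the tens-of-thousands scale.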

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Dataset and responses are generated from teacher LLMs and can inherit hallucinations or biases.
  • Evaluations are limited to held-out tests and clinician preference studies; no prospective clinical deployment tests.
  • Model may still produce incorrect medical facts and is not validated for real-world clinical use.

When Not To Use

  • Do not use AlpaCare for autonomous diagnosis, treatment decisions, or any high-stakes clinical action without expert oversight.
  • Avoid deploying in patient-facing clinical workflows without regulatory and safety validation.

Failure Modes

  • Hallucinated medical facts presented confidently.
  • Coverage gaps for rare conditions not present in generated data.
  • Teacher-model biases propagate into responses.

Core Entities

Models

  • AlpaCare
  • LLaMA
  • LLaMA-2
  • LLaMA-3
  • Alpaca
  • ChatDoctor
  • MedAlpaca
  • PMC-LLaMA
  • Baize-Healthcare
  • GPT-4
  • ChatGPT
  • GPT-3.5-turbo
  • Text-davinci-003
  • Claude-2

Metrics

  • win rate (LLM-as-judge)
  • Accuracy
  • ROUGE-L
  • linguistic entropy
  • human preference (%)

Datasets

  • MedInstruct-52k
  • MedInstruct-test
  • iCliniq
  • MedQA
  • HeadQA
  • PubMedQA
  • MedMCQA
  • MeQSum
  • AlpacaFarm
  • MMLU
  • BBH
  • TruthfulQA
  • StrategyQA
  • DROP

Benchmarks

  • iCliniq free-form instruction evaluation
  • MedInstruct-test free-form evaluation
  • MedQA
  • HeadQA
  • PubMedQA
  • MedMCQA
  • MeQSum
  • AlpacaFarm
  • MMLU
  • BBH
  • TruthfulQA