One-stage domain adaptation: turn varied medical corpora into instruction–response pairs and train in a single pass

November 16, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

24

Authors

Junying Chen, Xidong Wang, Ke Ji, Anningzhe Gao, Feng Jiang, Shunian Chen, Hongbo Zhang, Dingjie Song, Wenya Xie, Chuyi Kong, Jianquan Li, Xiang Wan, Haizhou Li, Benyou Wang

Links

Abstract / PDF

Why It Matters For Business

One-stage adaptation simplifies the training pipeline and avoids costly two-stage tuning while still delivering strong domain performance, so teams can build competitive medical models faster and with less stage-specific hyperparameter work.

Summary TLDR

The authors propose a one-stage domain adaptation protocol that converts diverse medical pre-training corpora into unified instruction–response pairs and trains domain and instruction data together with a priority sampler. Using this recipe they release HuatuoGPT-II (7B/13B) for Chinese medical tasks. One-stage training improves stability and generalization versus the standard two-stage pipeline and yields state-of-the-art results on multiple Chinese medical benchmarks and a fresh 2023 pharmacist exam used to reduce data contamination risks.

Problem Statement

Two-stage domain adaptation (continued pre-training followed by supervised fine-tuning) introduces an optimization mismatch, two distribution shifts, and catastrophic forgetting. The pipeline is complex to tune and can degrade the model's ability to follow prompts. The authors ask: can a single-stage protocol unify heterogeneous pre-training and SFT data to inject domain knowledge more stably and effectively?

Main Contribution

A one-stage domain adaptation protocol that rewrites pre-training corpora into instruction–response pairs and trains all data together with a priority sampler.

HuatuoGPT-II, a Chinese medical model trained with the one-stage protocol (7B and 13B variants) with public code/data.

A fresh evaluation using the 2023 Chinese National Pharmacist Licensure Examination to reduce test-data leakage and test generalization.

Key Findings

One-stage training outperforms conventional two-stage adaptation across medical datasets

Numbers: 5.3%–23% relative gains on six datasets (one-stage vs two-stage)

HuatuoGPT-II (13B) achieves strong medical benchmark scores

Numbers: 58.47 average on selected Chinese medical benchmarks (HuatuoGPT-II 13B)

Experts judge HuatuoGPT-II competitive with GPT-4 on medical responses

Numbers: 73% favorable (win or tie) in licensed-physician pairwise comparison vs GPT-4

Data unification is necessary for the one-stage benefit

Numbers: CMExam, one-stage with unification 53.4 vs one-stage without unification 49.3 (Δ = 4.1)
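
The unification step behind this finding can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the prompt wording, the `unify_passage` helper, and the `rewrite_fn` callable (a stand-in for whatever LLM performs the rewrite) are all assumptions introduced here.

```python
from typing import Callable, Optional

# Hypothetical prompt template; the paper's actual prompts are not reproduced here.
PROMPT_TEMPLATE = (
    "Read the passage below and write one question that it answers, "
    "followed by a detailed answer grounded only in the passage.\n"
    "Format:\nQuestion: <question>\nAnswer: <answer>\n\n"
    "Passage:\n{passage}"
)


def unify_passage(passage: str, rewrite_fn: Callable[[str], str]) -> Optional[dict]:
    """Turn one raw corpus passage into an instruction-response pair.

    rewrite_fn is any callable mapping a prompt string to model output text
    (assumption: supply your own LLM client). Returns None when the output
    does not follow the expected format, so malformed pairs can be dropped.
    """
    output = rewrite_fn(PROMPT_TEMPLATE.format(passage=passage))
    q_pos = output.find("Question:")
    if q_pos == -1:
        return None
    a_pos = output.find("Answer:", q_pos)
    if a_pos == -1:
        return None
    instruction = output[q_pos + len("Question:"):a_pos].strip()
    response = output[a_pos + len("Answer:"):].strip()
    if not instruction or not response:
        return None
    return {"instruction": instruction, "response": response}
```

Dropping malformed outputs rather than repairing them is a deliberate choice in this sketch: a pair that misstates the source passage would inject errors into training.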

Results

Average medical benchmark score

Value: 58.47

Baseline: Baichuan2-13B-Chat 49.77

2023 Pharmacist Licensure Exam (total score)

Value: 52.9

Baseline: GPT-4 57.3

Human expert pairwise favorable rate vs GPT-4

Value: 73%

Automated pairwise win rate vs GPT-4 (GPT-4 judge)

Value: 78%

What To Try In 7 Days

Convert a small fraction of your domain corpus into instruction–response pairs and run mixed one-stage training.

Implement a priority sampler (β≈2) that starts with knowledge-dense data then shifts to SFT-style data.

Run a fresh, out-of-time evaluation set (such as a recently released exam) to check for data contamination.
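
One way to set up the out-of-time check is to split evaluation items by publication date relative to the training-data cutoff and compare accuracy on each side. The sketch below assumes each item is a dict with hypothetical `published` (a `datetime.date`) and `correct` (a bool) fields; neither the field names nor the cutoff come from the paper.

```python
from datetime import date


def split_by_cutoff(items, cutoff):
    """Separate items published after the training cutoff from older ones."""
    fresh = [it for it in items if it["published"] > cutoff]
    possibly_seen = [it for it in items if it["published"] <= cutoff]
    return fresh, possibly_seen


def accuracy(items):
    """Fraction of items answered correctly, or None for an empty list."""
    if not items:
        return None
    return sum(1 for it in items if it["correct"]) / len(items)


def contamination_gap(items, cutoff):
    """Accuracy on possibly-seen items minus accuracy on fresh items.

    A large positive gap is a hint (not proof) that older items leaked
    into the training data.
    """
    fresh, possibly_seen = split_by_cutoff(items, cutoff)
    acc_fresh, acc_seen = accuracy(fresh), accuracy(possibly_seen)
    if acc_fresh is None or acc_seen is None:
        return None
    return acc_seen - acc_fresh
```

A gap near zero does not rule out contamination, since the two item sets may differ in difficulty; it is only a first sanity check.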

Optimization Features

Token Efficiency

  • 4096-token training sequences built by concatenating instruction examples
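
A minimal sketch of this kind of packing, assuming examples are already tokenized into lists of integer ids that each end with their own end-of-sequence token. The greedy strategy and the `pack_examples` name are illustrative assumptions; the summary does not say how the authors handle overlong examples or attention masking across packed examples.

```python
def pack_examples(tokenized_examples, max_len=4096):
    """Greedily concatenate tokenized examples into sequences of at most max_len ids.

    Examples longer than max_len are truncated (an assumption of this sketch).
    An example is never split across two packed sequences.
    """
    packed = []
    current = []
    for ids in tokenized_examples:
        ids = list(ids[:max_len])
        # Start a new sequence when this example would overflow the current one.
        if current and len(current) + len(ids) > max_len:
            packed.append(current)
            current = []
        current.extend(ids)
    if current:
        packed.append(current)
    return packed
```

Packing short instruction pairs this way reduces padding, so more of each 4096-token batch slot carries real training signal.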

System Optimization

  • ZeRO-based distributed training across 8 A100 GPUs
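
For orientation, a ZeRO setup of this kind is commonly expressed as a DeepSpeed JSON config. The sketch below builds one as a Python dict. DeepSpeed itself, the ZeRO stage, the batch sizes, and bf16 are all assumptions here; the summary only states "ZeRO-based" training on 8 A100 GPUs.

```python
import json

NUM_GPUS = 8  # from the summary: 8 A100 GPUs

# Placeholder values; none of these numbers are taken from the paper.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # assumption
    "gradient_accumulation_steps": 16,     # assumption
    "bf16": {"enabled": True},             # assumption: A100s support bf16
    "zero_optimization": {"stage": 2},     # assumption: the stage is not stated
}


def global_batch_size(config, num_gpus):
    """Sequences consumed per optimizer step across all GPUs."""
    return (
        config["train_micro_batch_size_per_gpu"]
        * config["gradient_accumulation_steps"]
        * num_gpus
    )


def to_json(config):
    """Serialize the config in the JSON form DeepSpeed reads from disk."""
    return json.dumps(config, indent=2)
```

With these placeholder values, each optimizer step covers 1 × 16 × 8 = 128 packed sequences.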

Training Optimization

  • One-stage unified training to reduce distribution shift
  • Priority sampling (β scheduling) to order data exposure
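
One plausible reading of the priority sampler is weighted sampling without replacement, where each remaining domain (knowledge-dense) example carries weight β and each remaining SFT-style example carries weight 1. That weighting scheme and the `priority_order` helper are assumptions of this sketch, not a formula quoted from the summary. Because domain examples are drawn faster, the stream starts domain-heavy and drifts toward SFT data as the domain pool empties.

```python
import random


def priority_order(domain, sft, beta=2.0, seed=0):
    """Return all examples in a sampled order that favors domain data early.

    At every step the next example comes from the domain pool with probability
    beta * n_domain / (beta * n_domain + n_sft), where n_* are the counts still
    remaining. Each example is emitted exactly once.
    """
    rng = random.Random(seed)
    domain, sft = list(domain), list(sft)
    # Shuffling once and popping from the end is a uniform draw within each pool.
    rng.shuffle(domain)
    rng.shuffle(sft)
    order = []
    while domain or sft:
        w_domain = beta * len(domain)
        w_sft = float(len(sft))
        if rng.random() * (w_domain + w_sft) < w_domain:
            order.append(domain.pop())
        else:
            order.append(sft.pop())
    return order
```

With β = 1 this reduces to a plain uniform shuffle of the combined data, while a very large β approaches a strict two-phase schedule (all domain data first, then all SFT data), which resembles the two-stage pipeline the protocol is meant to avoid.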

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Not certified for clinical use; authors explicitly warn against using it for medical advice.
  • Main evaluation and data focus on Chinese; English performance lags on English-native exams.
  • Dependence on LLMs (e.g., ChatGPT) for data unification may raise provenance concerns, though authors provide self-unification experiments.

When Not To Use

  • Do not deploy as an unsupervised clinical decision system.
  • Avoid using it as the sole source for critical medical recommendations.
  • If your domain data cannot be rewritten into faithful instruction–response pairs, the one-stage benefit may not appear.

Failure Modes

  • Hallucinations in Chinese medical phrasing remain possible.
  • Performance depends on quality of data unification; poor unification can inject errors.
  • Mis-tuned priority β (too low or too high) degrades results.

Core Entities

Models

  • HuatuoGPT-II (7B)
  • HuatuoGPT-II (13B)
  • Baichuan2-7B-Base
  • Baichuan2-13B-Base
  • Baichuan2-7B-Chat
  • Baichuan2-13B-Chat
  • Qwen-7B-Chat
  • Qwen-14B-Chat
  • ChatGLM3-6B
  • HuatuoGPT
  • DISC-MedLLM
  • ChatGPT
  • GPT-4
  • ERNIE Bot

Metrics

  • Accuracy
  • win rate (pairwise)
  • average score (exam total)
  • automated quality scores (GPT-4 ratings)

Datasets

  • Huatuo-26M
  • MedQA
  • MedMCQA
  • CMB
  • CMExam
  • C-Eval
  • CMMLU
  • KUAKE-QIC
  • Med-dialog
  • ShareGPT
  • 2023 Chinese Pharmacist Licensure Exam
  • PubMed
  • C4

Benchmarks

  • MedQA
  • MedMCQA
  • CMB
  • CMExam
  • C-Eval (medical parts)
  • CMMLU (medical parts)
  • 2023 Pharmacist Licensure Examination