One-stage domain adaptation: turn varied medical corpora into instruction–response pairs and train in a single pass

November 16, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method reduces pipeline complexity and shows consistent benchmark and expert-evaluation gains, but the model is still not safe for autonomous medical advice; apply with clinician oversight.

Citations: 24

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Junying Chen, Xidong Wang, Ke Ji, Anningzhe Gao, Feng Jiang, Shunian Chen, Hongbo Zhang, Dingjie Song, Wenya Xie, Chuyi Kong, Jianquan Li, Xiang Wan, Haizhou Li, Benyou Wang

Why It Matters For Business

One-stage adaptation simplifies the pipeline and removes costly two-stage tuning while still delivering strong domain performance, so teams can build competitive medical models faster and with less stage-specific hyperparameter work.

Summary TLDR

The authors propose a one-stage domain adaptation protocol that converts diverse medical pre-training corpora into unified instruction–response pairs and trains domain and instruction data together with a priority sampler. Using this recipe they release HuatuoGPT-II (7B/13B) for Chinese medical tasks. One-stage training improves stability and generalization versus the standard two-stage pipeline and yields state-of-the-art results on multiple Chinese medical benchmarks and a fresh 2023 pharmacist exam used to reduce data contamination risks.

Problem Statement

Two-stage domain adaptation (continued pre-training followed by supervised fine-tuning) introduces an optimization mismatch, two distribution shifts, and catastrophic forgetting. The pipeline is complex to tune and can degrade the model's ability to follow prompts. The authors ask: can a single-stage protocol unify heterogeneous pre-training and SFT data to inject domain knowledge more stably and effectively?

Main Contribution

A one-stage domain adaptation protocol that rewrites pre-training corpora into instruction–response pairs and trains all data together with a priority sampler.

HuatuoGPT-II, a Chinese medical model trained with the one-stage protocol (7B and 13B variants) with public code/data.
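
The data-unification step in the first contribution can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the prompt wording, the `Question:`/`Answer:` output format, the `rewrite_fn` callable (a stand-in for whichever LLM performs the rewriting), and the `source` labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

# Hypothetical prompt; the paper's actual rewriting prompt is not reproduced here.
REWRITE_PROMPT = (
    "Read the medical passage below. Write one question a user might ask that "
    "the passage answers, then answer it using only the passage.\n"
    "Format:\nQuestion: <question>\nAnswer: <answer>\n\nPassage:\n{passage}"
)


@dataclass
class Pair:
    instruction: str
    response: str
    source: str  # assumed labels: "pretrain" (rewritten corpus) or "sft"


def parse_pair(text: str, source: str) -> Optional[Pair]:
    """Parse 'Question: ... Answer: ...' output; return None if malformed."""
    head, a_sep, answer = text.partition("Answer:")
    _, q_sep, question = head.partition("Question:")
    question, answer = question.strip(), answer.strip()
    if not (a_sep and q_sep and question and answer):
        return None
    return Pair(question, answer, source)


def unify(passages: Iterable[str], rewrite_fn: Callable[[str], str]) -> List[Pair]:
    """Rewrite raw passages into instruction-response pairs, skipping failures."""
    pairs: List[Pair] = []
    for passage in passages:
        raw = rewrite_fn(REWRITE_PROMPT.format(passage=passage))
        pair = parse_pair(raw, source="pretrain")
        if pair is not None:  # drop outputs that do not follow the format
            pairs.append(pair)
    return pairs
```

Dropping malformed outputs rather than repairing them keeps the sketch simple; since the quality of unification affects downstream performance, a real pipeline would likely add stricter validation at this point.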

Key Findings

One-stage training outperforms conventional two-stage adaptation across medical datasets

Numbers: 5.3%–23% relative gains on six datasets (one-stage vs two-stage)

Practical Use: Convert pre-training text to instruction–response pairs and train on everything together to get measurably better medical performance and more stable training.

Evidence Ref: Section 4.4 / Fig. 6

HuatuoGPT-II (13B) achieves strong medical benchmark scores

Numbers: 58.47 average on selected Chinese medical benchmarks (HuatuoGPT-II 13B)

Practical Use: A medium-sized model trained with one-stage adaptation can reach near-top performance on Chinese medical tasks; consider this pipeline instead of relying only on scaling model size.

Evidence Ref: Table 1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Average medical benchmark score | 58.47 | Baichuan2-13B-Chat 49.77 | +8.7 | Average across MedQA, MedMCQA, CMB, CMExam, CMMLU (medical), C-Eval (medical) | Table 1: HuatuoGPT-II (13B) average 58.47 vs Baichuan2-13B-Chat 49.77 | Table 1 |
| 2023 Pharmacist Licensure Exam (total score) | 52.9 | GPT-4 57.3 | -4.4 | 2023 Pharmacist Licensure Examination (Pharmacy track) | Table 2: HuatuoGPT-II (13B) total 52.9, GPT-4 57.3 | Table 2 |
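
The deltas in the Results rows are absolute score differences. The short check below recomputes them and also derives the relative change against each baseline; the relative figures are computed here from the reported values and are not stated in the source.

```python
def deltas(value: float, baseline: float) -> tuple:
    """Return (absolute, relative) difference of value against baseline."""
    absolute = value - baseline
    return absolute, absolute / baseline


rows = [
    ("Average medical benchmark score", 58.47, 49.77),  # vs Baichuan2-13B-Chat
    ("2023 Pharmacist Licensure Exam", 52.9, 57.3),     # vs GPT-4
]

for name, value, baseline in rows:
    abs_d, rel_d = deltas(value, baseline)
    print(f"{name}: {abs_d:+.2f} points ({rel_d:+.1%})")

# Average medical benchmark score: +8.70 points (+17.5%)
# 2023 Pharmacist Licensure Exam: -4.40 points (-7.7%)
```

These relative changes compare against the listed baselines, so they are not the same quantity as the 5.3%–23% one-stage versus two-stage gains reported under Key Findings.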

What To Try In 7 Days

Convert a small fraction of your domain corpus into instruction–response pairs and run mixed one-stage training.

Implement a priority sampler (β≈2) that starts with knowledge-dense data then shifts to SFT-style data.

Run a fresh, out-of-time evaluation set (such as a recent exam) to check for data contamination.
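
One way to prototype the priority sampler from the list above is weighted sampling without replacement. The sketch below assumes that every remaining example derived from pre-training data carries weight β and every SFT example carries weight 1; the paper's exact formula and schedule may differ, and `priority_order` and the `"pretrain"`/`"sft"` labels are illustrative names.

```python
import random
from typing import List, Sequence


def priority_order(sources: Sequence[str], beta: float = 2.0, seed: int = 0) -> List[int]:
    """Return a training order (list of indices) favouring high-priority data early.

    Each example gets an exponential "arrival time" with rate equal to its
    weight. Sorting by arrival time is equivalent to repeatedly drawing, without
    replacement, with probability proportional to the weights still remaining.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    weights = {"pretrain": beta, "sft": 1.0}  # assumed priorities
    rng = random.Random(seed)
    keyed = [(rng.expovariate(weights[src]), idx) for idx, src in enumerate(sources)]
    return [idx for _, idx in sorted(keyed)]


if __name__ == "__main__":
    sources = ["pretrain"] * 1000 + ["sft"] * 1000
    order = priority_order(sources, beta=2.0)
    half = len(order) // 2
    early = sum(sources[i] == "pretrain" for i in order[:half])
    late = sum(sources[i] == "pretrain" for i in order[half:])
    print(f"pretrain-derived examples: first half {early}, second half {late}")
```

Because sampling is without replacement, the higher-weighted pool is depleted faster, so the mix drifts toward SFT-style data as training proceeds. Larger β makes that shift sharper, and β = 1 reduces to a uniform shuffle.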

Optimization Features

Token Efficiency
4096-token training sequences built by concatenating instructions
System Optimization
ZeRO-based distributed training across 8 A100 GPUs
Training Optimization
One-stage unified training to reduce distribution shift
Priority sampling (β scheduling) to order data exposure
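
The token-efficiency item can be illustrated with a greedy packing routine. This is a sketch under assumptions, not the authors' concatenation code: examples are taken to be already tokenized into lists of token ids, `eos_id` is a hypothetical separator id, and examples longer than the window are truncated.

```python
from typing import Iterable, List, Sequence


def pack_sequences(
    examples: Iterable[Sequence[int]],
    max_len: int = 4096,
    eos_id: int = 2,  # hypothetical end-of-sequence id; depends on the tokenizer
) -> List[List[int]]:
    """Greedily concatenate tokenized examples into sequences of at most max_len."""
    packed: List[List[int]] = []
    current: List[int] = []
    for tokens in examples:
        # Truncate so the example plus its separator always fits in one window.
        piece = list(tokens[: max_len - 1]) + [eos_id]
        if current and len(current) + len(piece) > max_len:
            packed.append(current)  # close the current window and start a new one
            current = []
        current.extend(piece)
    if current:
        packed.append(current)
    return packed


if __name__ == "__main__":
    lengths = [3000, 1000, 500, 5000]
    batches = pack_sequences([[1] * n for n in lengths])
    print([len(b) for b in batches])  # [4002, 501, 4096]
```

Packing reduces padding waste. A real training setup would typically also need attention and loss masks so that concatenated examples do not attend to one another, which this sketch omits.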

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Not certified for clinical use; authors explicitly warn against using it for medical advice.

Main evaluation and data focus on Chinese; English performance lags on English-native exams.

When Not To Use

Do not deploy as an unsupervised clinical decision system.

Avoid using it as the sole source for critical medical recommendations.

Failure Modes

Hallucinations in Chinese medical phrasing remain possible.

Performance depends on quality of data unification; poor unification can inject errors.

Core Entities

Models

HuatuoGPT-II (7B), HuatuoGPT-II (13B), Baichuan2-7B-Base, Baichuan2-13B-Base, Baichuan2-7B-Chat, Baichuan2-13B-Chat, Qwen-7B-Chat, Qwen-14B-Chat, ChatGLM3-6B, HuatuoGPT, DISC-MedLLM, ChatGPT, GPT-4, ERNIE Bot

Metrics

Accuracy, win rate (pairwise), average score (exam total), automated quality scores (GPT-4 ratings)

Datasets

Huatuo-26M, MedQA, MedMCQA, CMB, CMExam, C-Eval, CMMLU, KUAKE-QIC, Med-dialog, ShareGPT, 2023 Chinese Pharmacist Licensure Exam, PubMed, C4

Benchmarks

MedQA, MedMCQA, CMB, CMExam, C-Eval (medical parts), CMMLU (medical parts), 2023 Pharmacist Licensure Examination