Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
24
Why It Matters For Business
One-stage adaptation simplifies the training pipeline and avoids costly two-stage tuning while delivering strong domain performance, so teams can build competitive medical models faster and with less stage-specific hyperparameter work.
Summary TLDR
The authors propose a one-stage domain adaptation protocol that converts diverse medical pre-training corpora into unified instruction–response pairs and trains on domain and instruction data together, ordered by a priority sampler. Using this recipe they release HuatuoGPT-II (7B/13B) for Chinese medical tasks. One-stage training improves stability and generalization over the standard two-stage pipeline and yields state-of-the-art results on multiple Chinese medical benchmarks and on a fresh 2023 pharmacist exam, chosen to reduce the risk of data contamination.
Problem Statement
Two-stage domain adaptation (continued pre-training followed by supervised fine-tuning) introduces an optimization mismatch, two distribution shifts, and catastrophic forgetting. The pipeline is complex to tune and can weaken the model's ability to follow prompts. The authors ask: can a single-stage protocol unify heterogeneous pre-training and SFT data to inject domain knowledge more stably and effectively?
Main Contribution
A one-stage domain adaptation protocol that rewrites pre-training corpora into instruction–response pairs and trains all data together with a priority sampler.
HuatuoGPT-II, a Chinese medical model trained with the one-stage protocol (7B and 13B variants) with public code/data.
A fresh evaluation using the 2023 Chinese National Pharmacist Licensure Examination to reduce test-data leakage and test generalization.
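The data-unification step in the first contribution can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the prompt wording, the `unify` function, the two-call structure, and the `source` field are all hypothetical, and `llm` stands for any callable that maps a prompt string to a completion string.

```python
from typing import Callable, Dict

# Hypothetical prompt wording; the authors' actual prompts are not reproduced here.
QUESTION_PROMPT = (
    "Read the passage below and write one question that the passage can answer.\n\n"
    "Passage:\n{passage}\n\nQuestion:"
)
ANSWER_PROMPT = (
    "Answer the question using only the information in the passage.\n\n"
    "Passage:\n{passage}\n\nQuestion: {question}\n\nAnswer:"
)


def unify(passage: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Turn one raw passage into an instruction-response pair via two LLM calls."""
    # First call: derive an instruction (question) grounded in the passage.
    question = llm(QUESTION_PROMPT.format(passage=passage)).strip()
    # Second call: derive the response, constrained to the passage content.
    answer = llm(ANSWER_PROMPT.format(passage=passage, question=question)).strip()
    # "source" tags the pair as domain data so a sampler can treat it differently.
    return {"instruction": question, "response": answer, "source": "domain"}
```

Grounding the answer in the original passage is one way to keep the rewritten pair faithful to the corpus, which matters because poor unification can inject errors.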
Key Findings
One-stage training outperforms conventional two-stage adaptation across medical datasets
HuatuoGPT-II (13B) achieves strong medical benchmark scores
Experts judge HuatuoGPT-II competitive with GPT-4 on medical responses
Data unification is necessary for the one-stage benefit
Results
Average medical benchmark score
2023 Pharmacist Licensure Exam (total score)
Human expert pairwise favorable rate vs GPT-4
Automated pairwise win rate vs GPT-4 (GPT-4 judge)
Who Should Care
What To Try In 7 Days
Convert a small fraction of your domain corpus into instruction–response pairs and run mixed one-stage training.
Implement a priority sampler (β≈2) that starts with knowledge-dense data then shifts to SFT-style data.
Run a fresh, out-of-time evaluation set (like a recent exam) to check for data contamination.
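For the contamination check in the last step, a rough first pass is to measure how many evaluation items share a long character n-gram with the training corpus. The sketch below is an assumption about how such a check could look, not a method described in the source; the function names and the default `n` are hypothetical. Character n-grams are used so the check also works on Chinese text without a word segmenter.

```python
from typing import Iterable, Set


def char_ngrams(text: str, n: int) -> Set[str]:
    """Return all character n-grams of the text with whitespace removed."""
    chars = "".join(text.split())
    return {chars[i:i + n] for i in range(len(chars) - n + 1)}


def contamination_rate(test_items: Iterable[str], train_docs: Iterable[str], n: int = 13) -> float:
    """Fraction of test items that share at least one n-gram with the training docs."""
    train_grams: Set[str] = set()
    for doc in train_docs:
        train_grams |= char_ngrams(doc, n)
    items = list(test_items)
    if not items:
        return 0.0
    # An item is flagged if any of its n-grams also occurs in the training data.
    flagged = sum(1 for item in items if char_ngrams(item, n) & train_grams)
    return flagged / len(items)
```

A rate near zero on an exam released after the training data cutoff supports the claim that the evaluation is out-of-time; a high rate suggests the set should be replaced or filtered.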
Optimization Features
Token Efficiency
- 4096-token training sequences built by concatenating instruction examples
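Concatenating examples into fixed-length sequences is often done with greedy packing. The sketch below is one simple variant, assuming already-tokenized examples; the function name and the `eos_id` value are hypothetical, and it does not handle attention masking between packed examples.

```python
from typing import List


def pack_sequences(examples: List[List[int]], max_len: int = 4096, eos_id: int = 2) -> List[List[int]]:
    """Greedily concatenate tokenized examples into sequences of at most max_len tokens."""
    packed: List[List[int]] = []
    current: List[int] = []
    for ids in examples:
        # Truncate overlong examples so that example + EOS always fits in one sequence.
        ids = ids[: max_len - 1] + [eos_id]
        # Start a new sequence when the next example would overflow the current one.
        if current and len(current) + len(ids) > max_len:
            packed.append(current)
            current = []
        current.extend(ids)
    if current:
        packed.append(current)
    return packed
```

Packing reduces the number of padding tokens per batch, which is where the token-efficiency gain comes from.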
System Optimization
- ZeRO-based distributed training across 8 A100 GPUs
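A ZeRO setup of this kind is typically expressed as a DeepSpeed configuration. The helper below is a sketch only: the source does not state the ZeRO stage, precision, or batch sizes, so `stage: 2`, `bf16`, and the function name are assumptions to be replaced with your own settings. The keys shown are standard DeepSpeed configuration fields.

```python
def make_zero_config(micro_batch_per_gpu: int, grad_accum_steps: int, num_gpus: int = 8) -> dict:
    """Build a minimal DeepSpeed-style config dict for ZeRO training."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        # DeepSpeed expects: global batch = micro batch * accumulation steps * GPUs.
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * num_gpus,
        "bf16": {"enabled": True},            # assumed precision; A100s support bf16
        "zero_optimization": {"stage": 2},    # assumed stage; not given in the source
    }
```

The resulting dict can be written out as JSON and passed to a DeepSpeed launch in the usual way.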
Training Optimization
- One-stage unified training to reduce distribution shift
- Priority sampling (β scheduling) to order data exposure
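One way to get the described ordering, where knowledge-dense data dominates early and SFT-style data dominates later, is weighted sampling without replacement: domain items get weight β and SFT items weight 1, so the higher-priority pool is drawn from more often and depletes first. The sketch below uses Efraimidis-Spirakis keys to produce such an ordering. It illustrates the idea and is not necessarily the authors' sampler; the function name and the fixed two-pool weighting are assumptions.

```python
import random
from typing import List, Tuple


def priority_order(domain: List[str], sft: List[str], beta: float = 2.0, seed: int = 0) -> List[Tuple[str, str]]:
    """Return a single training order in which domain data tends to appear earlier."""
    rng = random.Random(seed)
    keyed = []
    for source, items, weight in (("domain", domain, beta), ("sft", sft, 1.0)):
        for item in items:
            # Efraimidis-Spirakis key: a larger weight pushes the key toward 1,
            # so the item tends to sort earlier in the descending order.
            key = rng.random() ** (1.0 / weight)
            keyed.append((key, source, item))
    # Sorting by key (descending) is equivalent to weighted sampling without replacement.
    keyed.sort(key=lambda t: t[0], reverse=True)
    return [(source, item) for _, source, item in keyed]
```

With β = 1 the order is a uniform shuffle of both pools; raising β front-loads domain data more aggressively, which is consistent with the warning that a mis-tuned β degrades results.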
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Not certified for clinical use; authors explicitly warn against using it for medical advice.
- Main evaluation and data focus on Chinese; English performance lags on English-native exams.
- Dependence on LLMs (e.g., ChatGPT) for data unification may raise provenance concerns, though authors provide self-unification experiments.
When Not To Use
- Do not deploy as an unsupervised clinical decision system.
- Avoid using it as the sole source for critical medical recommendations.
- If your domain data cannot be rewritten into faithful instruction–response pairs, the one-stage benefit may not appear.
Failure Modes
- Hallucinations in Chinese medical phrasing remain possible.
- Performance depends on quality of data unification; poor unification can inject errors.
- Mis-tuned priority β (too low or too high) degrades results.
Core Entities
Models
- HuatuoGPT-II (7B)
- HuatuoGPT-II (13B)
- Baichuan2-7B-Base
- Baichuan2-13B-Base
- Baichuan2-7B-Chat
- Baichuan2-13B-Chat
- Qwen-7B-Chat
- Qwen-14B-Chat
- ChatGLM3-6B
- HuatuoGPT
- DISC-MedLLM
- ChatGPT
- GPT-4
- ERNIE Bot
Metrics
- Accuracy
- win rate (pairwise)
- average score (exam total)
- automated quality scores (GPT-4 ratings)
Datasets
- Huatuo-26M
- MedQA
- MedMCQA
- CMB
- CMExam
- C-Eval
- CMMLU
- KUAKE-QIC
- Med-dialog
- ShareGPT
- 2023 Chinese Pharmacist Licensure Exam
- PubMed
- C4
Benchmarks
- MedQA
- MedMCQA
- CMB
- CMExam
- C-Eval (medical parts)
- CMMLU (medical parts)
- 2023 Pharmacist Licensure Examination

