One-stage domain adaptation: turn varied medical corpora into instruction–response pairs and train in a single pass

Overview

Decision Snapshot: Ready For Pilot

The method reduces pipeline complexity and shows consistent benchmark and expert-evaluation gains, but the model is still not safe for autonomous medical advice; apply with clinician oversight.

Citations: 24

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Junying Chen, Xidong Wang, Ke Ji, Anningzhe Gao, Feng Jiang, Shunian Chen, Hongbo Zhang, Dingjie Song, Wenya Xie, Chuyi Kong, Jianquan Li, Xiang Wan, Haizhou Li, Benyou Wang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

One-stage adaptation simplifies the pipeline and removes the costly second tuning stage while still delivering strong domain performance, so teams can build competitive medical models faster and with less stage-specific hyperparameter work.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

The authors propose a one-stage domain adaptation protocol that converts diverse medical pre-training corpora into unified instruction–response pairs and trains domain and instruction data together with a priority sampler. Using this recipe they release HuatuoGPT-II (7B/13B) for Chinese medical tasks. One-stage training improves stability and generalization versus the standard two-stage pipeline and yields state-of-the-art results on multiple Chinese medical benchmarks and a fresh 2023 pharmacist exam used to reduce data contamination risks.

Problem Statement

Two-stage domain adaptation (continued pre-training followed by supervised fine-tuning, SFT) introduces an optimization mismatch between the stages, two successive distribution shifts, and a risk of catastrophic forgetting. The pipeline is complex to tune and can weaken the model's ability to follow prompts. The authors ask: can a single-stage protocol unify heterogeneous pre-training and SFT data to inject domain knowledge more stably and effectively?

Main Contribution

A one-stage domain adaptation protocol that rewrites pre-training corpora into instruction–response pairs and trains all data together with a priority sampler.

HuatuoGPT-II, a Chinese medical model trained with the one-stage protocol (7B and 13B variants) with public code/data.
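The data-unification step in the first contribution can be sketched as a two-call rewrite: ask a language model for a question that a passage answers, then ask it for the answer grounded in that passage. This is a minimal illustration, not the authors' implementation. The prompt wording, the function name `unify_passage`, and the `generate` callable (any function that maps a prompt string to a completion string) are all assumptions.

```python
from typing import Callable, Dict

# Hypothetical prompt templates; the wording is illustrative only.
QUESTION_PROMPT = (
    "Read the medical text below and write one question that the text answers.\n\n"
    "Text:\n{passage}\n\nQuestion:"
)
ANSWER_PROMPT = (
    "Answer the question using only the medical text below.\n\n"
    "Text:\n{passage}\n\nQuestion: {question}\n\nAnswer:"
)


def unify_passage(passage: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Rewrite one raw passage into an instruction-response pair.

    `generate` is a placeholder for whatever LLM client is in use.
    """
    # First call: derive a question the passage can answer.
    question = generate(QUESTION_PROMPT.format(passage=passage)).strip()
    # Second call: answer that question, grounded in the same passage.
    answer = generate(
        ANSWER_PROMPT.format(passage=passage, question=question)
    ).strip()
    return {"instruction": question, "response": answer}
```

Keeping the passage in the answer prompt is a deliberate choice in this sketch: it ties the response to the source text, which matters because the summary lists poor unification as a failure mode that can inject errors.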

Key Findings

One-stage training outperforms conventional two-stage adaptation across medical datasets

Numbers: 5.3%–23% relative gains on six datasets (one-stage vs two-stage)

Practical Use: Convert pre-training text to instruction–response pairs and train everything together to get measurably better medical performance and more stable training.

Evidence Ref: Section 4.4 / Fig. 6

HuatuoGPT-II (13B) achieves strong medical benchmark scores

Numbers: 58.47 average on selected Chinese medical benchmarks (HuatuoGPT-II 13B)

Practical Use: A medium-sized model trained with one-stage adaptation can reach near-top performance on Chinese medical tasks; consider this pipeline instead of only scaling model size.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average medical benchmark score	58.47	Baichuan2-13B-Chat 49.77	+8.7	Average across MedQA, MedMCQA, CMB, CMExam, CMMLU (medical), C-Eval (medical)	Table 1: HuatuoGPT-II (13B) average 58.47 vs Baichuan2-13B-Chat 49.77	Table 1
2023 Pharmacist Licensure Exam (total score)	52.9	GPT-4 57.3	-4.4	2023 Pharmacist Licensure Examination (Pharmacy track)	Table 2: HuatuoGPT-II (13B) total 52.9, GPT-4 57.3	Table 2
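The Delta column above is the absolute difference between the model's score and the baseline. The short sketch below recomputes it and also derives the relative change. The relative percentages are not reported in this summary; they are derived here only for illustration, and the helper name `delta_and_relative` is hypothetical.

```python
def delta_and_relative(score: float, baseline: float) -> tuple:
    """Return (absolute delta, relative change in percent) for a score vs a baseline."""
    delta = score - baseline
    return round(delta, 2), round(100 * delta / baseline, 1)


# HuatuoGPT-II (13B) vs Baichuan2-13B-Chat, average benchmark score (Table 1)
print(delta_and_relative(58.47, 49.77))  # (8.7, 17.5)

# HuatuoGPT-II (13B) vs GPT-4, 2023 Pharmacist Licensure Exam total (Table 2)
print(delta_and_relative(52.9, 57.3))    # (-4.4, -7.7)
```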

What To Try In 7 Days

Convert a small fraction of your domain corpus into instruction–response pairs and run mixed one-stage training.

Implement a priority sampler (β≈2) that starts with knowledge-dense data then shifts to SFT-style data.

Run a fresh, out-of-time evaluation set (like a recent exam) to check data contamination.
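For the contamination check in the last step, one simple complement to an out-of-time test set is to look for long character n-grams shared between evaluation items and the training corpus. The sketch below is an assumption-laden illustration, not a procedure described in the paper. It uses character n-grams so that it also works for Chinese text without word segmentation, and the window size `n=13` and the function names are arbitrary choices.

```python
from typing import Iterable, List, Set


def char_ngrams(text: str, n: int = 13) -> Set[str]:
    """Character n-grams of the text with all whitespace removed."""
    compact = "".join(text.split())
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def contamination_rate(eval_items: List[str],
                       training_docs: Iterable[str],
                       n: int = 13) -> float:
    """Fraction of eval items sharing at least one n-gram with the training docs."""
    train_grams: Set[str] = set()
    for doc in training_docs:
        train_grams |= char_ngrams(doc, n)
    if not eval_items:
        return 0.0
    flagged = [item for item in eval_items if char_ngrams(item, n) & train_grams]
    return len(flagged) / len(eval_items)
```

A rate near zero does not prove the evaluation set is clean, since paraphrased leakage is not detected, which is why a test set dated after the training data cutoff remains the stronger check.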

Optimization Features

Token Efficiency

4096-token training sequences built by concatenating instruction examples
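Packing several short instruction examples into one fixed-length sequence can be sketched with a greedy first-fit pass over already-tokenized examples. This is a minimal sketch under assumptions: the summary does not say how the authors pack, separate, or mask examples, and truncating over-long examples is a simplification chosen here.

```python
from typing import List


def pack_examples(token_lists: List[List[int]], max_len: int = 4096) -> List[List[int]]:
    """Greedily concatenate tokenized examples into sequences of at most max_len tokens."""
    packed: List[List[int]] = []
    current: List[int] = []
    for tokens in token_lists:
        tokens = tokens[:max_len]  # assumption: truncate examples longer than one sequence
        # Start a new sequence when the next example would overflow the current one.
        if current and len(current) + len(tokens) > max_len:
            packed.append(current)
            current = []
        current = current + tokens
    if current:
        packed.append(current)
    return packed


# Three short examples packed with a small limit for readability.
print(pack_examples([[1, 1, 1], [2, 2, 2], [3, 3]], max_len=6))
# [[1, 1, 1, 2, 2, 2], [3, 3]]
```

In practice an end-of-sequence token between examples and an attention or loss mask are usually needed so that examples do not leak into each other; those details are omitted here.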

System Optimization

ZeRO-based distributed training across 8 A100 GPUs
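As a rough illustration of what a ZeRO setup on 8 GPUs might look like, the sketch below builds a DeepSpeed-style configuration dictionary and computes the resulting global batch size. Everything beyond "ZeRO" and "8 A100 GPUs" is an assumption: the summary does not name DeepSpeed, the ZeRO stage, the precision, or the batch sizes, so the values here are placeholders.

```python
import json

NUM_GPUS = 8  # from the summary: 8 A100 GPUs

# Assumed values; only the use of ZeRO itself comes from the summary.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # assumption
    "gradient_accumulation_steps": 16,    # assumption
    "bf16": {"enabled": True},            # assumption
    "zero_optimization": {"stage": 2},    # assumption: stage is not stated
}


def global_batch_size(config: dict, num_gpus: int) -> int:
    """Sequences consumed per optimizer step across all GPUs."""
    return (config["train_micro_batch_size_per_gpu"]
            * config["gradient_accumulation_steps"]
            * num_gpus)


print(json.dumps(ds_config, indent=2))
print(global_batch_size(ds_config, NUM_GPUS))  # 128
```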

Training Optimization

One-stage unified training to reduce distribution shift

Priority sampling (β scheduling) to order data exposure
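One way to realize priority sampling is weighted sampling without replacement, where each knowledge-dense (rewritten pre-training) example carries weight β and each SFT example carries weight 1. The sketch below assumes that formulation; the exact definition used by the authors may differ, and the function and pool names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def priority_order(domain_items: Sequence, sft_items: Sequence,
                   beta: float = 2.0, seed: int = 0) -> List[Tuple[str, object]]:
    """Order all examples by weighted sampling without replacement.

    Each remaining domain example has weight `beta`, each remaining SFT
    example has weight 1, so domain data dominates early when beta > 1.
    """
    rng = random.Random(seed)
    remaining = {"domain": list(domain_items), "sft": list(sft_items)}
    for pool in remaining.values():
        rng.shuffle(pool)  # uniform choice within a pool via shuffle + pop
    order = []
    while remaining["domain"] or remaining["sft"]:
        mass_domain = beta * len(remaining["domain"])
        mass_sft = 1.0 * len(remaining["sft"])
        # Pick a pool in proportion to its total remaining weight.
        pick_domain = rng.random() * (mass_domain + mass_sft) < mass_domain
        tag = "domain" if pick_domain else "sft"
        order.append((tag, remaining[tag].pop()))
    return order
```

With β > 1 the domain pool tends to be drawn down first, so the mix drifts toward SFT-style data late in training without a hard boundary between stages. With β = 1 this reduces to a uniform shuffle of all examples.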

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/FreedomIntelligence/HuatuoGPT-II

Data URLs

https://github.com/FreedomIntelligence/HuatuoGPT-II

Risks & Boundaries

Limitations

Not certified for clinical use; authors explicitly warn against using it for medical advice.

Main evaluation and data focus on Chinese; English performance lags on English-native exams.

When Not To Use

Do not deploy as an unsupervised clinical decision system.

Avoid using it as the sole source for critical medical recommendations.

Failure Modes

Hallucinations in Chinese medical phrasing remain possible.

Performance depends on quality of data unification; poor unification can inject errors.

Core Entities

Models

HuatuoGPT-II (7B), HuatuoGPT-II (13B), Baichuan2-7B-Base, Baichuan2-13B-Base, Baichuan2-7B-Chat, Baichuan2-13B-Chat, Qwen-7B-Chat, Qwen-14B-Chat, ChatGLM3-6B, HuatuoGPT, DISC-MedLLM, ChatGPT, GPT-4, ERNIE Bot

Metrics

Accuracy, win rate (pairwise), average score (exam total), automated quality scores (GPT-4 ratings)

Datasets

Huatuo-26M, MedQA, MedMCQA, CMB, CMExam, C-Eval, CMMLU, KUAKE-QIC, Med-dialog, ShareGPT, 2023 Chinese Pharmacist Licensure Exam, PubMed, C4

Benchmarks

MedQA, MedMCQA, CMB, CMExam, C-Eval (medical parts), CMMLU (medical parts), 2023 Pharmacist Licensure Examination
