ChiMed‑GPT: a 13B Chinese medical LLM trained with pretraining, SFT and RLHF for safer, better medical answers

November 10, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper provides clear training recipes, public code/model, and multiple benchmark results; however, evaluation focuses on Chinese datasets and limited human eval samples, so deploy cautiously and validate on your data.

Citations: 10

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 7/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 45%

Authors

Yuanhe Tian, Ruyi Gan, Yan Song, Jiaxing Zhang, Yongdong Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ChiMed‑GPT is an openly released Chinese medical LLM that scores well on open-ended QA and dialogue benchmarks, handles longer clinical text (4,096 tokens), and shows lower bias on mental-health attitude scales. That makes it a candidate for telemedicine, triage bots, and medical content generation.

Who Should Care

Summary TLDR

ChiMed‑GPT is a 13B Chinese medical LLM built from Ziya-13B-v2 (4,096-token context) and trained with a full pipeline: continued pre-training on medical text, supervised fine-tuning on QA/dialogue, and RLHF via rejection sampling. The team augmented a 4K reward set with GPT-3.5/4 replies, trained a reward model, and used rejection sampling to align outputs. On Chinese medical benchmarks and human evaluations, ChiMed‑GPT outperforms many open-source medical and general models, scores above GPT-4 on the reported open-ended QA and dialogue generation metrics, and shows lower bias on mental-health attitude scales.

Problem Statement

Most Chinese medical LLMs use only supervised fine-tuning, rely on limited data sources, and cap context at 2,048 tokens. This limits how much domain knowledge they capture, weakens alignment with human preferences, and makes long clinical texts hard to handle. The paper builds a domain model trained with pre-training, SFT, and RLHF to address these gaps.

Main Contribution

Built ChiMed‑GPT by continuing training from Ziya-13B-v2, keeping its 4,096-token context for longer medical texts.

Applied a full training regime: continued pre‑training on CMD, extensive SFT on multiple QA/dialogue corpora, and RLHF via rejection sampling.
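The rejection-sampling step can be pictured as a best-of-n selection: draw several candidate replies per prompt, score each with the reward model, and keep the top one for further fine-tuning. The following is a minimal sketch of that general idea, not the authors' implementation. The names `rejection_sample`, `build_alignment_set`, `toy_generate`, and `toy_reward` are hypothetical, and the toy stand-ins would be replaced by the policy model and the trained reward model.

```python
from itertools import cycle
from typing import Callable, List, Tuple


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    num_candidates: int = 4,
) -> Tuple[str, float]:
    """Sample several replies and return the one with the highest reward."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [generate(prompt) for _ in range(num_candidates)]
    scores = [reward(prompt, reply) for reply in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


def build_alignment_set(
    prompts: List[str],
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    num_candidates: int = 4,
) -> List[Tuple[str, str]]:
    """Pair each prompt with its best-scoring reply for a later fine-tuning pass."""
    return [
        (p, rejection_sample(p, generate, reward, num_candidates)[0])
        for p in prompts
    ]


# Toy stand-ins, for illustration only (assumptions, not from the paper).
_canned = cycle([
    "Take antibiotics.",
    "Rest, drink fluids, and see a doctor if the fever lasts over three days.",
])


def toy_generate(prompt: str) -> str:
    return next(_canned)


def toy_reward(prompt: str, reply: str) -> float:
    # Crude proxy: prefer replies that mention seeing a doctor.
    return 1.0 if "doctor" in reply else 0.0


best_reply, best_score = rejection_sample(
    "I have a fever.", toy_generate, toy_reward, num_candidates=2
)
print(best_reply, best_score)
```

Raising `num_candidates` trades more generation cost for a better chance that at least one candidate scores well.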

Key Findings

Open-ended QA (BLEU-1): ChiMed‑GPT scored higher than GPT-4 on the tested dataset.

Numbers: BLEU-1 33.14 (ChiMed‑GPT) vs 24.29 (GPT-4)

Practical Use: Use ChiMed‑GPT for free-text Chinese medical answers; on the evaluated set its open-ended answers overlap more closely with the reference answers than GPT-4's.

Evidence Ref: Table 5 (open-ended QA / ChiMed)
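For readers unfamiliar with the metric, BLEU-1 is clipped unigram precision multiplied by a brevity penalty. The sketch below computes it for one candidate against one reference. It assumes character-level tokenization for Chinese and a 0 to 1 scale (the figures above appear to be on a 0 to 100 scale); the paper's exact tokenization and scoring tool are not stated in this summary, and `bleu1` is a hypothetical helper name.

```python
import math
from collections import Counter
from typing import List


def bleu1(candidate: List[str], reference: List[str]) -> float:
    """Unigram BLEU: clipped unigram precision times the brevity penalty."""
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Each candidate token counts at most as often as it appears in the reference.
    clipped = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty: candidates shorter than the reference are scaled down.
    if len(candidate) >= len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * precision


# Character-level tokens (an assumption for Chinese text).
candidate = list("多喝水休息")
reference = list("多喝水多休息")
print(bleu1(candidate, reference))
```

Here every candidate character appears in the reference, so precision is 1.0 and the score is reduced only by the brevity penalty for the shorter answer.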

Multi-turn dialogue metrics (ROUGE-1) strongly favor ChiMed‑GPT over GPT-4 on the tested medical dialogue set.

Numbers: ROUGE-1 43.43 (ChiMed‑GPT) vs 20.64 (GPT-4)

Practical Use: For medical chatbots, ChiMed‑GPT's responses track the reference replies more closely than GPT-4's on the evaluated Chinese dialogue dataset.

Evidence Ref: Table 8 (multi-turn dialogue)
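ROUGE-1 measures unigram overlap between a generated reply and a reference reply. The sketch below returns precision, recall, and F1; which of these variants the paper reports is not stated in this summary, and the character-level tokenization and the helper name `rouge1` are assumptions.

```python
from collections import Counter
from typing import Dict, List


def rouge1(candidate: List[str], reference: List[str]) -> Dict[str, float]:
    """Unigram-overlap precision, recall, and F1 between two token lists."""
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    precision = overlap / len(candidate) if candidate else 0.0
    recall = overlap / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1(list("建议多休息多喝水"), list("建议多喝水并休息"))
print(scores)
```

Because the metric ignores word order and meaning, a high score indicates similar wording to the reference, not clinical correctness.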

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
NER F1 (CCKS-2019, five-shot) | 40.82 (ChiMed‑GPT) | 41.37 (GPT-4) | -0.55 | CCKS-2019 | Table 4 reports five-shot F1 | Table 4
NER F1 (ChiMST, five-shot) | 41.04 (ChiMed‑GPT) | 41.25 (GPT-4) | -0.21 | ChiMST | Table 4 five-shot F1 | Table 4
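The Delta column is the ChiMed‑GPT value minus the GPT-4 baseline, so a negative delta means ChiMed‑GPT trails GPT-4. A short check of the two rows above, using only the numbers in the table (the variable and function names are illustrative):

```python
# (metric, ChiMed-GPT value, GPT-4 baseline), copied from the results table.
rows = [
    ("NER F1 (CCKS-2019, five-shot)", 40.82, 41.37),
    ("NER F1 (ChiMST, five-shot)", 41.04, 41.25),
]


def delta(value: float, baseline: float) -> float:
    """Value minus baseline, rounded to the table's two decimal places."""
    return round(value - baseline, 2)


deltas = [delta(value, baseline) for _, value, baseline in rows]
print(deltas)
```

Both gaps are under one F1 point, so on five-shot NER the two models are close, with GPT-4 slightly ahead.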

What To Try In 7 Days

Run the released model on real FAQ or triage dialogs to compare answer clarity vs your current system.

Validate the model with your own medical QA pairs and check multi-turn behavior on long patient histories.

Use the provided SFT + RLHF recipe (reward augmentation + rejection sampling) on a small internal dataset to align outputs to your style and safety rules.
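When testing multi-turn behavior on long patient histories, the conversation has to fit inside the 4,096-token window. One simple approach, sketched below, keeps the most recent turns that fit after reserving room for the reply. The function name `fit_history`, the 512-token reserve, and the use of `len` as a token counter are all assumptions for illustration; a real test would count tokens with the model's own tokenizer.

```python
from typing import Callable, List


def fit_history(
    turns: List[str],
    max_tokens: int = 4096,
    reserved_for_reply: int = 512,
    count_tokens: Callable[[str], int] = len,
) -> List[str]:
    """Keep the most recent turns whose total token count fits the budget."""
    budget = max_tokens - reserved_for_reply
    kept: List[str] = []
    used = 0
    # Walk backwards so the newest turns are kept first.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 3000, "b" * 2000, "c" * 1000]
trimmed = fit_history(history)
print([len(t) for t in trimmed])
```

Dropping the oldest turns first is a design choice; for clinical histories it may be safer to summarize early turns instead, since they can contain key facts such as allergies.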

Agent Features

Memory
extended context length 4,096 tokens
Architectures
Transformer decoder (13B)

Optimization Features

Token Efficiency
longer context (4096 tokens) for longer documents
Training Optimization
bf16 mixed-precision, ZeRO optimizer sharding, flash-attention, tensor parallelism (Megatron-LM)
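As a rough illustration of how two of these options are commonly switched on, the sketch below builds a DeepSpeed-style configuration with bf16 and ZeRO sharding enabled. This assumes a DeepSpeed-based setup, which this summary does not confirm; the ZeRO stage, batch size, and accumulation steps are placeholders, not the paper's settings. Flash-attention and tensor parallelism are configured elsewhere and are not shown.

```python
import json

# Placeholder values throughout; only the key names follow DeepSpeed's config schema.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder
    "gradient_accumulation_steps": 8,     # placeholder
    "bf16": {"enabled": True},            # bf16 mixed precision
    "zero_optimization": {"stage": 1},    # ZeRO sharding; stage is a placeholder
}

config_json = json.dumps(ds_config, indent=2)
print(config_json)
```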

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Multi-choice exam accuracy still lags behind GPT-4 on some datasets (C-Eval/CMMLU).

Human evaluation used small samples (50 QA, 50 dialogues) — larger user studies are needed.
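To see why 50 samples is a thin basis, it helps to put a confidence interval around any rate measured on them. The sketch below uses the Wilson score interval for a proportion. The 35-out-of-50 figure is a made-up example, not a result from the paper, and `wilson_interval` is a hypothetical helper name.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical: raters prefer the model's answer in 35 of 50 cases.
low, high = wilson_interval(35, 50)
print(round(low, 3), round(high, 3))
```

With these made-up numbers the interval runs from roughly 56% to 81%, a spread of about 25 points, which is why a larger study is needed before drawing firm conclusions.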

When Not To Use

When you need exam-level multi-choice accuracy comparable to GPT-4 on all medical benchmarks.

When legal/regulatory guarantees or certified clinical-grade performance are required.

Failure Modes

May hallucinate facts or produce incorrect medical suggestions; always verify with clinicians.

May still underperform on highly specialized sub-domains not covered in CMD or SFT data.

Core Entities

Models

ChiMed-GPT; Ziya-13B-v2; GPT-4; GPT-3.5-Turbo; Ziya-v1; Ziya-v2; Baichuan; Taiyi; MedicalGPT; BenTsao

Metrics

Accuracy; F1; BLEU; ROUGE; Human eval (fluency, completeness, precision); CAMI bias score; MICA bias score

Datasets

CMD (Pre-train); SFT; CMD (Reward); ChiMed; CCKS-2019; ChiMST; C-Eval (medical subsets); CMMLU (medical subsets); MedQA (Chinese subset); MC (multi-choice? dataset); MedDialog; Safety-Prompts

Benchmarks

NER (CCKS-2019, ChiMST); Multi-choice QA (C-Eval, CMMLU, MedQA); Open-ended QA (ChiMed); Multi-turn dialogue (MC)