ChiMed‑GPT: a 13B Chinese medical LLM trained with pretraining, SFT and RLHF for safer, better medical answers

November 10, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.45

Cost Impact Score

0.5

Citation Count

10

Authors

Yuanhe Tian, Ruyi Gan, Yan Song, Jiaxing Zhang, Yongdong Zhang


Why It Matters For Business

ChiMed‑GPT is a practical open-source Chinese medical LLM that gives clearer patient-facing answers, handles longer clinical text (4,096-token context), and produces fewer biased replies on mental-health attitude scales. That makes it useful for telemedicine, triage bots, and medical content generation.

Summary TLDR

ChiMed‑GPT is a 13B Chinese medical LLM built from Ziya-13B-v2 (4,096-token context) and trained with a full pipeline: continued pre-training on medical text, supervised fine-tuning on QA/dialogue data, and RLHF via rejection sampling. The team augmented a 4K-instance reward set with GPT-3.5/GPT-4 replies, trained a reward model on it, and used rejection sampling to align outputs. On Chinese medical benchmarks and human evaluations, ChiMed‑GPT outperforms many open-source medical and general models, beats GPT-4 on the reported open-ended QA (BLEU-1) and multi-turn dialogue (ROUGE-1) metrics while trailing it on multi-choice benchmarks, and shows lower bias on mental-health attitude scales.
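The reward-data step described above can be pictured as turning several answers to the same question into pairwise preferences for a reward model. This is a minimal sketch, assuming each question arrives with its answers already ordered best to worst; the ordering, the field names, and the `to_preference_pairs` helper are illustrative assumptions, not the paper's released code.

```python
from itertools import combinations
from typing import Dict, List


def to_preference_pairs(question: str, ranked_answers: List[str]) -> List[Dict[str, str]]:
    """Expand answers ranked best-to-worst into (chosen, rejected) pairs."""
    pairs = []
    # combinations() preserves input order, so `better` always outranks `worse`.
    for better, worse in combinations(ranked_answers, 2):
        pairs.append({"prompt": question, "chosen": better, "rejected": worse})
    return pairs


# Hypothetical example: three answers to one question, best first.
example = to_preference_pairs(
    "What should I do about a mild fever?",
    ["reference answer", "model A answer", "model B answer"],
)
for pair in example:
    print(pair["chosen"], ">", pair["rejected"])
```

Three ranked answers yield three pairs, which is how a small set of questions can produce finer-grained preference labels than a single good/bad flag per answer.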

Problem Statement

Most Chinese medical LLMs use only supervised fine-tuning, rely on limited data sources, and are limited to a 2,048-token context. That reduces how much domain knowledge they capture, weakens alignment with human preferences, and limits their handling of long clinical texts. The paper builds a domain model trained with pre-training, SFT, and RLHF to address these gaps.

Main Contribution

Built ChiMed‑GPT by continuing training from Ziya-13B-v2 and keeping its 4,096-token context for longer medical texts.

Applied a full training regime: continued pre‑training on CMD, extensive SFT on multiple QA/dialogue corpora, and RLHF via rejection sampling.

Created a reward dataset by augmenting CMD(Reward) with GPT-3.5/GPT-4 outputs to produce fine-grained preference labels.

Showed improved task performance (NER, QA, multi-turn dialogue) and lower bias on CAMI/MICA human-attitude scales.

Released code and model weights for community use.

Key Findings

Open-ended QA (BLEU-1): ChiMed‑GPT scored higher than GPT-4 on the tested dataset.

Numbers: BLEU-1 33.14 (ChiMed‑GPT) vs 24.29 (GPT-4)

Multi-turn dialogue metrics (ROUGE-1) strongly favor ChiMed‑GPT over GPT-4 on the tested medical dialogue set.

Numbers: ROUGE-1 43.43 (ChiMed‑GPT) vs 20.64 (GPT-4)

On multi-choice benchmarks (C-Eval, CMMLU), ChiMed‑GPT is competitive but below GPT-4.

Numbers: C-Eval accuracy 68.29 (ChiMed‑GPT) vs 71.29 (GPT-4); CMMLU accuracy 52.92 vs 69.55

Human ratings (50 sampled QA pairs): ChiMed‑GPT had the best average scores among the compared open-source models.

Numbers: QA human eval Fluency 2.57, Completeness 2.45, Precision 2.57 (scale 1–3)

Bias analysis: ChiMed‑GPT showed the lowest average bias among compared models on two clinician/public attitude scales.

Numbers: Lowest average scores on CAMI and MICA compared to listed models
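The BLEU-1 and ROUGE-1 figures above are both unigram-overlap scores, which helps when interpreting them: they reward wording that matches the reference answer, not clinical correctness. The sketch below shows one common way to compute them. It assumes character-level tokens (a frequent choice for Chinese), a single reference, and the F-measure form of ROUGE-1; the paper's exact tokenization and variant may differ.

```python
import math
from collections import Counter
from typing import List


def char_tokens(text: str) -> List[str]:
    """Character-level tokens; whitespace is dropped."""
    return [ch for ch in text if not ch.isspace()]


def unigram_overlap(candidate: List[str], reference: List[str]) -> int:
    """Clipped count of unigrams shared by candidate and reference."""
    return sum((Counter(candidate) & Counter(reference)).values())


def bleu1(candidate: str, reference: str) -> float:
    cand, ref = char_tokens(candidate), char_tokens(reference)
    if not cand:
        return 0.0
    precision = unigram_overlap(cand, ref) / len(cand)
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision


def rouge1_f(candidate: str, reference: str) -> float:
    cand, ref = char_tokens(candidate), char_tokens(reference)
    overlap = unigram_overlap(cand, ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "多喝水休息"
candidate = "多喝水"
print(f"BLEU-1:  {100 * bleu1(candidate, reference):.2f}")
print(f"ROUGE-1: {100 * rouge1_f(candidate, reference):.2f}")
```

Scores are multiplied by 100 here to match the scale used in the reported numbers.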

Results

NER F1 (CCKS-2019, five-shot)

Value: 40.82 (ChiMed‑GPT)

Baseline: 41.37 (GPT-4)

NER F1 (ChiMST, five-shot)

Value: 41.04 (ChiMed‑GPT)

Baseline: 41.25 (GPT-4)

Accuracy (C-Eval)

Value: 68.29 (ChiMed‑GPT)

Baseline: 71.29 (GPT-4)

Open-ended QA (BLEU-1, ChiMed dataset, zero-shot)

Value: 33.14 (ChiMed‑GPT)

Baseline: 24.29 (GPT-4)

Multi-turn dialogue (ROUGE-1, zero-shot)

Value: 43.43 (ChiMed‑GPT)

Baseline: 20.64 (GPT-4)

Human eval (QA) — Precision (1–3)

Value: 2.57 (ChiMed‑GPT)

Baseline: 2.3 (MedicalGPT Z) / 2.0+ (others)

Bias (average CAMI/MICA)

Value: Lowest among compared models on CAMI and MICA

Baseline: Higher scores for GPT-4, GPT-3.5, Ziya, Baichuan, Taiyi, MedicalGPT

Who Should Care

What To Try In 7 Days

Run the released model on real FAQ or triage dialogs to compare answer clarity vs your current system.

Validate the model with your own medical QA pairs and check multi-turn behavior on long patient histories.

Use the provided SFT + RLHF recipe (reward augmentation + rejection sampling) on a small internal dataset to align outputs to your style and safety rules.
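When testing multi-turn behavior on long patient histories, the conversation still has to fit the 4,096-token window. One simple approach is to keep the most recent turns that fit a budget. The sketch below is an illustration only: `fit_history` is a hypothetical helper, the 512-token reserve for the answer is an arbitrary assumption, and the default `len` counter (characters) is a rough proxy that should be replaced with the model's real tokenizer.

```python
from typing import Callable, List


def fit_history(
    turns: List[str],
    max_tokens: int = 4096,
    reserve_for_answer: int = 512,
    count_tokens: Callable[[str], int] = len,
) -> List[str]:
    """Keep the most recent turns whose total token count fits the budget."""
    budget = max_tokens - reserve_for_answer
    kept: List[str] = []
    used = 0
    # Walk backwards so the newest turns are kept first.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    # Restore chronological order for the prompt.
    return list(reversed(kept))


history = ["Patient: long intake notes " * 200, "Doctor: any fever?", "Patient: yes, since Monday."]
trimmed = fit_history(history)
print(len(history), "turns ->", len(trimmed), "turns kept")
```

Dropping the oldest turns is the simplest policy; summarizing them instead is an alternative worth comparing on your own dialogs.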

Agent Features

Memory

  • extended context length (4,096 tokens)

Architectures

  • Transformer decoder (13B)

Optimization Features

Token Efficiency

  • longer context (4,096 tokens) for longer documents

Training Optimization

  • bf16 mixed-precision
  • ZeRO optimizer sharding
  • flash-attention
  • tensor parallelism (Megatron-LM)
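The training optimizations listed above matter because the training state of a 13B model is far larger than a single accelerator's memory. A back-of-the-envelope sketch follows, assuming mixed-precision Adam with 2 bytes per parameter for bf16 weights, 2 for bf16 gradients, and 12 for fp32 optimizer states. The ZeRO stage and device count are not given in this summary, so both are free parameters here, and activations and tensor-parallel splitting are ignored.

```python
def model_state_gib(num_params: float, num_devices: int, stage: int) -> float:
    """Approximate per-device memory (GiB) for model states under ZeRO stages 1-3."""
    gib = 1024 ** 3
    bytes_per_param = {"weights": 2.0, "gradients": 2.0, "optimizer": 12.0}
    # Which state types each ZeRO stage partitions across devices.
    sharded_by_stage = {
        1: {"optimizer"},
        2: {"optimizer", "gradients"},
        3: {"optimizer", "gradients", "weights"},
    }
    sharded = sharded_by_stage[stage]
    total_bytes = 0.0
    for name, nbytes in bytes_per_param.items():
        divisor = num_devices if name in sharded else 1
        total_bytes += num_params * nbytes / divisor
    return total_bytes / gib


# Hypothetical setup: 13B parameters on 8 devices.
for stage in (1, 2, 3):
    print(f"ZeRO stage {stage}: {model_state_gib(13e9, 8, stage):.1f} GiB per device")
```

With a single device the function reduces to 16 bytes per parameter regardless of stage, which is the unsharded baseline the sharding is meant to avoid.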

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Multi-choice exam accuracy still lags behind GPT-4 on some datasets (C-Eval/CMMLU).
  • Human evaluation used small samples (50 QA, 50 dialogues) — larger user studies are needed.
  • Reward set is small (4K instances) and augmented automatically; alignment quality depends on that augmentation.

When Not To Use

  • When you need exam-level multi-choice accuracy comparable to GPT-4 on all medical benchmarks.
  • When legal/regulatory guarantees or certified clinical-grade performance are required.
  • When your application requires retrieval-augmented or evidence-grounded citations (the model is not retrieval-augmented).

Failure Modes

  • May hallucinate facts or produce incorrect medical suggestions; always verify with clinicians.
  • May still underperform on highly specialized sub-domains not covered in CMD or SFT data.
  • Alignment via rejection sampling can prefer fluent answers that are not clinically correct.

Core Entities

Models

  • ChiMed-GPT
  • Ziya-13B-v2
  • GPT-4
  • GPT-3.5-Turbo
  • Ziya-v1
  • Ziya-v2
  • Baichuan
  • Taiyi
  • MedicalGPT
  • BenTsao

Metrics

  • Accuracy
  • F1
  • BLEU
  • ROUGE
  • Human eval (fluency, completeness, precision)
  • CAMI bias score
  • MICA bias score

Datasets

  • CMD (Pre-train)
  • SFT
  • CMD (Reward)
  • ChiMed
  • CCKS-2019
  • ChiMST
  • C-Eval (medical subsets)
  • CMMLU (medical subsets)
  • MedQA (Chinese subset)
  • MC (multi-turn dialogue dataset)
  • MedDialog
  • Safety-Prompts

Benchmarks

  • NER (CCKS-2019, ChiMST)
  • Multi-choice QA (C-Eval, CMMLU, MedQA)
  • Open-ended QA (ChiMed)
  • Multi-turn dialogue (MC)