Overview
Production Readiness
0.6
Novelty Score
0.45
Cost Impact Score
0.5
Citation Count
10
Why It Matters For Business
ChiMed‑GPT is an open-source Chinese medical LLM that gives clearer patient-facing answers, handles longer clinical text (a 4,096-token context), and produces fewer biased replies on mental-health attitude scales. That makes it a candidate for telemedicine, triage bots, and medical content generation.
Summary TLDR
ChiMed‑GPT is a 13B Chinese medical LLM built from Ziya-13B-v2 (4,096-token context) and trained with a full pipeline: continued pre-training on medical text, supervised fine-tuning (SFT) on QA/dialogue data, and RLHF via rejection sampling. The team augmented a 4K-instance reward set with GPT-3.5/GPT-4 replies, trained a reward model on it, and used rejection sampling to align outputs. On Chinese medical benchmarks and human evaluations, ChiMed‑GPT outperforms many open-source medical and general models, scores above GPT-4 on the reported open-ended QA (BLEU-1) and multi-turn dialogue (ROUGE-1) metrics, and shows lower bias on mental-health attitude scales.
Problem Statement
Most Chinese medical LLMs use only supervised fine-tuning, rely on limited data sources, and are restricted to a 2,048-token context. This reduces how much domain knowledge they capture, weakens alignment with human preferences, and limits their handling of long clinical texts. The paper builds a domain model trained with continued pre-training, SFT, and RLHF to address these gaps.
Main Contribution
Built ChiMed‑GPT by continuing training from Ziya-13B-v2, keeping its 4,096-token context for longer medical texts.
Applied a full training regime: continued pre‑training on CMD, extensive SFT on multiple QA/dialogue corpora, and RLHF via rejection sampling.
Created a reward dataset by augmenting CMD(Reward) with GPT-3.5/GPT-4 outputs to produce fine-grained preference labels.
Showed improved task performance (NER, QA, multi-turn dialogue) and lower bias on CAMI/MICA human-attitude scales.
Released code and model weights for community use.
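The rejection-sampling step named above can be sketched as best-of-N selection: draw several candidate responses, score each with the reward model, and keep the top one as a further fine-tuning target. This is a minimal sketch, not the authors' implementation. The names `rejection_sample`, `build_alignment_set`, `generate`, and `reward`, and the default of four candidates, are hypothetical placeholders for whatever model and reward-model wrappers you use.

```python
from typing import Callable, List, Tuple


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],        # hypothetical: samples one response
    reward: Callable[[str, str], float],   # hypothetical: scores (prompt, response)
    num_candidates: int = 4,               # assumed value, not taken from the paper
) -> Tuple[str, float]:
    """Sample several responses and return the one with the highest reward."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [generate(prompt) for _ in range(num_candidates)]
    scored = [(reward(prompt, c), c) for c in candidates]
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    return best_response, best_score


def build_alignment_set(
    prompts: List[str],
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    num_candidates: int = 4,
) -> List[Tuple[str, str]]:
    """Collect (prompt, best response) pairs to fine-tune on afterwards."""
    return [
        (p, rejection_sample(p, generate, reward, num_candidates)[0])
        for p in prompts
    ]
```

Because the kept response is whatever the reward model prefers, any blind spot in the reward model carries straight into the aligned model.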
Key Findings
Open-ended QA (BLEU-1): ChiMed‑GPT scored higher than GPT-4 on the tested dataset.
Multi-turn dialogue metrics (ROUGE-1) strongly favor ChiMed‑GPT over GPT-4 on the tested medical dialogue set.
On multi-choice benchmarks (C-Eval, CMMLU), ChiMed‑GPT is competitive but below GPT-4.
Human ratings (50 sampled QA pairs): ChiMed‑GPT had the best averaged scores among compared open-source models.
Bias analysis: ChiMed‑GPT showed the lowest average bias among compared models on two clinician/public attitude scales.
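To make the BLEU-1 and ROUGE-1 comparisons concrete, here is a minimal sketch of both metrics for a single candidate/reference pair. It assumes character-level tokens, a common choice for Chinese text, and reports ROUGE-1 as F1. The paper's exact tokenization and ROUGE variant are not stated in this summary, so scores from this sketch are not directly comparable to the published numbers.

```python
import math
from collections import Counter
from typing import List


def char_tokens(text: str) -> List[str]:
    """Character-level tokens with whitespace dropped (an assumption)."""
    return [ch for ch in text if not ch.isspace()]


def unigram_overlap(cand: List[str], ref: List[str]) -> int:
    """Clipped unigram matches: each reference token can be matched once."""
    return sum((Counter(cand) & Counter(ref)).values())


def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision multiplied by the BLEU brevity penalty."""
    cand, ref = char_tokens(candidate), char_tokens(reference)
    if not cand or not ref:
        return 0.0
    precision = unigram_overlap(cand, ref) / len(cand)
    # Penalize candidates that are shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision


def rouge1_f(candidate: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall."""
    cand, ref = char_tokens(candidate), char_tokens(reference)
    if not cand or not ref:
        return 0.0
    overlap = unigram_overlap(cand, ref)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)
```

Both metrics measure surface overlap with a reference answer, so a higher score than GPT-4 indicates closer wording to the references, not necessarily better clinical accuracy.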
Results
NER F1 (CCKS-2019, five-shot)
NER F1 (ChiMST, five-shot)
Accuracy
Open-ended QA (BLEU-1 / ChiMed, zero-shot)
Multi-turn dialogue (ROUGE-1, zero-shot)
Human eval (QA) — Precision (1–3)
Bias (average CAMI/MICA)
Who Should Care
What To Try In 7 Days
Run the released model on real FAQ or triage dialogs to compare answer clarity vs your current system.
Validate the model with your own medical QA pairs and check multi-turn behavior on long patient histories.
Use the provided SFT + RLHF recipe (reward augmentation + rejection sampling) on a small internal dataset to align outputs to your style and safety rules.
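For the reward-augmentation part of the recipe, one plausible way to turn several ranked answers to the same question into reward-model training data is to emit every (chosen, rejected) pair implied by the ranking. The sketch below assumes you already have the answers ordered best first. How the paper orders GPT-4, GPT-3.5, and original answers is not stated in this summary, and `preference_pairs` is a hypothetical helper name.

```python
from itertools import combinations
from typing import List, Tuple


def preference_pairs(
    prompt: str, ranked_responses: List[str]
) -> List[Tuple[str, str, str]]:
    """Expand a best-first ranking into (prompt, chosen, rejected) triples.

    combinations() keeps input order, so the first element of every pair is
    always ranked above the second.
    """
    return [
        (prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]
```

A ranking of three answers yields three pairs, so a small reward set grows quickly once extra model-generated answers are added to each question.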
Agent Features
Memory
- extended context length 4,096 tokens
Architectures
- Transformer decoder (13B)
Optimization Features
Token Efficiency
- 4,096-token context for longer documents
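When a patient history exceeds the context window, the input has to be split or truncated before it reaches the model. The following is a minimal sketch of budget-aware chunking. The `tokenize` callable stands in for the model's real tokenizer, and the 512-token reserve for the generated answer is an assumed value, not something the paper specifies.

```python
from typing import Callable, List


def split_to_context(
    text: str,
    tokenize: Callable[[str], List[str]],  # hypothetical tokenizer wrapper
    max_tokens: int = 4096,                # context length from the summary
    reserve_for_answer: int = 512,         # assumed headroom for generation
) -> List[List[str]]:
    """Split tokenized text into chunks that leave room for the answer."""
    budget = max_tokens - reserve_for_answer
    if budget <= 0:
        raise ValueError("reserve_for_answer must be smaller than max_tokens")
    tokens = tokenize(text)
    return [tokens[i:i + budget] for i in range(0, len(tokens), budget)]
```

Fixed-size chunks can cut a sentence or a dialogue turn in half, so splitting on turn boundaries first and then applying the budget is usually the safer variant for multi-turn histories.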
Training Optimization
- bf16 mixed-precision
- ZeRO optimizer sharding
- flash-attention
- tensor parallelism (Megatron-LM)
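A rough memory estimate shows why optimizer sharding matters at 13B parameters. The sketch below uses the standard ZeRO accounting for mixed-precision Adam: 2 bytes per parameter for 16-bit weights, 2 for gradients, and 12 for fp32 master weights plus the two Adam moments. The GPU count and ZeRO stage used for ChiMed‑GPT are not given in this summary, so the values passed in are illustrative. The estimate ignores activations and any extra splitting from tensor parallelism.

```python
def model_state_gb(num_params: float, num_gpus: int, zero_stage: int) -> float:
    """Per-GPU model-state memory in GB (1e9 bytes) under ZeRO stages 0-3."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    weights = 2.0 * num_params   # bf16 weights
    grads = 2.0 * num_params     # bf16 gradients
    optim = 12.0 * num_params    # fp32 master weights + Adam momentum + variance
    if zero_stage >= 1:
        optim /= num_gpus        # stage 1 shards optimizer states
    if zero_stage >= 2:
        grads /= num_gpus        # stage 2 also shards gradients
    if zero_stage >= 3:
        weights /= num_gpus      # stage 3 also shards the weights
    return (weights + grads + optim) / 1e9


# Illustrative values only: 13B parameters on a hypothetical 8-GPU node.
print(model_state_gb(13e9, 1, 0))  # 208.0 (no sharding)
print(model_state_gb(13e9, 8, 3))  # 26.0 (everything sharded across 8 GPUs)
```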
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Multi-choice exam accuracy still lags behind GPT-4 on some datasets (C-Eval/CMMLU).
- Human evaluation used small samples (50 QA pairs, 50 dialogues); larger user studies are needed.
- The reward set is small (4K instances) and augmented automatically with GPT-3.5/GPT-4 outputs, so alignment quality depends on the quality of that augmentation.
When Not To Use
- When you need exam-level multi-choice accuracy comparable to GPT-4 on all medical benchmarks.
- When legal/regulatory guarantees or certified clinical-grade performance are required.
- When your application requires retrieval-augmented answers or evidence-grounded citations (the model is not a RAG system).
Failure Modes
- May hallucinate facts or produce incorrect medical suggestions; always verify with clinicians.
- May still underperform on highly specialized sub-domains not covered in CMD or SFT data.
- Alignment via rejection sampling can prefer fluent answers that are not clinically correct.
Core Entities
Models
- ChiMed-GPT
- Ziya-13B-v2
- GPT-4
- GPT-3.5-Turbo
- Ziya-v1
- Ziya-v2
- Baichuan
- Taiyi
- MedicalGPT
- BenTsao
Metrics
- Accuracy
- F1
- BLEU
- ROUGE
- Human eval (fluency, completeness, precision)
- CAMI bias score
- MICA bias score
Datasets
- CMD (Pre-train)
- SFT
- CMD (Reward)
- ChiMed
- CCKS-2019
- ChiMST
- C-Eval (medical subsets)
- CMMLU (medical subsets)
- MedQA (Chinese subset)
- MC (multi-turn medical dialogue dataset)
- MedDialog
- Safety-Prompts
Benchmarks
- NER (CCKS-2019, ChiMST)
- Multi-choice QA (C-Eval, CMMLU, MedQA)
- Open-ended QA (ChiMed)
- Multi-turn dialogue (MC)

