Mix ChatGPT-distilled text with real doctor dialogs, then use RL from AI feedback to make an open-source Chinese medical chatbot that acts (

May 24, 2023 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

18

Authors

Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, Xiang Wan, Benyou Wang, Haizhou Li

Why It Matters For Business

HuatuoGPT offers an open-source Chinese medical assistant that is more interactive and clinically oriented than prior open models; this lowers integration cost for localized medical chat services but still needs clinical oversight before deployment.

Summary TLDR

HuatuoGPT is a Chinese medical LLM built by supervised fine-tuning on a mix of ChatGPT-distilled data and real doctor conversations, then refined with reinforcement learning from AI feedback (RLAIF), in which an AI-derived score serves as the reward signal. The model (based on BLOOMZ-7b1-mt) scores higher than other open-source Chinese medical LLMs on automatic metrics and in human and GPT-4 reviews, and it asks follow-up questions like a doctor. The authors publish code, data, and models but caution that generation-based medical advice still needs careful verification before clinical use.

Problem Statement

General LLMs like ChatGPT produce fluent and informative text but do not behave like doctors (they avoid diagnoses, rarely ask clarifying questions, and can hallucinate). Pure real-world doctor dialogues are accurate but short, inconsistent, and less patient-friendly. The paper asks: can we combine both data types and use AI feedback to train an LLM that is both patient-friendly and doctor-like?

Main Contribution

A two-stage recipe: supervised fine-tuning on hybrid data (ChatGPT-distilled plus real doctor instructions and conversations), followed by RL against a reward model trained on AI feedback (RLAIF).

A public Chinese medical LLM (HuatuoGPT) and associated reward model, code and datasets released on GitHub.

A systematic evaluation: automatic benchmarks, GPT-4 pairwise scoring, and human physician evaluation showing improved interactive diagnosis behavior.
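The first stage of the recipe mixes two data sources before fine-tuning. A minimal sketch of such a mixer is below; the `real_fraction` parameter, the `source` field, and the function name are hypothetical, and the summary does not state the ratio the authors used.

```python
import random


def mix_sft_data(distilled, real, real_fraction=0.5, seed=0):
    """Return a shuffled SFT set where about `real_fraction` of items are real doctor data.

    `real_fraction` is an illustrative knob, not a value taken from the paper.
    """
    if not 0.0 < real_fraction < 1.0:
        raise ValueError("real_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)

    # Start from all real examples and compute how many distilled ones match the ratio.
    n_real = len(real)
    n_distilled = round(n_real * (1.0 - real_fraction) / real_fraction)

    # If there are not enough distilled examples, shrink the real side instead.
    if n_distilled > len(distilled):
        n_distilled = len(distilled)
        n_real = round(n_distilled * real_fraction / (1.0 - real_fraction))

    mixed = rng.sample(real, n_real) + rng.sample(distilled, n_distilled)
    rng.shuffle(mixed)
    return mixed


# Toy usage with placeholder records.
distilled = [{"source": "distilled", "id": i} for i in range(10)]
real = [{"source": "real", "id": i} for i in range(4)]
mixed = mix_sft_data(distilled, real, real_fraction=0.5)
```

Sweeping `real_fraction` is one way to probe the trade-off the paper describes between patient-friendly distilled text and terse doctor replies.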

Key Findings

HuatuoGPT wins most manual single-turn comparisons vs. other open-source Chinese medical models.

Numbers: HuatuoGPT manual win rate vs DoctorGLM 98% (single-turn)

HuatuoGPT produces better multi-turn interactive diagnoses than many baselines.

Numbers: HuatuoGPT manual win rate vs DoctorGLM 86% (multi-turn); vs ChatGPT 58% (multi-turn)

On Chinese medical QA benchmarks, HuatuoGPT attains higher n-gram overlap scores than zero-shot GPT-3.5-turbo and is competitive with fine-tuned baselines (on BLEU-1 it exceeds the fine-tuned T5 scores listed under Results).

Numbers: BLEU-1: cMedQA2 HuatuoGPT 25.37 vs GPT-3.5 19.21; webMedQA 24.61 vs 18.06
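For readers unfamiliar with the metric, BLEU-1 is clipped unigram precision multiplied by a brevity penalty. The sketch below scores one answer against one reference; character-level tokenization is an assumption made here for Chinese text, and the paper's exact tokenization and scoring script may differ. Scores like those above are this value on a 0 to 100 scale, averaged over a test set.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """Single-reference BLEU-1 on character tokens (tokenization is an assumption)."""
    cand = list(candidate)
    ref = list(reference)
    if not cand:
        return 0.0

    # Clipped unigram matches: each candidate token counts at most as often as it
    # appears in the reference.
    ref_counts = Counter(ref)
    overlap = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    precision = overlap / len(cand)

    # Brevity penalty: candidates no longer than the reference are penalized.
    if len(cand) > len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - len(ref) / len(cand))
    return penalty * precision


score = bleu1("头痛多喝水", "头痛要多喝水")
```

Because BLEU-1 rewards surface overlap only, a high score does not by itself indicate a clinically correct answer, which is why the paper also reports GPT-4 and physician evaluations.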

RLAIF changes model behavior toward doctor-like interaction (asking follow-ups).

Numbers: Ablation: model w/o RLAIF did not ask follow-up questions (qualitative)

Results

BLEU-1 (cMedQA2)

Value: 25.37

Baseline: GPT-3.5-turbo 19.21; T5 (fine-tuned) 20.88

BLEU-1 (webMedQA)

Value: 24.61

Baseline: GPT-3.5-turbo 18.06; T5 21.42

GPT-4 automated pairwise overall score ratio

Value: HuatuoGPT set to 1.0

Baseline: GPT-3.5-turbo overall 0.77 (approx.)

Manual evaluation win rate (single-turn)

Value: 98% vs DoctorGLM; 52% vs ChatGPT; 10.5% vs GPT-4

Baseline: DoctorGLM, ChatGPT, GPT-4

Manual evaluation win rate (multi-turn)

Value: 86% vs DoctorGLM; 58% vs ChatGPT

Baseline: DoctorGLM, ChatGPT
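A win rate of this kind is the share of pairwise comparisons in which evaluators preferred one model. A minimal sketch of the tally follows; the label format (a model name or "tie" per comparison) and the separate reporting of ties are assumptions, since the summary does not say how ties were handled.

```python
from collections import Counter


def pairwise_rates(judgments, model):
    """Return win/tie/loss fractions for `model` from per-comparison winner labels."""
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments given")
    wins = counts[model]
    ties = counts["tie"]
    # Any label that is neither the model nor "tie" counts as a loss.
    losses = total - wins - ties
    return {"win": wins / total, "tie": ties / total, "loss": losses / total}


# Toy labels, not data from the paper.
labels = ["HuatuoGPT", "HuatuoGPT", "HuatuoGPT", "tie", "DoctorGLM"]
rates = pairwise_rates(labels, "HuatuoGPT")
```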

Who Should Care

What To Try In 7 Days

Run the HuatuoGPT demo or GitHub model on a small set of local FAQs to compare answers vs current chatbot.

Fine-tune the released model on your clinic's anonymized QA logs to align local practice.

Use the provided reward-model pipeline to nudge the chatbot toward asking clarifying questions on incomplete inputs.
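Before running any RL, a quick way to test the last idea is best-of-n reranking: sample several replies and keep the one a scoring function prefers. The sketch below is a stand-in, not the released pipeline. `toy_reward` is a hypothetical heuristic that favors replies containing a question when the patient message is short; a real setup would call the released reward model instead.

```python
def toy_reward(patient_message, response, min_detail_chars=20):
    """Hypothetical stand-in for a reward model: favor follow-up questions on sparse input."""
    asks_question = "?" in response or "？" in response
    incomplete = len(patient_message.strip()) < min_detail_chars
    if incomplete:
        return 1.0 if asks_question else 0.0
    # With enough detail, this toy scorer is indifferent.
    return 0.5


def pick_best(patient_message, candidates, reward_fn=toy_reward):
    """Best-of-n selection: return the candidate with the highest reward (first on ties)."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda reply: reward_fn(patient_message, reply))


best = pick_best(
    "I have a headache",
    ["Take ibuprofen.", "How long has the headache lasted?"],
)
```

Reranking changes only which sampled reply is shown; it does not update model weights the way the paper's RLAIF stage does.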

Agent Features

Memory

  • short-term dialogue history (multi-turn context)

Architectures

  • BLOOMZ-7b1-mt (base architecture)

Optimization Features

Infra Optimization

  • Trained across 8 A100 GPUs

System Optimization

  • ZeRO-3 for distributed training
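ZeRO-3 is typically enabled through a DeepSpeed configuration. The sketch below builds one as a Python dict that can be dumped to JSON; the batch size, accumulation steps, and bf16 choice are illustrative assumptions, not the authors' settings.

```python
import json


def zero3_config(micro_batch_size=4, grad_accum_steps=8):
    """Build a minimal DeepSpeed config with ZeRO stage 3 (values are illustrative)."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,  # assumption
        "gradient_accumulation_steps": grad_accum_steps,     # assumption
        "bf16": {"enabled": True},                           # assumption: A100s support bf16
        "zero_optimization": {
            "stage": 3,  # partition optimizer state, gradients, and parameters
            "overlap_comm": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
    }


config_json = json.dumps(zero3_config(), indent=2)
```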

Training Optimization

  • SFT
  • During RL, only the last two layers are updated
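Updating only the last layers amounts to selecting which parameters stay trainable. The sketch below picks parameter names by block index. It assumes that "last two layers" means the final two transformer blocks, that names follow a `transformer.h.<index>.` pattern, and that the model has 30 blocks; all three are assumptions to verify against the actual checkpoint.

```python
import re

# Matches the block index in names such as "transformer.h.29.mlp.dense_h_to_4h.weight".
_BLOCK_RE = re.compile(r"(?:^|\.)h\.(\d+)\.")


def trainable_param_names(param_names, num_layers, last_k=2):
    """Return the names that belong to the final `last_k` transformer blocks."""
    first_trainable = num_layers - last_k
    selected = []
    for name in param_names:
        match = _BLOCK_RE.search(name)
        if match and int(match.group(1)) >= first_trainable:
            selected.append(name)
    return selected


names = [
    "transformer.word_embeddings.weight",
    "transformer.h.0.mlp.dense_h_to_4h.weight",
    "transformer.h.28.mlp.dense_h_to_4h.weight",
    "transformer.h.29.self_attention.dense.weight",
    "transformer.ln_f.weight",
]
keep = trainable_param_names(names, num_layers=30)  # 30 blocks is an assumption
```

In a training framework, every parameter outside `keep` would then have gradient updates disabled, which cuts optimizer memory and limits how far RL can move the model from its SFT behavior.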

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Generation-based medical advice is hard to verify and can hallucinate; authors warn against direct clinical deployment.
  • Evaluation focuses on Chinese datasets and may not generalize to other languages or regions.
  • Reward model uses LLM scoring (AI feedback), which risks inheriting judge bias from the scoring LLM.

When Not To Use

  • Do not use as a sole decision-maker for critical or emergency diagnoses.
  • Avoid deploying without human clinician oversight and a verification pipeline.

Failure Modes

  • Hallucinated or incorrect diagnosis presented with confident language.
  • Missing rare or localized conditions not present in training data.
  • Bias toward patterns in distilled ChatGPT data (overly general advice) or concise doctor notes (too terse) depending on mixture.

Core Entities

Models

  • HuatuoGPT
  • BLOOMZ-7b1-mt
  • GPT-3.5-turbo
  • GPT-4
  • BenTsao
  • DoctorGLM
  • T5

Metrics

  • BLEU
  • ROUGE
  • GLEU
  • Distinct-1/2
  • GPT-4 pairwise scoring
  • Physician manual win rate

Datasets

  • cMedQA2
  • webMedQA
  • Huatuo-26M
  • KUAKE-QIC
  • Medical real-world doctor dialogs (refined)

Benchmarks

  • CBLUE / KUAKE-QIC single-turn set
  • cMedQA2
  • webMedQA
  • Huatuo-26M