CMExam: 60K+ Chinese medical multiple-choice questions with explanations and fine-grained annotations

June 5, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The dataset is high-quality and ready for research benchmarking; models still lag humans and need domain tuning before clinical use.

Citations: 32

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 45%

Authors

Junling Liu, Peilin Zhou, Yining Hua, Dading Chong, Zhongyu Tian, Andrew Liu, Helin Wang, Chenyu You, Zhenhua Guo, Lei Zhu, Michael Lingzhi Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CMExam gives a reliable, large-scale way to measure clinical QA performance for Chinese medical LLMs so teams can identify domain gaps and cost-effectively fine-tune small models.

Who Should Care

Summary TLDR

CMExam is a large, authoritative Chinese medical QA dataset built from the Chinese National Medical Licensing Examination. It contains 68,119 pre-processed questions (60K+ retained), multiple-choice answers, and solution explanations. Each test question has five expert-verified annotations (disease group, clinical department, discipline, competency, difficulty). Benchmarks show GPT-4 reaches 61.6% accuracy, against 71.6% for humans. Fine-tuning small models on CMExam substantially improves answer prediction (ChatGLM rises from 26.3% zero-shot to 45.3%) and improves explanation quality. The dataset and code are on GitHub for research use.

Problem Statement

Medical LLM evaluation in Chinese lacks a large, standardized, and authoritative benchmark. Existing datasets are often noisy, small, or sourced from web forums. Without a reliable Chinese medical exam dataset, we cannot measure model accuracy or reasoning across medical subfields.

Main Contribution

A large, curated Chinese medical multiple-choice dataset (CMExam) with 68,119 questions and solution explanations.

Five question-level expert annotations: ICD-11 disease groups, 36 clinical departments, 7 medical disciplines, 4 competency areas, and 5 difficulty levels.

Key Findings

GPT-4 is the top zero-shot answer predictor on CMExam.

Numbers: 61.6% accuracy (GPT-4) vs 71.6% (human)

Practical Use: Expect state-of-the-art LLMs to perform well but still below human experts; use the dataset to measure remaining gaps.

Evidence Ref: Table 3

Fine-tuning small models on CMExam substantially improves accuracy.

Numbers: ChatGLM-CMExam 45.3% vs ChatGLM zero-shot 26.3%

Practical Use: If you need better prediction cheaply, fine-tune lightweight models (LoRA/P-tuning) on CMExam before using large API models.

Evidence Ref: Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 61.6% | Human accuracy 71.6% | -10.0 pp | CMExam test set | Table 3 (Prediction Acc) | Table 3
Accuracy | 45.3% | ChatGLM zero-shot 26.3% | +19.0 pp | CMExam test set | Table 3 (Prediction Acc) | Table 3
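
The deltas above are differences in percentage points (pp), not relative changes. A minimal sketch of the arithmetic, using only the accuracies reported in the results; the helper name `delta_pp` is made up for illustration:

```python
def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(value_pct - baseline_pct, 1)

# Accuracies as reported in the results (Table 3 of the paper).
gpt4_vs_human = delta_pp(61.6, 71.6)          # GPT-4 vs human
finetuned_vs_zeroshot = delta_pp(45.3, 26.3)  # ChatGLM-CMExam vs ChatGLM zero-shot

# The relative gain is a different quantity from the pp delta.
relative_gain = (45.3 - 26.3) / 26.3

print(f"GPT-4 vs human: {gpt4_vs_human:+.1f} pp")
print(f"Fine-tuned vs zero-shot: {finetuned_vs_zeroshot:+.1f} pp")
print(f"Relative gain from fine-tuning: {relative_gain:.0%}")
```

Keeping the two quantities separate avoids reading the +19.0 pp gain as a 19% relative improvement; relative to the 26.3% zero-shot baseline, the gain is considerably larger.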

What To Try In 7 Days

Download CMExam and run your model on the test split to get baseline accuracy and error slices.

Fine-tune a small model with LoRA or P-tuning V2 on CMExam training data and compare accuracy vs zero-shot APIs.

Run per-department and per-disease-group analysis to spot weak subdomains for targeted data collection or rules-based fixes.
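
The baseline and per-slice steps above could be sketched roughly as follows. The record layout and the field names (`department`, `answer`, `prediction`) are assumptions for illustration, not the actual CMExam schema, and the three records are toy data rather than dataset content:

```python
from collections import defaultdict

def accuracy_by_slice(records, key):
    """Return {slice_value: accuracy} for records grouped by `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["prediction"] == r["answer"])
    return {k: correct[k] / total[k] for k in total}

# Toy records with hypothetical field names; replace with your model's
# predictions joined to the test-split annotations.
records = [
    {"department": "Cardiology", "answer": "A", "prediction": "A"},
    {"department": "Cardiology", "answer": "C", "prediction": "B"},
    {"department": "Pediatrics", "answer": "D", "prediction": "D"},
]

overall = sum(r["prediction"] == r["answer"] for r in records) / len(records)
per_department = accuracy_by_slice(records, "department")
weakest = min(per_department, key=per_department.get)

print(f"overall accuracy: {overall:.3f}")
print(f"weakest department: {weakest}")
```

The same function works for any of the five annotations (for example a disease-group or difficulty field) by changing `key`. Slices with very few questions give noisy accuracies, so it may help to report the count alongside each score.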

Agent Features

Frameworks
LoRA, P-tuning V2
Architectures
decoder-only, encoder-only, seq2seq

Optimization Features

Infra Optimization
NVIDIA V100 used for fine-tuning
Model Optimization
LoRA, P-tuning V2 (prefix length=128)
Training Optimization
SFT, batch size and learning rate tuning as reported

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Non-text questions (images/tables) were excluded, which may bias the question mix.

BLEU and ROUGE do not fully capture explanation quality; authors recommend human evaluation.
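
To see why overlap metrics can mislead, here is a simplified, word-level ROUGE-1 F1 written from the standard definition (unigram overlap). It is a sketch, not the evaluation code used by the authors, and the two sentences are invented for illustration. Chinese text would need character-level tokens or a word segmenter instead of `split()`:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two texts."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum((cand & ref).values())  # per-token minimum of the counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Invented example: one changed word reverses the medical meaning.
reference = "aspirin inhibits platelet aggregation"
wrong = "aspirin promotes platelet aggregation"

score_wrong = rouge1_f1(wrong, reference)
score_exact = rouge1_f1(reference, reference)
print(f"wrong explanation: {score_wrong:.2f}, exact match: {score_exact:.2f}")
```

The factually reversed explanation still shares three of four words with the reference, so it scores high on overlap. This is the kind of case where human evaluation adds information that BLEU and ROUGE cannot.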

When Not To Use

Not for automated patient diagnosis or competence certification of individuals.

Not for imaging- or table-based medical tasks (those items were excluded).

Failure Modes

Models can produce plausible but incorrect explanations even when the selected answer is correct.

Poor coverage and accuracy on Traditional Chinese Medicine and some rare departments.

Core Entities

Models

GPT-4, GPT-3.5-turbo, ChatGLM-6B, LLaMA-7B, Alpaca-7B, Vicuna-7B, Huatuo-7B, DoctorGLM-6B, BERT, RoBERTa

Metrics

Accuracy, Weighted F1, BLEU-1, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L

Datasets

CMExam

Benchmarks

CMExam benchmark (answer prediction, reasoning)

Context Entities

Models

MedAlpaca, Huatuo-CMExam, ChatGLM-CMExam

Metrics

Accuracy

Datasets

CNMLE (source)

Benchmarks

few-shot and chain-of-thought prompting experiments