CMExam: 60K+ Chinese medical multiple-choice questions with explanations and fine-grained annotations

Overview

Decision Snapshot: Ready For Pilot

The dataset is high-quality and ready for research benchmarking; models still lag humans and need domain tuning before clinical use.

Citations: 32

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 45%

Authors

Junling Liu, Peilin Zhou, Yining Hua, Dading Chong, Zhongyu Tian, Andrew Liu, Helin Wang, Chenyu You, Zhenhua Guo, Lei Zhu, Michael Lingzhi Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CMExam gives a reliable, large-scale way to measure clinical QA performance for Chinese medical LLMs so teams can identify domain gaps and cost-effectively fine-tune small models.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO, Founder

Summary TLDR

CMExam is a large, authoritative Chinese medical QA dataset built from the Chinese National Medical Licensing Examination. It contains 68,119 pre-processed questions (60K+ retained), multiple-choice answers, and solution explanations. Each test question has five expert-verified annotations (disease group, clinical department, discipline, competency, difficulty). Benchmarks show GPT-4 reaches 61.6% accuracy against a human baseline of 71.6%. Fine-tuning small models on CMExam narrows the gap in answer prediction and improves explanation quality. The dataset and code are on GitHub for research use.

Problem Statement

Medical LLM evaluation in Chinese lacks a large, standardized, and authoritative benchmark. Existing datasets are often noisy, small, or sourced from web forums. Without a reliable Chinese medical exam dataset, we cannot measure model accuracy or reasoning across medical subfields.

Main Contribution

A large, curated Chinese medical multiple-choice dataset (CMExam) with 68,119 questions and solution explanations.

Five question-level expert annotations: ICD-11 disease groups, 36 clinical departments, 7 medical disciplines, 4 competency areas, and 5 difficulty levels.

Key Findings

GPT-4 is the top zero-shot answer predictor on CMExam.

Numbers: 61.6% accuracy (GPT-4) vs 71.6% (human)

Practical Use: Expect state-of-the-art LLMs to perform well but still below human experts; use the dataset to measure the remaining gaps.

Evidence Ref: Table 3

Fine-tuning small models on CMExam substantially improves accuracy.

Numbers: ChatGLM-CMExam 45.3% vs ChatGLM zero-shot 26.3%

Practical Use: If you need better prediction cheaply, fine-tune lightweight models (LoRA/P-tuning) on CMExam before using large API models.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy (GPT-4, zero-shot)	61.6%	Human accuracy 71.6%	-10.0 pp	CMExam test set	Table 3 (Prediction Acc)	Table 3
Accuracy (ChatGLM-CMExam, fine-tuned)	45.3%	ChatGLM zero-shot 26.3%	+19.0 pp	CMExam test set	Table 3 (Prediction Acc)	Table 3
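
The Delta column is expressed in percentage points (an absolute difference), which is easy to confuse with relative change. A minimal sketch of both calculations using the two rows above; the helper names are illustrative and do not come from the CMExam code:

```python
def delta_pp(value, baseline):
    """Absolute difference between two percentages, in percentage points."""
    return round(value - baseline, 1)


def relative_change(value, baseline):
    """Change relative to the baseline, expressed in percent."""
    return round(100 * (value - baseline) / baseline, 1)


# (value, baseline) pairs taken from the Results table.
rows = {
    "GPT-4 vs human": (61.6, 71.6),
    "ChatGLM-CMExam vs ChatGLM zero-shot": (45.3, 26.3),
}

for name, (value, baseline) in rows.items():
    print(
        f"{name}: {delta_pp(value, baseline):+} pp, "
        f"{relative_change(value, baseline):+}% relative"
    )
```

Read this way, the +19.0 pp gain from fine-tuning corresponds to roughly a 72% relative improvement over the zero-shot baseline, while GPT-4 remains 10.0 pp below the human score.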

What To Try In 7 Days

Download CMExam and run your model on the test split to get baseline accuracy and error slices.

Fine-tune a small model with LoRA or P-tuning V2 on CMExam training data and compare accuracy vs zero-shot APIs.

Run per-department and per-disease-group analysis to spot weak subdomains for targeted data collection or rules-based fixes.
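
The baseline and per-slice steps above can be sketched as follows. This is a minimal illustration, assuming each test item has already been paired with a model prediction; the field names (`answer`, `pred`, `department`) and the toy records are hypothetical, and the actual CMExam column names may differ.

```python
from collections import defaultdict


def accuracy(records):
    """Fraction of records whose predicted option equals the gold answer."""
    if not records:
        return 0.0
    return sum(r["pred"] == r["answer"] for r in records) / len(records)


def accuracy_by(records, field):
    """Accuracy per value of an annotation field (e.g. department)."""
    groups = defaultdict(list)
    for r in records:
        groups[r.get(field, "unknown")].append(r)
    return {key: accuracy(group) for key, group in groups.items()}


def weakest_slices(records, field, min_count=1):
    """Slices sorted from lowest to highest accuracy, skipping tiny groups."""
    counts = defaultdict(int)
    for r in records:
        counts[r.get(field, "unknown")] += 1
    scores = accuracy_by(records, field)
    kept = [(key, acc) for key, acc in scores.items() if counts[key] >= min_count]
    return sorted(kept, key=lambda item: item[1])


# Toy records for illustration only; not real CMExam data.
records = [
    {"answer": "A", "pred": "A", "department": "Cardiology"},
    {"answer": "B", "pred": "C", "department": "Cardiology"},
    {"answer": "D", "pred": "D", "department": "Pediatrics"},
    {"answer": "E", "pred": "A", "department": "Traditional Chinese Medicine"},
]

print("overall:", accuracy(records))
for name, acc in weakest_slices(records, "department"):
    print(f"{name}: {acc:.2f}")
```

The same `accuracy_by` call works for any of the five annotation dimensions by changing the `field` argument. Raising `min_count` avoids over-reading accuracy on slices with only a handful of questions.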

Agent Features

Frameworks

LoRA, P-tuning V2

Architectures

decoder-only, encoder-only, seq2seq

Optimization Features

Infra Optimization

NVIDIA V100 used for fine-tuning

Model Optimization

LoRA, P-tuning V2 (prefix length=128)
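
To give a rough sense of why LoRA is considered lightweight: it freezes a weight matrix of shape d_out x d_in and trains two low-rank factors with rank * (d_in + d_out) parameters in total. The sketch below only does that arithmetic. The hidden size and rank are assumed values chosen for illustration; this summary does not report the settings the authors used.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen dense weight matrix."""
    return d_in * d_out


# Assumed values for illustration; not taken from the paper.
d_in = d_out = 4096
rank = 8

lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
print(f"{lora:,} trainable vs {full:,} frozen ({100 * lora / full:.2f}% of the matrix)")
```

Under these assumptions, each adapted matrix adds well under 1% of its original parameter count as trainable weights.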

Training Optimization

SFT, batch size and learning rate tuning as reported

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/williamliujl/CMExam

Data URLs

https://github.com/williamliujl/CMExam

Risks & Boundaries

Limitations

Non-text questions (images/tables) were excluded, which may bias the question mix.

BLEU and ROUGE do not fully capture explanation quality; authors recommend human evaluation.
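
The limitation of n-gram metrics can be seen with a simplified word-overlap score. The function below is a ROUGE-1-style unigram F1 written for illustration; it is not the official BLEU or ROUGE implementation, and the example sentences are invented rather than drawn from CMExam. Chinese text would need character-level or segmenter-based tokenization instead of whitespace splitting.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Word-level unigram overlap F1 between a candidate and a reference."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences for illustration.
reference = "beta blockers lower heart rate"
wrong = "beta blockers raise heart rate"              # factually incorrect
paraphrase = "heart rate is reduced by beta blockers"  # correct, reworded

print("wrong:", round(unigram_f1(wrong, reference), 3))
print("paraphrase:", round(unigram_f1(paraphrase, reference), 3))
```

In this toy case the factually wrong explanation scores higher than the correct paraphrase because it shares more surface words with the reference, which is consistent with the authors' recommendation to add human evaluation.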

When Not To Use

Not for automated patient diagnosis or competence certification of individuals.

Not for imaging- or table-based medical tasks (those items were excluded).

Failure Modes

Models produce plausible but incorrect explanations despite correct answers.

Poor coverage and accuracy on Traditional Chinese Medicine and some rare departments.

Core Entities

Models

GPT-4, GPT-3.5-turbo, ChatGLM-6B, LLaMA-7B, Alpaca-7B, Vicuna-7B, Huatuo-7B, DoctorGLM-6B, BERT, RoBERTa

Metrics

Accuracy, Weighted F1, BLEU-1, BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L

Datasets

CMExam

Benchmarks

CMExam benchmark (answer prediction, reasoning)

Context Entities

Models

MedAlpaca, Huatuo-CMExam, ChatGLM-CMExam

Metrics

Accuracy

Datasets

CNMLE (source)

Benchmarks

few-shot and chain-of-thought prompting experiments
