Overview
The dataset is high-quality and ready for research benchmarking. Models still lag human performance (GPT-4 at 61.6% accuracy vs. 71.6% for humans) and need domain tuning before clinical use.
Citations: 32
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 50%
Novelty: 45%
Why It Matters For Business
CMExam gives a reliable, large-scale way to measure clinical QA performance for Chinese medical LLMs so teams can identify domain gaps and cost-effectively fine-tune small models.
Summary TLDR
CMExam is a large, authoritative Chinese medical QA dataset built from the Chinese National Medical Licensing Examination. It contains 68,119 pre-processed questions (60K+ retained), multiple-choice answers, and solution explanations. Each test question has five expert-verified annotations (disease group, clinical department, discipline, competency, difficulty). Benchmarks show GPT-4 reaches 61.6% accuracy, against 71.6% for humans. Fine-tuning small models on CMExam narrows the gap in answer prediction and improves explanation quality. The dataset and code are on GitHub for research use.
Problem Statement
Medical LLM evaluation in Chinese lacks a large, standardized, and authoritative benchmark. Existing datasets are often noisy, small, or sourced from web forums. Without a reliable Chinese medical exam dataset, we cannot measure model accuracy or reasoning across medical subfields.
Main Contribution
A large, curated Chinese medical multiple-choice dataset (CMExam) with 68,119 questions and solution explanations.
Five question-level expert annotations: ICD-11 disease groups, 36 clinical departments, 7 medical disciplines, 4 competency areas, and 5 difficulty levels.
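One way to picture a single annotated item is as a record holding the question, its options, the answer, the explanation, and the five annotation fields. The sketch below is illustrative only: the class name, field names, placeholder values, and the assumption that an answer is a string of option letters are all hypothetical and do not reflect the schema of the released files.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CMExamItem:
    """Hypothetical in-memory shape of one annotated question (not the released schema)."""
    question: str
    options: Dict[str, str]   # option letter -> option text
    answer: str               # correct option letter(s), e.g. "B" (assumed encoding)
    explanation: str          # solution explanation
    disease_group: str        # ICD-11 disease group
    department: str           # one of 36 clinical departments
    discipline: str           # one of 7 medical disciplines
    competency: str           # one of 4 competency areas
    difficulty: int           # one of 5 difficulty levels, encoded here as 1..5

    def __post_init__(self) -> None:
        # Reject records whose difficulty falls outside the five levels.
        if not 1 <= self.difficulty <= 5:
            raise ValueError(f"difficulty must be in 1..5, got {self.difficulty}")
        # Every answer letter must refer to an existing option.
        missing = [letter for letter in self.answer if letter not in self.options]
        if missing:
            raise ValueError(f"answer refers to unknown options: {missing}")


# Placeholder values only; these are not taken from the dataset.
sample_fields = dict(
    question="<question text>",
    options={"A": "<option A>", "B": "<option B>", "C": "<option C>"},
    answer="B",
    explanation="<solution explanation>",
    disease_group="<ICD-11 group>",
    department="<department>",
    discipline="<discipline>",
    competency="<competency>",
    difficulty=3,
)
example = CMExamItem(**sample_fields)
```

Validating the annotation fields at construction time keeps malformed records out of later per-slice analysis.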
Key Findings
GPT-4 is the top zero-shot answer predictor on CMExam.
Fine-tuning small models on CMExam substantially improves accuracy.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy (GPT-4, zero-shot) | 61.6% | Human accuracy 71.6% | -10.0 pp | CMExam test set | Table 3 (Prediction Acc) | Table 3 |
| Accuracy (after fine-tuning) | 45.3% | ChatGLM zero-shot 26.3% | +19.0 pp | CMExam test set | Table 3 (Prediction Acc) | Table 3 |
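The Delta column is a difference in percentage points (value minus baseline), not a relative change. A minimal sketch that reproduces the two deltas from the reported values; the function name and row labels are illustrative.

```python
def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(value_pct - baseline_pct, 1)


# (label, reported accuracy, baseline accuracy), all in percent
rows = [
    ("vs. human accuracy", 61.6, 71.6),
    ("vs. ChatGLM zero-shot", 45.3, 26.3),
]

for label, value, baseline in rows:
    print(f"{label}: {delta_pp(value, baseline):+.1f} pp")
# prints:
# vs. human accuracy: -10.0 pp
# vs. ChatGLM zero-shot: +19.0 pp
```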
What To Try In 7 Days
Download CMExam and run your model on the test split to get baseline accuracy and error slices.
Fine-tune a small model with LoRA or P-tuning V2 on CMExam training data and compare accuracy vs zero-shot APIs.
Run per-department and per-disease-group analysis to spot weak subdomains for targeted data collection or rules-based fixes.
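The baseline and per-slice steps above can be sketched as a small accuracy breakdown. This assumes predictions have already been collected into a list of dictionaries; the keys `gold`, `pred`, and `department` and the toy records are hypothetical and are not the dataset's actual field names or contents.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple


def accuracy_by_slice(
    records: Iterable[Mapping[str, str]], key: str
) -> Dict[str, Tuple[int, int, float]]:
    """Group records by `key` and return (correct, total, accuracy) per group."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for rec in records:
        bucket = counts[rec[key]]
        bucket[0] += int(rec["pred"] == rec["gold"])
        bucket[1] += 1
    return {name: (c, t, c / t) for name, (c, t) in counts.items()}


# Toy predictions with placeholder slice names, for illustration only.
toy_records = [
    {"gold": "A", "pred": "A", "department": "dept_1"},
    {"gold": "B", "pred": "C", "department": "dept_1"},
    {"gold": "D", "pred": "D", "department": "dept_2"},
    {"gold": "E", "pred": "E", "department": "dept_2"},
]

overall = sum(r["pred"] == r["gold"] for r in toy_records) / len(toy_records)
per_department = accuracy_by_slice(toy_records, "department")

# List the weakest slices first so gaps are easy to spot.
for name, (correct, total, acc) in sorted(per_department.items(), key=lambda kv: kv[1][2]):
    print(f"{name}: {correct}/{total} = {acc:.2f}")
```

The same function can be reused with a different `key` (for example a disease-group field) without changing the evaluation loop. Small slices give noisy estimates, so report the counts alongside the accuracy.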
Risks & Boundaries
Limitations
Non-text questions (images/tables) were excluded, which may bias the question mix.
BLEU and ROUGE do not fully capture explanation quality; authors recommend human evaluation.
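To see why surface-overlap metrics can mislead, consider a simplified unigram-overlap F1 in the spirit of ROUGE-1. This is a sketch rather than the official ROUGE implementation. The two sentences are invented English placeholders, and Chinese text would first need character or word segmentation.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1-style F1 over lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Invented example: the candidate differs by one token that reverses the meaning.
reference = "the patient should receive drug X"
candidate = "the patient should avoid drug X"

score = unigram_f1(reference, candidate)
print(f"unigram F1 = {score:.2f}")
```

Five of the six tokens match, so the score is high even though the candidate gives the opposite recommendation. A human reviewer would catch this kind of error; an overlap metric would not.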
When Not To Use
Not for automated patient diagnosis or competence certification of individuals.
Not for imaging- or table-based medical tasks (those items were excluded).
Failure Modes
Models produce plausible but incorrect explanations despite correct answers.
Poor coverage and accuracy on Traditional Chinese Medicine and some rare departments.

