Overview
Solid experimental design and multiple evaluation axes; human evaluation and contamination checks strengthen the claims, but the datasets exclude images and some prompts and model variants were not tested.
Citations: 8
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 25%
Production readiness: 30%
Novelty: 40%
Why It Matters For Business
If you plan to deploy LLMs in clinical workflows, expect lower accuracy and shaky explanations on realistic, complex cases; include clinician review and dataset-specific testing before adoption.
Summary TLDR
The authors build two new medical multiple-choice QA sets with expert-written explanations: JAMA Clinical Challenge (1,524 long, complex real-case questions) and Medbullets (308 recent Step 2/3-style questions). They test seven LLMs (GPT-3.5, GPT-4, PaLM 2, Llama 2/3, MedAlpaca, Meerkat). The new datasets are measurably harder than prior benchmarks: top models lose accuracy (GPT-4 drops by more than 12 percentage points on the newer Medbullets relative to MedQA). Chain-of-thought prompting helps on some benchmarks; few-shot prompting helps only some models. Automatic explanation metrics disagree with human judgments, so human evaluation remains necessary. The paper releases code and the Medbullets data; JAMA access requires a license.
Problem Statement
Existing medical QA benchmarks (board exams, textbook-style questions) are too easy or lack expert explanations. That limits testing whether LLMs can explain and reason about complex, real clinical cases. The authors create harder datasets with expert-written explanations and evaluate multiple LLMs on answering and explaining such cases.
Main Contribution
Two new English multiple-choice medical QA datasets with expert-written explanations: JAMA Clinical Challenge (1,524 cases) and Medbullets (308 Step 2/3-style questions).
Comprehensive evaluation of seven LLMs (closed- and open-source, general and medical) on answer accuracy and explanation quality using zero-shot, few-shot, and chain-of-thought prompts.
Key Findings
The authors release two expert-explained datasets: JAMA Clinical Challenge (1,524 cases) and Medbullets (308 cases).
State-of-the-art LLMs perform worse on the new datasets than on prior benchmarks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy (GPT-4, X→Y) | MedQA-4: 78.63%; Medbullets-4: 66.23%; JAMA: 67.32% | MedQA-4 | MedQA-4 → Medbullets-4: −12.40 pp | MedQA-4, Medbullets-4, JAMA | Table 2 (X→Y rows) | Table 2 |
| Accuracy (GPT-4, X→RY) | MedQA-4: 82.64%; Medbullets-4: 68.83%; JAMA: 67.13% | GPT-4 X→Y | MedQA-4 +4.01 pp (78.63→82.64); Medbullets-4 +2.60 pp (66.23→68.83); JAMA −0.19 pp (67.32→67.13) | MedQA-4, Medbullets-4, JAMA | Table 2 (X→RY rows) | Table 2 |
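The deltas in the table are differences between accuracy percentages, so they are best read as percentage points rather than relative percent. A minimal sketch of that arithmetic follows, using only the accuracy values listed in the table; the helper names `pp_delta` and `relative_change` are illustrative and not taken from the paper.

```python
# Accuracy values (percent) copied from the Results table.
direct = {"MedQA-4": 78.63, "Medbullets-4": 66.23, "JAMA": 67.32}  # X→Y rows
cot = {"MedQA-4": 82.64, "Medbullets-4": 68.83, "JAMA": 67.13}     # X→RY rows


def pp_delta(new: float, old: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(new - old, 2)


def relative_change(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`, in percent."""
    return round(100.0 * (new - old) / old, 2)


# Cross-dataset drop for the X→Y rows (MedQA-4 → Medbullets-4).
cross_drop_pp = pp_delta(direct["Medbullets-4"], direct["MedQA-4"])
cross_drop_rel = relative_change(direct["Medbullets-4"], direct["MedQA-4"])

# Per-dataset change when moving from the X→Y rows to the X→RY rows.
cot_gain_pp = {name: pp_delta(cot[name], direct[name]) for name in direct}

print(cross_drop_pp, cross_drop_rel)
print(cot_gain_pp)
```

The two helpers give different numbers for the same pair of accuracies, which is why the unit should always be stated alongside a delta.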
What To Try In 7 Days
Run your model on a sample of JAMA/Medbullets cases to detect hidden weaknesses.
Compare zero-shot, few-shot, and CoT prompts on your target model, measuring both answer accuracy and explanation quality.
Add a simple human-in-the-loop check for explanations flagged as low-confidence or containing new/made-up options.
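The prompt comparison above could be prototyped with a small harness like the one below. It is a sketch under stated assumptions: the prompt wording, the `Answer: <letter>` output convention, the question dictionary layout, and the `ask_model` callable are all hypothetical and are not the paper's templates or any vendor's API. The stub model and toy questions exist only so the example runs without network access. The sketch covers answer accuracy only; explanation quality still needs human review.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical question layout: stem text, lettered options, gold answer letter.
Question = Dict[str, object]


def build_prompt(q: Question, style: str) -> str:
    """Render one question as a direct-answer or chain-of-thought prompt."""
    options = "\n".join(f"{k}. {v}" for k, v in sorted(q["options"].items()))
    base = f"{q['stem']}\n{options}\n"
    if style == "cot":
        return base + "Think step by step, then finish with 'Answer: <letter>'."
    return base + "Reply with 'Answer: <letter>' only."


def extract_answer(text: str, valid: Iterable[str]) -> Optional[str]:
    """Return the single answer letter, or None if missing, repeated, or invalid."""
    found = re.findall(r"Answer:\s*([A-Z])\b", text)
    if len(found) == 1 and found[0] in set(valid):
        return found[0]
    return None


def accuracy(questions: List[Question], ask_model: Callable[[str], str], style: str) -> float:
    """Percent of questions answered correctly under the given prompt style."""
    correct = 0
    for q in questions:
        reply = ask_model(build_prompt(q, style))
        if extract_answer(reply, q["options"].keys()) == q["answer"]:
            correct += 1
    return 100.0 * correct / len(questions)


# Toy placeholders, not real clinical items.
toy_questions: List[Question] = [
    {"stem": "Toy question 1", "options": {"A": "x", "B": "y"}, "answer": "A"},
    {"stem": "Toy question 2", "options": {"A": "x", "B": "y"}, "answer": "B"},
]


def always_a(prompt: str) -> str:
    """Stub standing in for a real model call; always picks option A."""
    return "Answer: A"


for style in ("direct", "cot"):
    print(style, accuracy(toy_questions, always_a, style))
```

Replacing `always_a` with a function that calls your own model is the only change needed to run the same comparison on a sample of real cases.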
Risks & Boundaries
Limitations
All images were excluded, although many real cases rely on imaging for diagnosis.
The authors did not explore advanced adaptation methods such as ensembles, dynamic few-shot selection, or few-shot CoT with expert exemplars.
When Not To Use
Do not use these text-only results to validate multimodal clinical tasks requiring images.
Do not substitute automatic explanation metrics for clinician review in high-stakes settings.
Failure Modes
CoT-specific errors: the model outputs 'none of the above', invents new answer choices, or selects multiple answers.
Model hallucinations and incorrect clinical facts in explanations.
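The CoT-specific failure modes listed above can be screened for with a simple heuristic before sending outputs to human review. The sketch below assumes model outputs end with a line of the form `Answer: <letter>`; that convention, the flag names, and the function `flag_cot_output` are assumptions for illustration and not part of the paper. It cannot detect hallucinated clinical facts, which still require clinician review.

```python
import re
from typing import Set


def flag_cot_output(text: str, valid_letters: Set[str]) -> Set[str]:
    """Return heuristic flags for a chain-of-thought output.

    Assumes the final answer appears on a line starting with 'Answer:'.
    """
    flags: Set[str] = set()

    # Failure mode 1: the model refuses all listed options.
    if re.search(r"none of the above", text, re.IGNORECASE):
        flags.add("none_of_the_above")

    # Take the last 'Answer:' line, since CoT reasoning precedes the answer.
    answer_lines = re.findall(r"Answer:[ \t]*(.*)", text)
    if not answer_lines:
        flags.add("no_parseable_answer")
        return flags

    # Standalone capital letters on the answer line are treated as choices.
    letters = re.findall(r"\b[A-Z]\b", answer_lines[-1])
    if not letters:
        flags.add("no_parseable_answer")
    if len(letters) > 1:
        flags.add("multiple_answers")  # Failure mode 3
    if any(letter not in valid_letters for letter in letters):
        flags.add("invented_option")   # Failure mode 2

    return flags


options = {"A", "B", "C", "D"}
print(flag_cot_output("The findings suggest sepsis. Answer: B", options))
print(flag_cot_output("Both fit the picture. Answer: A and C", options))
print(flag_cot_output("A fifth diagnosis is likelier. Answer: E", options))
```

Because the letter matching is a heuristic, a standalone capital such as "I" on the answer line would be misread as a choice, so flagged outputs are best treated as candidates for review rather than confirmed errors.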

