Overview
Production Readiness: 0.3
Novelty Score: 0.4
Cost Impact Score: 0.25
Citation Count: 8
Why It Matters For Business
If you plan to deploy LLMs in clinical workflows, expect lower accuracy and shaky explanations on realistic, complex cases; include clinician review and dataset-specific testing before adoption.
Summary TLDR
The authors build two new medical multiple-choice QA sets with expert-written explanations: JAMA Clinical Challenge (1,524 long, complex real-case questions) and Medbullets (308 recent Step 2/3-style questions). They test seven LLMs (GPT-3.5, GPT-4, PaLM 2, Llama 2/3, MedAlpaca, Meerkat). The new datasets are measurably harder than prior benchmarks: top models lose accuracy (GPT-4 scores more than 12% lower on the newer Medbullets than on MedQA). Chain-of-thought prompting helps on some benchmarks; few-shot prompting helps only some models. Automatic explanation metrics disagree with human judgments, so human evaluation remains necessary. The paper releases code and the Medbullets data; JAMA content requires a license.
Problem Statement
Existing medical QA benchmarks (board exams, textbook-style questions) are too easy or lack expert explanations. That limits testing whether LLMs can explain and reason about complex, real clinical cases. The authors create harder datasets with expert-written explanations and evaluate multiple LLMs on answering and explaining such cases.
Main Contribution
Two new English multiple-choice medical QA datasets with expert-written explanations: JAMA Clinical Challenge (1,524 cases) and Medbullets (308 Step 2/3-style questions).
Comprehensive evaluation of seven LLMs (closed- and open-source, general and medical) on answer accuracy and explanation quality using zero-shot, few-shot, and chain-of-thought prompts.
Automatic and human evaluation of model explanations, plus tests for robustness and data contamination.
Analysis showing the new datasets are harder than previous benchmarks and highlighting gaps in automated metrics for medical explanation quality.
Key Findings
The authors release two expert-explained datasets: JAMA Clinical Challenge (1,524 cases) and Medbullets (308 cases).
State-of-the-art LLMs perform worse on the new datasets than on prior benchmarks.
Prompting behavior varies: chain-of-thought (CoT) helps some models and tasks; few-shot prompting helps some models but often yields marginal or negative changes.
Automatic metrics weakly align with human judgments for explanation quality.
LLMs produce systematic explanation and reasoning errors, including hallucinations and new error types under CoT (e.g., 'none of the above', made-up answers, multiple choices).
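The weak alignment between automatic metrics and human judgments can be checked on your own explanations with a rank correlation. The sketch below is illustrative only: the source does not say which correlation statistic the authors used, so Spearman's rank correlation is an assumption here, and the function names (`ranks`, `pearson`, `spearman`) are hypothetical helpers rather than anything from the paper's code.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_ratings):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_ratings):
        raise ValueError("inputs must have the same length")
    return pearson(ranks(metric_scores), ranks(human_ratings))
```

Pass one automatic metric's per-explanation scores and the matching clinician ratings; a value near zero suggests the metric should not stand in for human review on that data.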
Results
Accuracy
Human eval scores (explanations)
Automatic-to-human correlation (explanations)
Who Should Care
What To Try In 7 Days
Run your model on a sample of JAMA/Medbullets cases to detect hidden weaknesses.
Compare zero-shot, few-shot, and CoT prompts on your target model, measuring both answer accuracy and explanation quality.
Add a simple human-in-the-loop check for explanations flagged as low-confidence or containing new/made-up options.
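The prompt comparison suggested above can be sketched as a small harness. Everything here is an assumption for illustration: the question schema (`question`, `options`, `answer`), the prompt wording, and the `predict` callable (a stand-in for whatever wraps your model and returns a single option letter) are hypothetical and are not the prompts or code used in the paper. The harness measures answer accuracy only; explanation quality still needs human review.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical schema: {"question": str, "options": {"A": str, ...}, "answer": "A"}
Question = Dict[str, object]


def format_question(q: Question) -> str:
    """Render the question stem followed by its lettered options."""
    options = "\n".join(
        f"{label}. {text}" for label, text in sorted(q["options"].items())
    )
    return f"{q['question']}\n{options}"


def build_prompt(q: Question, mode: str, exemplars: Sequence[Question] = ()) -> str:
    """Build a zero-shot, few-shot, or chain-of-thought prompt (wording is illustrative)."""
    body = format_question(q)
    if mode == "zero_shot":
        return f"{body}\nAnswer:"
    if mode == "few_shot":
        shots = "\n\n".join(
            f"{format_question(e)}\nAnswer: {e['answer']}" for e in exemplars
        )
        return f"{shots}\n\n{body}\nAnswer:"
    if mode == "cot":
        return (
            f"{body}\nLet's think step by step, "
            "then state the final answer as a single letter."
        )
    raise ValueError(f"unknown mode: {mode}")


def accuracy(
    questions: List[Question],
    predict: Callable[[str], str],
    mode: str,
    exemplars: Sequence[Question] = (),
) -> float:
    """Fraction of questions where the predicted letter matches the gold answer."""
    if not questions:
        raise ValueError("no questions to evaluate")
    correct = sum(
        predict(build_prompt(q, mode, exemplars)) == q["answer"] for q in questions
    )
    return correct / len(questions)
```

Running `accuracy` once per mode on the same sample gives a like-for-like comparison; keep few-shot exemplars disjoint from the evaluated questions.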
Reproducibility
Code Urls
Data Urls
- Medbullets: https://step2.medbullets.com/ (question links collected via X)
- JAMA: article URLs and a scraper are provided; JAMA content requires a license, per the paper
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- All images were excluded, although many real cases rely on imaging for diagnosis.
- Did not explore advanced adaptation: ensemble methods, dynamic few-shot selection, or few-shot CoT with expert exemplars.
- JAMA dataset access restricted by licensing; full public download is limited.
When Not To Use
- Do not use these text-only results to validate multimodal clinical tasks requiring images.
- Do not substitute automatic explanation metrics for clinician review in high-stakes settings.
- Do not assume few-shot or CoT prompting will uniformly improve performance; test per model and data.
Failure Modes
- CoT-specific errors: outputs 'none of the above', invents new answer choices, or selects multiple answers.
- Model hallucinations and incorrect clinical facts in explanations.
- Low or inconsistent correlation between automatic metrics and human judgment for explanations.
- Smaller medically fine-tuned models (e.g., MedAlpaca, Meerkat) can fail on long/complex JAMA cases.
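The CoT-specific failures listed above can be screened for automatically before a clinician looks at the output. This is a minimal sketch under a stated assumption: the model has been instructed to end its response with a line of the form `Final answer: <letter>`. That output convention, the function name `classify_output`, and the status labels are all hypothetical and do not come from the paper.

```python
import re
from typing import List, Tuple

# Assumed convention: the response contains a line "Final answer: <letter>".
FINAL_RE = re.compile(r"final answer\s*:\s*(.+)", re.IGNORECASE)


def classify_output(output: str, valid_labels: str) -> Tuple[str, List[str]]:
    """Return (status, labels), where status flags a CoT failure mode or "ok"."""
    matches = FINAL_RE.findall(output)
    if not matches:
        return "no_answer", []
    final = matches[-1].strip()  # use the last stated answer
    if "none of the above" in final.lower():
        return "none_of_the_above", []
    # Split on whitespace and common separators, then drop connective words.
    tokens = [t.strip("().").upper() for t in re.split(r"[\s,;/&]+", final)]
    tokens = [t for t in tokens if t and t not in {"AND", "OR"}]
    labels = list(dict.fromkeys(tokens))  # de-duplicate, keep order
    if not labels or any(len(t) != 1 or not t.isalpha() for t in labels):
        return "unparseable", []
    if any(t not in valid_labels for t in labels):
        return "invented_option", labels
    if len(labels) > 1:
        return "multiple_choices", labels
    return "ok", labels
```

Any status other than `"ok"` is a reasonable trigger for routing the case to human review. The parser is deliberately strict, so a free-text final line is reported as `"unparseable"` rather than guessed at.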
Core Entities
Models
- GPT-3.5
- GPT-4
- PaLM 2
- Llama 2
- Llama 3
- MedAlpaca
- Meerkat
Metrics
- Accuracy
- ROUGE-L
- BERTScore
- BLEURT
- BARTScore+
- BARTScore++
- CTC (Relevance, Preservation, Consistency)
- G-Eval (Coherence, Consistency, Relevance)
Datasets
- JAMA Clinical Challenge
- Medbullets
- MedQA
Benchmarks
- MedQA

