New, harder medical QA datasets (JAMA Clinical Challenge, Medbullets) expose limits of LLMs for clinical reasoning and explanations

February 28, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The experimental design is solid and covers multiple evaluation axes, and human evaluation and contamination checks strengthen the claims. However, the datasets exclude images, and some prompts and model variants were not tested.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 25%

Production readiness: 30%

Novelty: 40%

Authors

Hanjie Chen, Zhouxiang Fang, Yash Singla, Mark Dredze

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you plan to deploy LLMs in clinical workflows, expect lower accuracy and shaky explanations on realistic, complex cases; include clinician review and dataset-specific testing before adoption.

Who Should Care

Summary TLDR

The authors build two new medical multiple-choice QA sets with expert-written explanations: JAMA Clinical Challenge (1,524 long, complex real-case questions) and Medbullets (308 recent Step 2/3-style questions). They test seven LLMs (GPT-3.5, GPT-4, PaLM 2, Llama 2, Llama 3, MedAlpaca, Meerkat). The new datasets are measurably harder than prior benchmarks: top models lose accuracy, with GPT-4 dropping more than 12 percentage points on the newer Medbullets compared with MedQA. Chain-of-thought prompting helps on some benchmarks; few-shot prompting helps only some models. Automatic explanation metrics disagree with human judgments, so human evaluation remains necessary. The paper releases code and the Medbullets data; access to the JAMA content requires a license.

Problem Statement

Existing medical QA benchmarks (board exams, textbook-style questions) are too easy or lack expert explanations. That limits testing whether LLMs can explain and reason about complex, real clinical cases. The authors create harder datasets with expert-written explanations and evaluate multiple LLMs on answering and explaining such cases.

Main Contribution

Two new English multiple-choice medical QA datasets with expert-written explanations: JAMA Clinical Challenge (1,524 cases) and Medbullets (308 Step 2/3-style questions).

Comprehensive evaluation of seven LLMs (closed- and open-source, general and medical) on answer accuracy and explanation quality using zero-shot, few-shot, and chain-of-thought prompts.

Key Findings

The authors release two expert-explained datasets: JAMA Clinical Challenge (1,524 cases) and Medbullets (308 cases).

Numbers: JAMA = 1,524; Medbullets = 308 (Table 1)

Practical Use: Use these sets to test LLMs on longer, more complex clinical cases and on producing explanations, not just answers.

Evidence Ref: Dataset §2 and Table 1

State-of-the-art LLMs perform worse on the new datasets than on prior benchmarks.

Numbers: GPT-4 accuracy: MedQA-4 78.63% → Medbullets-4 66.23% (a drop of 12.40 percentage points); Table 2

Practical Use: If your model passed older medical benchmarks, expect lower real-world performance on recent, complex clinical cases.

Evidence Ref: Results §4.1 and Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | MedQA-4: 78.63%; Medbullets-4: 66.23%; JAMA: 67.32% | MedQA-4 | MedQA-4 → Medbullets-4: −12.40 pp | MedQA-4, Medbullets-4, JAMA | Table 2 (X→Y rows) | Table 2
Accuracy | MedQA-4: 82.64%; Medbullets-4: 68.83%; JAMA: 67.13% | GPT-4 X→Y | MedQA-4: +4.01 pp (78.63 → 82.64); Medbullets-4: +2.60 pp | MedQA-4, Medbullets-4, JAMA | Table 2 (X→RY rows) | Table 2
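The deltas in the Results rows mix two ideas: an absolute change in percentage points (pp) and a change relative to the baseline. The following is a minimal sketch that recomputes both from the accuracy values quoted above. The variable and function names (`zero_shot`, `cot`, `pp_delta`, `relative_change`) are illustrative and do not come from the paper or its code.

```python
# Accuracy values (percent) copied from the two Results rows above.
zero_shot = {"MedQA-4": 78.63, "Medbullets-4": 66.23, "JAMA": 67.32}  # X→Y rows
cot = {"MedQA-4": 82.64, "Medbullets-4": 68.83, "JAMA": 67.13}        # X→RY rows


def pp_delta(new: float, old: float) -> float:
    """Absolute change in percentage points, rounded to 2 decimals."""
    return round(new - old, 2)


def relative_change(new: float, old: float) -> float:
    """Change relative to the old value, in percent, rounded to 2 decimals."""
    return round((new - old) / old * 100, 2)


# Shift from the older benchmark to the newer one under the same prompt.
dataset_shift_pp = pp_delta(zero_shot["Medbullets-4"], zero_shot["MedQA-4"])
dataset_shift_rel = relative_change(zero_shot["Medbullets-4"], zero_shot["MedQA-4"])

# Effect of switching prompts (X→Y to X→RY) on each dataset.
cot_gain_pp = {name: pp_delta(cot[name], zero_shot[name]) for name in zero_shot}

print(dataset_shift_pp, dataset_shift_rel)
print(cot_gain_pp)
```

Reporting the MedQA-4 to Medbullets-4 change as 12.40 pp rather than as a percentage avoids confusion with the relative drop, which is larger because it is measured against the 78.63% baseline.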

What To Try In 7 Days

Run your model on a sample of JAMA/Medbullets cases to detect hidden weaknesses.

Compare zero-shot, few-shot, and CoT prompts on your target model, and measure both answer accuracy and explanation quality.

Add a simple human-in-the-loop check for explanations flagged as low-confidence or containing new/made-up options.
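A prompt comparison like the one suggested above can be set up as a small harness. The sketch below is one possible shape, not the authors' evaluation code: the prompt templates in `PROMPTS` are hypothetical wording, `ask_model` stands for whatever function calls your model, and `stub_model` plus the placeholder cases exist only so the example runs. Few-shot prompting is left out because it needs exemplar cases.

```python
import re
from typing import Callable, Dict, List

# Hypothetical templates; the paper's exact prompt wording is not reproduced here.
PROMPTS: Dict[str, str] = {
    "zero_shot": "{question}\n{options}\nAnswer with a single letter.",
    "cot": "{question}\n{options}\nThink step by step, then end with 'Answer: <letter>'.",
}


def format_options(options: Dict[str, str]) -> str:
    """Render options as 'A. text' lines in letter order."""
    return "\n".join(f"{key}. {text}" for key, text in sorted(options.items()))


def extract_letter(reply: str, valid_letters: str) -> str:
    """Return the chosen option letter, or '' if no valid letter is found."""
    match = re.search(r"Answer:\s*([A-Z])\b", reply)
    letter = match.group(1) if match else reply.strip()[:1].upper()
    return letter if letter and letter in valid_letters else ""


def accuracy_by_prompt(cases: List[dict], ask_model: Callable[[str], str]) -> Dict[str, float]:
    """Run every case under every prompt style and return accuracy per style."""
    scores: Dict[str, float] = {}
    for name, template in PROMPTS.items():
        correct = 0
        for case in cases:
            prompt = template.format(
                question=case["question"], options=format_options(case["options"])
            )
            predicted = extract_letter(ask_model(prompt), "".join(case["options"]))
            correct += predicted == case["answer"]
        scores[name] = correct / len(cases)
    return scores


# Placeholder cases, NOT real JAMA or Medbullets items.
toy_cases = [
    {"question": "Placeholder case 1", "options": {"A": "one", "B": "two"}, "answer": "B"},
    {"question": "Placeholder case 2", "options": {"A": "one", "B": "two"}, "answer": "B"},
    {"question": "Placeholder case 3", "options": {"A": "one", "B": "two"}, "answer": "A"},
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with your own client."""
    if "step by step" in prompt:
        return "The findings point to the second option. Answer: B"
    return "A"


scores = accuracy_by_prompt(toy_cases, stub_model)
print(scores)
```

Keeping the model call behind a plain callable makes it easy to swap models while holding the prompts and scoring fixed, so differences in accuracy can be attributed to the prompt style.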

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Medbullets: https://step2.medbullets.com/ (collected links via X)

JAMA: article URLs and scraper provided; JAMA content requires license per paper

Risks & Boundaries

Limitations

All images were excluded—many real cases rely on imaging for diagnosis.

The authors did not explore advanced adaptation methods: ensemble methods, dynamic few-shot selection, or few-shot CoT with expert exemplars.

When Not To Use

Do not use these text-only results to validate multimodal clinical tasks requiring images.

Do not substitute automatic explanation metrics for clinician review in high-stakes settings.

Failure Modes

CoT-specific errors: the model outputs 'none of the above', invents new answer choices, or selects multiple answers.

Model hallucinations and incorrect clinical facts in explanations.
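The CoT-specific errors listed above are mechanical enough to screen for before a clinician looks at the output. The sketch below is a rough heuristic under stated assumptions: it expects only the text that follows the model's final "Answer:" marker, it treats any standalone capital letter as an option letter, and the labels and the `classify_answer` and `needs_review` names are invented for illustration. It does not detect hallucinated clinical facts, which still need human review.

```python
import re
from typing import List


def classify_answer(final_answer: str, valid_letters: List[str]) -> str:
    """Label the final answer text of a multiple-choice response.

    Returns 'ok', 'none_of_the_above', 'multiple_answers',
    'invented_option', or 'no_answer'.
    """
    text = final_answer.strip()
    if re.search(r"none of the above", text, flags=re.IGNORECASE):
        return "none_of_the_above"
    # Standalone capital letters, de-duplicated in order of appearance.
    letters = list(dict.fromkeys(re.findall(r"\b[A-Z]\b", text)))
    if not letters:
        return "no_answer"
    if len(letters) > 1:
        return "multiple_answers"
    return "ok" if letters[0] in valid_letters else "invented_option"


def needs_review(final_answer: str, valid_letters: List[str]) -> bool:
    """Route anything other than a single valid choice to a human reviewer."""
    return classify_answer(final_answer, valid_letters) != "ok"


options = ["A", "B", "C", "D"]
for sample in ["B", "A and C", "E", "None of the above", ""]:
    print(repr(sample), classify_answer(sample, options))
```

Because the letter match is deliberately loose, an answer such as "B. A rare variant" would be flagged as multiple answers. For a review queue that errs on the side of caution this kind of false positive is usually acceptable.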

Core Entities

Models

GPT-3.5, GPT-4, PaLM 2, Llama 2, Llama 3, MedAlpaca, Meerkat

Metrics

Accuracy, ROUGE-L, BERTScore, BLEURT, BARTScore+, BARTScore++, CTC (Relevance, Preservation, Consistency), G-Eval (Coherence, Consistency, Relevance)
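Of the metrics above, ROUGE-L is the simplest to reason about: it scores a generated explanation by its longest common subsequence (LCS) of tokens with the reference. The sketch below is a simplified version, assuming lowercase whitespace tokenization and a plain F1 combination of precision and recall. It is not the scoring setup used in the paper, and standard implementations add options such as stemming.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 between a reference and a candidate text."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))
```

Because the score depends only on shared token order, a correct explanation phrased differently from the expert's can score low, and a fluent but clinically wrong one can score high. That is one plausible reason lexical metrics can diverge from human judgments of explanation quality.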

Datasets

JAMA Clinical Challenge, Medbullets, MedQA

Benchmarks

MedQA