New, harder medical QA datasets (JAMA Clinical Challenge, Medbullets) expose limits of LLMs for clinical reasoning and explanations

Overview

Decision Snapshot: Needs Validation

Solid experimental design and multiple evaluation axes; human evaluation and contamination checks strengthen the claims, but the datasets exclude images, and some prompting strategies and model variants were not tested.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 25%

Production readiness: 30%

Novelty: 40%

Authors

Hanjie Chen, Zhouxiang Fang, Yash Singla, Mark Dredze

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you plan to deploy LLMs in clinical workflows, expect lower accuracy and shaky explanations on realistic, complex cases; include clinician review and dataset-specific testing before adoption.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, Founder

Summary TLDR

The authors build two new medical multiple-choice QA sets with expert-written explanations: JAMA Clinical Challenge (1,524 long, complex real-case questions) and Medbullets (308 recent Step 2/3-style questions). They test seven LLMs (GPT-3.5, GPT-4, PaLM 2, Llama 2, Llama 3, MedAlpaca, Meerkat). The new datasets are measurably harder than prior benchmarks: GPT-4's accuracy falls by more than 12 percentage points from MedQA to the newer Medbullets. Chain-of-thought prompting helps on some benchmarks, and few-shot prompting helps only some models. Automatic explanation metrics disagree with human judgments, so human evaluation remains necessary. The paper releases code and Medbullets data; JAMA content requires a license.

Problem Statement

Existing medical QA benchmarks (board exams, textbook-style questions) are too easy or lack expert explanations. That limits testing whether LLMs can explain and reason about complex, real clinical cases. The authors create harder datasets with expert-written explanations and evaluate multiple LLMs on answering and explaining such cases.

Main Contribution

Two new English multiple-choice medical QA datasets with expert-written explanations: JAMA Clinical Challenge (1,524 cases) and Medbullets (308 Step 2/3-style questions).

Comprehensive evaluation of seven LLMs (closed- and open-source, general and medical) on answer accuracy and explanation quality using zero-shot, few-shot, and chain-of-thought prompts.

Key Findings

The authors release two expert-explained datasets: JAMA Clinical Challenge (1,524 cases) and Medbullets (308 cases).

Numbers: JAMA=1,524; Medbullets=308; Table 1

Practical Use: Use these sets to test LLMs on longer, more complex clinical cases and on producing explanations, not just answers.

Evidence Ref: Dataset §2 and Table 1

State-of-the-art LLMs perform worse on the new datasets than on prior benchmarks.

Numbers: GPT-4 accuracy: MedQA-4 78.63% → Medbullets-4 66.23% (drop ≈12.4 percentage points); Table 2

Practical Use: If your model passed older medical benchmarks, expect lower real-world performance on recent, complex clinical cases.

Evidence Ref: Results §4.1 and Table 2
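The drop quoted above is a difference between two accuracy percentages, so it is most precisely read as percentage points rather than as a relative percent. A minimal sketch of both readings, using only the GPT-4 figures reported above (the function names are illustrative and do not come from the paper):

```python
def drop_in_points(before: float, after: float) -> float:
    """Absolute drop in percentage points between two accuracies given in percent."""
    return before - after


def relative_drop(before: float, after: float) -> float:
    """Drop expressed as a percentage of the starting accuracy."""
    return 100.0 * (before - after) / before


medqa_4 = 78.63       # GPT-4 accuracy on MedQA-4, as reported above
medbullets_4 = 66.23  # GPT-4 accuracy on Medbullets-4, as reported above

points = drop_in_points(medqa_4, medbullets_4)
relative = relative_drop(medqa_4, medbullets_4)
print(f"{points:.2f} pp absolute, {relative:.1f}% relative")
```

The absolute gap is about 12.4 percentage points, which corresponds to roughly 15.8% of the MedQA-4 accuracy. Stating which of the two is meant avoids confusion when comparing models with different baselines.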

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	MedQA-4: 78.63%; Medbullets-4: 66.23%; JAMA: 67.32%	MedQA-4	MedQA-4 → Medbullets-4: −12.4 pp	MedQA-4, Medbullets-4, JAMA	Table 2 (X→Y rows)	Table 2
Accuracy	MedQA-4: 82.64%; Medbullets-4: 68.83%; JAMA: 67.13%	GPT-4 X→Y	MedQA-4 +4.01 pp (78.63→82.64); Medbullets-4 +2.60 pp	MedQA-4, Medbullets-4, JAMA	Table 2 (X→RY rows)	Table 2

What To Try In 7 Days

Run your model on a sample of JAMA/Medbullets cases to detect hidden weaknesses.

Compare zero-shot, few-shot, and CoT prompts on your target model—measure both answer accuracy and explanation quality.

Add a simple human-in-the-loop check for explanations flagged as low-confidence or containing new/made-up options.
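The prompt comparison suggested above could be organized along the following lines. This is only a sketch under stated assumptions: the prompt templates, the case dictionary layout, and the `ask_model` callable are all hypothetical stand-ins (the paper's exact prompts are not reproduced here), few-shot prompting is left out for brevity, and the stub model would be replaced by a call to your own model.

```python
from typing import Callable, Dict, List

# Hypothetical templates; not the wording used in the paper.
TEMPLATES: Dict[str, str] = {
    "zero_shot": "{question}\n{options}\nAnswer with a single letter.",
    "cot": "{question}\n{options}\nLet's think step by step, then end with 'Answer: <letter>'.",
}


def format_options(options: Dict[str, str]) -> str:
    """Render options as 'A. text' lines in letter order."""
    return "\n".join(f"{letter}. {text}" for letter, text in sorted(options.items()))


def extract_letter(output: str, valid_letters: str) -> str:
    """Take the first character after the last 'Answer:' (or of the whole output)."""
    tail = output.rsplit("Answer:", 1)[-1].strip()
    first = tail[:1].upper()
    return first if first and first in valid_letters else ""


def accuracy_by_prompt(cases: List[dict], ask_model: Callable[[str], str]) -> Dict[str, float]:
    """Accuracy in percent for each prompt style over the given cases."""
    scores: Dict[str, float] = {}
    for name, template in TEMPLATES.items():
        correct = 0
        for case in cases:
            prompt = template.format(
                question=case["question"], options=format_options(case["options"])
            )
            predicted = extract_letter(ask_model(prompt), "".join(case["options"]))
            correct += predicted == case["answer"]
        scores[name] = 100.0 * correct / len(cases)
    return scores


# Placeholder case for illustration only; it is not taken from either dataset.
toy_cases = [
    {
        "question": "Placeholder vignette.",
        "options": {"A": "Option one", "B": "Option two", "C": "Option three", "D": "Option four"},
        "answer": "B",
    }
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call: always answers B."""
    return "Reasoning omitted. Answer: B" if "step by step" in prompt else "B"


scores = accuracy_by_prompt(toy_cases, stub_model)
print(scores)
```

An empty string from `extract_letter` marks an output that could not be parsed into a valid option; counting those separately from wrong answers is one way to decide which cases go to human review.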

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/HanjieChen/ChallengeClinicalQA

https://anonymous.4open.science/r/ChallengeClinicalQA-E776

Data URLs

Medbullets: https://step2.medbullets.com/ (collected links via X)

JAMA: article URLs and scraper provided; JAMA content requires license per paper

Risks & Boundaries

Limitations

All images were excluded—many real cases rely on imaging for diagnosis.

The authors did not explore advanced adaptation methods such as ensembling, dynamic few-shot selection, or few-shot CoT with expert exemplars.

When Not To Use

Do not use these text-only results to validate multimodal clinical tasks requiring images.

Do not substitute automatic explanation metrics for clinician review in high-stakes settings.

Failure Modes

CoT-specific errors: the model outputs 'none of the above', invents new answer choices, or selects multiple answers.

Model hallucinations and incorrect clinical facts in explanations.
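The CoT-specific errors listed above can be screened for mechanically before an answer is scored. The sketch below is one possible check, not the authors' procedure. It assumes the input is a short final answer line (for example the text after "Answer:"), and the function name and problem labels are hypothetical.

```python
import re
from typing import List


def check_cot_answer(final_answer: str, valid_letters: str) -> List[str]:
    """Return the problems found in a model's final answer line (empty if none)."""
    problems: List[str] = []
    if re.search(r"none of the above", final_answer, flags=re.IGNORECASE):
        problems.append("none_of_the_above")
    # Standalone capital letters, e.g. 'B' in "B. Option two" or 'A', 'C' in "A and C".
    chosen = sorted(set(re.findall(r"\b[A-Z]\b", final_answer)))
    if len(chosen) > 1:
        problems.append("multiple_answers")
    if any(letter not in valid_letters for letter in chosen):
        problems.append("invented_option")
    if not chosen and not problems:
        problems.append("no_answer")
    return problems


examples = ["B", "None of the above", "A and C", "E", ""]
for answer in examples:
    print(repr(answer), check_cot_answer(answer, "ABCD"))
```

Because the letter match is deliberately simple, a longer answer line containing the article "A" or the pronoun "I" would be miscounted, so the check is best applied to the extracted answer only. Hallucinated clinical facts in explanations are not detectable this way and still need clinician review.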

Core Entities

Models

GPT-3.5, GPT-4, PaLM 2, Llama 2, Llama 3, MedAlpaca, Meerkat

Metrics

Accuracy, ROUGE-L, BERTScore, BLEURT, BARTScore+, BARTScore++, CTC (Relevance, Preservation, Consistency), G-Eval (Coherence, Consistency, Relevance)

Datasets

JAMA Clinical Challenge, Medbullets, MedQA

Benchmarks

MedQA
