New, harder medical QA datasets (JAMA Clinical Challenge, Medbullets) expose limits of LLMs for clinical reasoning and explanations

February 28, 2024 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.4

Cost Impact Score

0.25

Citation Count

8

Authors

Hanjie Chen, Zhouxiang Fang, Yash Singla, Mark Dredze

Why It Matters For Business

If you plan to deploy LLMs in clinical workflows, expect lower accuracy and shaky explanations on realistic, complex cases; include clinician review and dataset-specific testing before adoption.

Summary TLDR

The authors build two new medical multiple-choice QA sets with expert-written explanations: JAMA Clinical Challenge (1,524 long, complex real-case questions) and Medbullets (308 recent Step 2/3-style questions). They test seven LLMs (GPT-3.5, GPT-4, PaLM 2, Llama 2, Llama 3, MedAlpaca, Meerkat). The new datasets are measurably harder than prior benchmarks: GPT-4's accuracy falls by more than 12 percentage points from MedQA to Medbullets. Chain-of-thought prompting helps on some benchmarks, and few-shot prompting helps only some models. Automatic explanation metrics disagree with human judgments, so human evaluation remains necessary. The paper releases code and the Medbullets data; access to the JAMA data requires a license.

Problem Statement

Existing medical QA benchmarks (board exams, textbook-style questions) are too easy or lack expert explanations. That limits testing whether LLMs can explain and reason about complex, real clinical cases. The authors create harder datasets with expert-written explanations and evaluate multiple LLMs on answering and explaining such cases.

Main Contribution

Two new English multiple-choice medical QA datasets with expert-written explanations: JAMA Clinical Challenge (1,524 cases) and Medbullets (308 Step 2/3-style questions).

Comprehensive evaluation of seven LLMs (closed- and open-source, general and medical) on answer accuracy and explanation quality using zero-shot, few-shot, and chain-of-thought prompts.

Automatic and human evaluation of model explanations, plus tests for robustness and data contamination.

Analysis showing the new datasets are harder than previous benchmarks and highlighting gaps in automated metrics for medical explanation quality.

Key Findings

The authors release two expert-explained datasets: JAMA Clinical Challenge (1,524 cases) and Medbullets (308 cases).

Numbers: JAMA = 1,524; Medbullets = 308; Table 1

State-of-the-art LLMs perform worse on the new datasets than on prior benchmarks.

Numbers: GPT-4 accuracy: MedQA-4 78.63% → Medbullets-4 66.23% (drop ≈ 12.4 percentage points); Table 2

Prompting behavior varies: chain-of-thought (CoT) helps some models and tasks; few-shot helps some models but often gives marginal or negative change.

Numbers: GPT-4 on MedQA-4: X→Y (answer only) 78.63% → X→RY (rationale, then answer) 82.64% (+4.0 percentage points); other models show small or no gains; §4.2–4.3

Automatic metrics weakly align with human judgments for explanation quality.

Numbers: G-Eval Relevance vs human Correctness for GPT-4: Pearson ≈ 0.22; Table 5

LLMs produce systematic explanation and reasoning errors, including hallucinations and new error types under CoT (e.g., 'none of the above', made-up answers, multiple choices).

Numbers: CoT 'None' error rates up to ~14% for some models; Table 14

Results

Accuracy (GPT-4, X→Y)

Value: MedQA-4: 78.63%; Medbullets-4: 66.23%; JAMA: 67.32%

Baseline: MedQA-4

Accuracy (GPT-4, X→RY)

Value: MedQA-4: 82.64%; Medbullets-4: 68.83%; JAMA: 67.13%

Baseline: GPT-4 X→Y
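
The two accuracy rows above are easier to compare once the differences are computed explicitly. The sketch below is a minimal illustration, not the authors' evaluation code. It takes both rows to be GPT-4 results, as the Key Findings state, and the helper names `point_change` and `relative_change` are made up for this example. It separates absolute change (percentage points) from relative change (percent of the starting value), which are easy to confuse.

```python
# Accuracy values (percent) copied from the Results rows above.
ACCURACY = {
    "X→Y": {"MedQA-4": 78.63, "Medbullets-4": 66.23, "JAMA": 67.32},
    "X→RY": {"MedQA-4": 82.64, "Medbullets-4": 68.83, "JAMA": 67.13},
}


def point_change(before: float, after: float) -> float:
    """Absolute change in percentage points."""
    return round(after - before, 2)


def relative_change(before: float, after: float) -> float:
    """Change as a percent of the starting value."""
    return round(100.0 * (after - before) / before, 2)


direct = ACCURACY["X→Y"]

# Difficulty gap: MedQA-4 versus Medbullets-4, answer-only prompting.
drop_points = point_change(direct["MedQA-4"], direct["Medbullets-4"])
drop_relative = relative_change(direct["MedQA-4"], direct["Medbullets-4"])

# Effect of chain-of-thought (X→RY) on each dataset, in percentage points.
cot_gain = {
    dataset: point_change(direct[dataset], ACCURACY["X→RY"][dataset])
    for dataset in direct
}

print(drop_points, drop_relative)
print(cot_gain)
```

On these figures the MedQA-4 to Medbullets-4 gap is 12.4 points (about 15.8% relative), and the chain-of-thought effect is +4.01 points on MedQA-4, +2.6 on Medbullets-4, and -0.19 on JAMA.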

Human eval scores (explanations)

Value: GPT-4 Completeness 3.35; Correctness 4.45; Relevance 4.61 (1–5 scale)

Baseline: PaLM 2 on same samples: Completeness 2.67; Correctness 4.35; Relevance 4.53

Automatic-to-human correlation (explanations)

Value: G-Eval Relevance vs Human Correctness Pearson ≈ 0.22

Baseline: perfect alignment would be near 1.0
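
To run the same kind of check on your own explanations, the Pearson coefficient between an automatic metric and human ratings can be computed with a few lines of standard-library Python. This is a sketch under stated assumptions. The function `pearson` is written for this example, and the two score lists are invented placeholders, not data from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores for five explanations (NOT from the paper):
# an automatic metric on a 0-1 scale and human ratings on a 1-5 scale.
metric_scores = [0.71, 0.64, 0.80, 0.58, 0.77]
human_scores = [4, 5, 4, 3, 5]

print(f"Pearson r = {pearson(metric_scores, human_scores):.2f}")
```

A value near the paper's reported 0.22 would mean the metric explains little of the variation in human ratings, so it should not stand in for clinician review.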

Who Should Care

What To Try In 7 Days

Run your model on a sample of JAMA/Medbullets cases to detect hidden weaknesses.

Compare zero-shot, few-shot, and CoT prompts on your target model, measuring both answer accuracy and explanation quality.

Add a simple human-in-the-loop check for explanations flagged as low-confidence or containing new/made-up options.
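
The prompt comparison suggested above can be set up with a small harness. The following is a minimal sketch, not the paper's code. The prompt wording, the `Answer:` marker, the case dictionary layout, and the `ask_model` callable are all assumptions introduced for illustration. Plug in whatever client calls your own model.

```python
from typing import Callable, Dict, List


def build_prompt(question: str, options: Dict[str, str], mode: str) -> str:
    """Format one multiple-choice case for answer-only or CoT prompting."""
    lines = [question, ""]
    lines += [f"{letter}. {text}" for letter, text in sorted(options.items())]
    if mode == "X→Y":  # answer only
        lines.append("Reply with 'Answer: <letter>' and nothing else.")
    elif mode == "X→RY":  # rationale first, then answer (chain-of-thought)
        lines.append("Think step by step, then end with 'Answer: <letter>'.")
    else:
        raise ValueError(f"unknown mode: {mode}")
    return "\n".join(lines)


def accuracy(cases: List[dict], ask_model: Callable[[str], str], mode: str) -> float:
    """Percent of cases where the first letter after 'Answer:' is correct."""
    correct = 0
    for case in cases:
        reply = ask_model(build_prompt(case["question"], case["options"], mode))
        predicted = reply.strip().rsplit("Answer:", 1)[-1].strip()[:1].upper()
        correct += predicted == case["answer"]
    return 100.0 * correct / len(cases)


# Toy cases and a stub model that always answers "A" (placeholders only).
toy_cases = [
    {"question": "Q1?", "options": {"A": "x", "B": "y"}, "answer": "A"},
    {"question": "Q2?", "options": {"A": "x", "B": "y"}, "answer": "B"},
]


def always_a(prompt: str) -> str:
    return "Some reasoning. Answer: A"


for mode in ("X→Y", "X→RY"):
    print(mode, accuracy(toy_cases, always_a, mode))
```

Keeping the model behind a plain callable means the same harness runs unchanged across models, which makes per-model differences in prompting behavior easy to see.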

Reproducibility

Data Urls

  • Medbullets: https://step2.medbullets.com/ (collected links via X)
  • JAMA: article URLs and scraper provided; JAMA content requires license per paper

Code Available

  • yes

Data Available

  • partial (Medbullets released; JAMA requires a license)

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • All images were excluded—many real cases rely on imaging for diagnosis.
  • Did not explore advanced adaptation: ensemble methods, dynamic few-shot selection, or few-shot CoT with expert exemplars.
  • JAMA dataset access restricted by licensing; full public download is limited.

When Not To Use

  • Do not use these text-only results to validate multimodal clinical tasks requiring images.
  • Do not substitute automatic explanation metrics for clinician review in high-stakes settings.
  • Do not assume few-shot or CoT prompting will uniformly improve performance; test per model and data.

Failure Modes

  • CoT-specific errors: outputs 'none of the above', invents new answer choices, or selects multiple answers.
  • Model hallucinations and incorrect clinical facts in explanations.
  • Low or inconsistent correlation between automatic metrics and human judgment for explanations.
  • Smaller medically fine-tuned models (e.g., MedAlpaca, Meerkat) can fail on long, complex JAMA cases.
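
The CoT-specific errors listed above can be flagged automatically before a reply reaches a reviewer. The sketch below is a rough heuristic, assuming replies end with an `Answer:` marker. That marker, the function `classify_answer`, and its category labels are inventions for this example, not the paper's error taxonomy or tooling.

```python
import re


def classify_answer(reply: str, valid_letters: str = "ABCD") -> str:
    """Label a model reply as valid, none, multiple, invented, or unparsed."""
    if "Answer:" not in reply:
        return "unparsed"  # no final answer marker found
    tail = reply.rsplit("Answer:", 1)[-1]
    if re.search(r"\bnone\b", tail, flags=re.IGNORECASE):
        return "none"  # e.g. "None of the above"
    letters = set(re.findall(r"\b([A-Z])\b", tail))
    if not letters:
        return "unparsed"
    if len(letters) > 1:
        return "multiple"  # more than one option selected
    if letters <= set(valid_letters):
        return "valid"
    return "invented"  # a letter that is not among the offered options


replies = [
    "The rash pattern suggests... Answer: B",
    "No option fits. Answer: None of the above",
    "Both are plausible. Answer: A and C",
    "Answer: E",
    "I think it is probably B",
]
for reply in replies:
    print(classify_answer(reply), "|", reply)
```

Anything not labeled `valid` is a candidate for the human-in-the-loop check. Because the letter match is purely lexical, text that follows the chosen letter can trigger a false `multiple`, so treat the labels as a triage signal only.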

Core Entities

Models

  • GPT-3.5
  • GPT-4
  • PaLM 2
  • Llama 2
  • Llama 3
  • MedAlpaca
  • Meerkat

Metrics

  • Accuracy
  • ROUGE-L
  • BERTScore
  • BLEURT
  • BARTScore+
  • BARTScore++
  • CTC (Relevance, Preservation, Consistency)
  • G-Eval (Coherence, Consistency, Relevance)

Datasets

  • JAMA Clinical Challenge
  • Medbullets
  • MedQA

Benchmarks

  • MedQA