GPT-4 exceeds USMLE pass threshold and outperforms prior models on medical benchmarks

March 20, 2023 · 9 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

497

Authors

Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, Eric Horvitz

Why It Matters For Business

GPT-4 answers medical multiple-choice questions accurately and gives better-calibrated confidence scores than earlier models, making it useful for education, drafting clinical notes, and decision-support prototypes, provided human oversight and validation are in place.

Summary TLDR

This paper measures GPT-4 (text-only) on medical multiple-choice exams and benchmarks. Zero-shot GPT-4 averages ~84% on USMLE sample and self-assessment materials and ~86% with 5-shot prompts, well above the typical passing threshold (~60%) and much higher than GPT-3.5 (~50–59%). GPT-4 is also better calibrated than GPT-3.5, handles many image-referencing questions without images, and shows no evidence of memorizing official USMLE items under the authors' tests. The study focuses on MCQs, uses simple prompting, and emphasizes that real-world clinical use requires expert oversight and further evaluation.

Problem Statement

Measure out-of-the-box capabilities of text-only GPT-4 on medical competency exams (USMLE Steps 1–3) and MultiMedQA benchmarks, compare to GPT-3.5 and reported PaLM-family results, and probe calibration, memorization, and image-dependence using simple zero-shot and 5-shot prompts.
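
To make the zero-shot versus 5-shot distinction concrete, here is a minimal sketch of how the two prompt styles could be assembled. The template wording, the field names (`stem`, `options`, `answer`), and both function names are assumptions for illustration; they are not the prompts used in the paper.

```python
from typing import Mapping, Optional, Sequence


def format_question(stem: str, options: Mapping[str, str],
                    answer: Optional[str] = None) -> str:
    """Render one multiple-choice item; leave the answer blank for the target."""
    lines = [f"Question: {stem}"]
    for label in sorted(options):
        lines.append(f"({label}) {options[label]}")
    # Worked examples carry their answer letter; the target ends with a bare cue.
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target: Mapping, shots: Sequence[Mapping] = ()) -> str:
    """Zero-shot when `shots` is empty; few-shot when worked examples are given."""
    blocks = [format_question(s["stem"], s["options"], s["answer"]) for s in shots]
    blocks.append(format_question(target["stem"], target["options"]))
    return "\n\n".join(blocks)
```

Calling `build_prompt(item)` yields a zero-shot prompt, while `build_prompt(item, five_examples)` prepends five answered items before the unanswered target.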

Main Contribution

Comprehensive zero-shot and 5-shot evaluation of GPT-4 on USMLE Sample Exam and Self Assessments and on MultiMedQA components.

Quantified large gains over GPT-3.5: roughly 27 to 35 percentage points on USMLE materials.

Assessment of calibration showing GPT-4's probabilities align much better with actual correctness than GPT-3.5.

Probed image dependence: text-only GPT-4 still answers many media-referenced questions correctly using reasoning.

Developed and used MELD, a Levenshtein-based heuristic, to probe memorization; found no strong evidence of memorization for the tested USMLE samples.

Compared GPT-4-base (pre-alignment) to the public aligned GPT-4 and observed a 3–5% raw performance drop after alignment.

Key Findings

GPT-4 strongly outperforms GPT-3.5 on USMLE-style multiple-choice tests.

Numbers: USMLE Self Assessment overall: GPT-4 83.76% (zero-shot) vs GPT-3.5 49.1%

GPT-4 exceeds typical USMLE pass thresholds by a large margin.

Numbers: USMLE Sample Exam overall: GPT-4 84.31% (zero-shot); pass ~60%

GPT-4 is notably better calibrated than GPT-3.5.

Numbers: Predictions assigned an average probability of 0.96 are correct 93% of the time for GPT-4 vs 55% for GPT-3.5.

Text-only GPT-4 still answers many image-referencing questions well.

Numbers: On USMLE Sample Exam media questions: GPT-4 (zero-shot) 75.51% vs text-only questions 85.63%

Authors found no clear memorization of official USMLE items using their heuristic.

Numbers: MELD did not regenerate USMLE samples above 50% overlap; by contrast, 17% of SQuAD items were regenerated at 99% overlap

Alignment/safety fine-tuning reduced raw accuracy modestly.

Numbers: GPT-4-base scores ~3–5% higher than publicly released GPT-4 across some datasets

Results

Accuracy (USMLE Self Assessment)

Value: GPT-4 zero-shot 83.76%; GPT-4 5-shot 86.65%

Baseline: GPT-3.5 zero-shot 49.1%; GPT-3.5 5-shot 53.61%

Accuracy (USMLE Sample Exam)

Value: GPT-4 zero-shot 84.31%; GPT-4 5-shot 86.7%

Baseline: GPT-3.5 zero-shot 56.91%; GPT-3.5 5-shot 58.78%
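
The size of the gap is easier to judge as percentage-point differences. The short sketch below only subtracts the accuracy figures reported above; the dictionary layout and the function name `point_gains` are illustrative choices.

```python
# Accuracy figures (percent) copied from the two USMLE result entries above.
# Each tuple is (zero-shot, 5-shot).
SCORES = {
    "USMLE Self Assessment": {"GPT-4": (83.76, 86.65), "GPT-3.5": (49.10, 53.61)},
    "USMLE Sample Exam": {"GPT-4": (84.31, 86.70), "GPT-3.5": (56.91, 58.78)},
}


def point_gains(scores):
    """Return {dataset: (zero-shot gain, 5-shot gain)} in percentage points."""
    gains = {}
    for dataset, by_model in scores.items():
        new, old = by_model["GPT-4"], by_model["GPT-3.5"]
        gains[dataset] = tuple(round(n - o, 2) for n, o in zip(new, old))
    return gains


for dataset, (zero_shot, five_shot) in point_gains(SCORES).items():
    print(f"{dataset}: +{zero_shot:.2f} pts zero-shot, +{five_shot:.2f} pts 5-shot")
```

On these figures the gains fall between about 27 and 35 percentage points, with the larger gap on the Self Assessment.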

Performance on MultiMedQA components (examples)

Value: MMLU Clinical Knowledge, GPT-4 zero-shot 86.04% (5-shot 86.42%)

Baseline: GPT-3.5 zero-shot 69.81% (5-shot 68.68%)

Calibration at high confidence

Value: Predictions with average probability 0.96 are correct 93% of the time (GPT-4)

Baseline: Predictions at the same probability are correct 55% of the time (GPT-3.5)
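
A calibration check of this kind compares, within each confidence band, the average predicted probability to the observed fraction of correct answers. The following is a minimal sketch using equal-width bins and expected calibration error; the bin count and function names are assumptions, and this is not the authors' evaluation code.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns one (mean confidence, accuracy, count) row per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, is_correct in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, is_correct))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            rows.append((mean_p, accuracy, len(members)))
    return rows


def expected_calibration_error(rows):
    """Count-weighted average of |mean confidence - accuracy| across bins."""
    total = sum(n for _, _, n in rows)
    return sum(n * abs(p - a) for p, a, n in rows) / total
```

A well-calibrated model produces rows where mean confidence and accuracy are close, so a bin averaging 0.96 confidence should be right roughly 96% of the time.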

Accuracy (media-referencing vs text-only questions)

Value: USMLE Sample Exam media questions, GPT-4 zero-shot 75.51% (text-only 85.63%)

Baseline: GPT-3.5 zero-shot on media 51.02% (text-only 57.8%)

MELD memorization checks

Value: No near-exact regenerations detected for USMLE items; 17% of SQuAD items regenerated at 99% overlap

Baseline: MELD detects known training-set contamination for SQuAD
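
The idea behind a Levenshtein-based memorization probe can be sketched as follows: show the model the first part of an item, ask it to continue, and measure how closely the continuation matches the held-out remainder. Everything here is an assumption for illustration, including the half-and-half split, the 0.95 threshold, and the `generate` callable that stands in for a model call; it is not the authors' MELD implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via two-row DP."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def overlap_ratio(generated: str, reference: str) -> float:
    """1.0 means identical strings; 0.0 means nothing in common."""
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(generated, reference) / longest


def looks_memorized(item: str, generate, threshold: float = 0.95) -> bool:
    """Hypothetical probe: `generate(prefix)` must return the model's continuation."""
    mid = len(item) // 2
    prefix, held_out = item[:mid], item[mid:]
    continuation = generate(prefix)[:len(held_out)]
    return overlap_ratio(continuation, held_out) >= threshold
```

A probe like this can only confirm memorization when it fires; a miss does not show that an item was absent from training data.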

Who Should Care

What To Try In 7 Days

Run GPT-4 on your internal medical QA or training questions to gauge accuracy and calibration vs domain experts.

Prototype an education assistant that explains answers and offers counterfactuals, with faculty review.

Measure GPT-4 confidence calibration on your datasets and flag low-confidence outputs for human review.
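
For the last item, one simple way to route low-confidence outputs to human review is to pick a confidence cut-off on a labelled validation set and flag anything below it. This is a sketch under stated assumptions: the function names, the target-accuracy criterion, and the use of a single global threshold are illustrative choices, not a method from the paper.

```python
def choose_threshold(confidences, correct, target_accuracy):
    """Lowest cut-off whose retained validation answers meet the target accuracy.

    Returns None when no cut-off reaches the target.
    """
    for t in sorted(set(confidences)):
        kept = [ok for p, ok in zip(confidences, correct) if p >= t]
        if kept and sum(kept) / len(kept) >= target_accuracy:
            return t
    return None


def flag_for_review(confidences, threshold):
    """Indices of predictions whose confidence falls below the cut-off."""
    return [i for i, p in enumerate(confidences) if p < threshold]
```

The threshold should be chosen on data held out from the items being triaged, and re-checked whenever the prompt or model version changes.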

Reproducibility

Data Urls

  • MedQA, PubMedQA, MedMCQA, MultiMedQA and MMLU are publicly available per Appendix A; USMLE materials are paid NBME content

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Focuses mainly on multiple-choice questions; does not evaluate full interactive USMLE case simulations quantitatively.
  • Evaluations use a text-only GPT-4; visual/multimodal performance not measured here.
  • Official USMLE live-exam items and scoring were not accessible; datasets come from sample/self-assessment materials.
  • MELD memorization detector has unknown recall; absence of detection is not proof of absence in training data.
  • Chain-of-thought and advanced prompting were tested only preliminarily; better prompts may change results.

When Not To Use

  • Do not use GPT-4 alone for high-stakes clinical decisions without expert verification.
  • Avoid relying on text-only GPT-4 for image-based diagnoses without a multimodal model or image inputs.
  • Not suitable as a drop-in replacement for clinicians, or as a proxy for certified exam performance, without regulatory and safety validation.

Failure Modes

  • Hallucinations: fluent but incorrect medical statements.
  • Overconfidence in some predictions despite improved calibration relative to GPT-3.5.
  • Degraded accuracy for questions that depend on unseen images.
  • Sensitivity to prompt wording; performance can vary with prompt design.
  • Alignment/safety fine-tuning can slightly reduce raw accuracy.

Core Entities

Models

  • GPT-4
  • GPT-4-base
  • GPT-3.5
  • ChatGPT
  • Flan-PaLM 540B
  • Med-PaLM
  • InstructGPT
  • Codex

Metrics

  • Accuracy
  • Calibration (predicted prob vs true rate)
  • Levenshtein overlap (MELD)

Datasets

  • USMLE Self Assessment
  • USMLE Sample Exam
  • MedQA
  • PubMedQA
  • MedMCQA
  • MMLU
  • MultiMedQA

Benchmarks

  • USMLE
  • MultiMedQA