GPT-4 exceeds USMLE pass threshold and outperforms prior models on medical benchmarks

March 20, 2023 · 9 min

Overview

Decision Snapshot: Needs Validation

The paper shows strong benchmark-level capabilities and better confidence calibration, but it focuses on multiple-choice tasks and does not replace validated clinical systems; expert oversight and further real-world evaluation are needed.

Citations: 497

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 30%

Novelty: 60%

Authors

Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, Eric Horvitz

Why It Matters For Business

GPT-4 answers medical multiple-choice questions with roughly 84–87% accuracy and gives better-calibrated confidence scores than GPT-3.5, making it a candidate for education, clinical-note drafting, and decision-support prototypes, provided its outputs are validated under human oversight.

Summary TLDR

This paper measures GPT-4 (text-only) on medical multiple-choice exams and benchmarks. Zero-shot GPT-4 averages ~84% on USMLE sample and self-assessment materials and ~86% with 5-shot prompts. Both figures are well above the typical passing threshold (~60%) and much higher than GPT-3.5 (~50–59%). GPT-4 is also better calibrated than GPT-3.5, answers many image-referencing questions correctly without seeing the images, and shows no evidence of memorizing official USMLE items under the authors' tests. The study focuses on MCQs, uses simple prompting, and emphasizes that real-world clinical use requires expert oversight and further evaluation.

Problem Statement

Measure out-of-the-box capabilities of text-only GPT-4 on medical competency exams (USMLE Steps 1–3) and MultiMedQA benchmarks, compare to GPT-3.5 and reported PaLM-family results, and probe calibration, memorization, and image-dependence using simple zero-shot and 5-shot prompts.
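To make the zero-shot versus 5-shot distinction concrete, here is a minimal sketch of how such prompts might be assembled. The template wording, the `format_item` and `build_prompt` helpers, and the placeholder questions are all hypothetical; the summary does not specify the authors' exact prompt format.

```python
from string import ascii_uppercase


def format_item(question, options, answer=None):
    """Render one multiple-choice item; leave the answer open when it is the target."""
    lines = [f"Question: {question}"]
    lines += [f"({letter}) {text}" for letter, text in zip(ascii_uppercase, options)]
    lines.append(f"Answer: ({answer})" if answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target, exemplars=(), k=0):
    """Build a k-shot prompt: k solved exemplars followed by the unsolved target."""
    if k > len(exemplars):
        raise ValueError("not enough exemplars for the requested k")
    blocks = [format_item(q, opts, ans) for q, opts, ans in exemplars[:k]]
    blocks.append(format_item(*target))
    return "\n\n".join(blocks)


# Placeholder content only; no real exam items are reproduced here.
options = ["option one", "option two", "option three", "option four"]
exemplars = [(f"Placeholder exemplar question {i}", options, "A") for i in range(1, 6)]
target = ("Placeholder target question", options)

zero_shot = build_prompt(target)                  # k = 0: target question only
five_shot = build_prompt(target, exemplars, k=5)  # five solved examples, then target
print(five_shot)
```

In this sketch the only difference between the two settings is the number of solved exemplars placed before the target question.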

Main Contribution

Comprehensive zero-shot and 5-shot evaluation of GPT-4 on USMLE Sample Exam and Self Assessments and on MultiMedQA components.

Quantified large gains over GPT-3.5: roughly 27 to 35 percentage points across the USMLE Sample Exam and Self Assessments.

Key Findings

GPT-4 strongly outperforms GPT-3.5 on USMLE-style multiple-choice tests.

Numbers: USMLE Self Assessment overall: GPT-4 83.76% (zero-shot) vs GPT-3.5 49.1%

Practical Use: GPT-4 delivers much higher MCQ accuracy than GPT-3.5; use GPT-4 for prototyping medical QA, but verify outputs with experts.

Evidence Ref: Table 1 (USMLE Self Assessment overall averages)

GPT-4 exceeds typical USMLE pass thresholds by a large margin.

Numbers: USMLE Sample Exam overall: GPT-4 84.31% (zero-shot); pass ~60%

Practical Use: GPT-4 can pass exam-style knowledge checks out-of-the-box; it may be useful for education and exam prep tools under supervision.

Evidence Ref: Table 2 and USMLE pass threshold statement

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | GPT-4 zero-shot 83.76%; GPT-4 5-shot 86.65% | GPT-3.5 zero-shot 49.1%; GPT-3.5 5-shot 53.61% | ≈+33–35 percentage points vs GPT-3.5 | USMLE Self Assessment (Steps 1–3) | Table 1 reports overall averages for GPT-4 and GPT-3.5 | Table 1 |
| Accuracy | GPT-4 zero-shot 84.31%; GPT-4 5-shot 86.7% | GPT-3.5 zero-shot 56.91%; GPT-3.5 5-shot 58.78% | ≈+25–30 percentage points vs GPT-3.5 | USMLE Sample Exam (Steps 1–3) | Table 2 reports sample-exam averages | Table 2 |
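The gaps between GPT-4 and GPT-3.5 can be recomputed directly from the accuracy figures in the Results rows. The short sketch below does only that arithmetic; the dictionary layout and the `deltas_pp` helper are illustrative names, not anything from the paper.

```python
# Accuracy figures (percent) copied from the Results rows above.
results = {
    "USMLE Self Assessment": {
        "zero-shot": {"gpt4": 83.76, "gpt35": 49.10},
        "5-shot": {"gpt4": 86.65, "gpt35": 53.61},
    },
    "USMLE Sample Exam": {
        "zero-shot": {"gpt4": 84.31, "gpt35": 56.91},
        "5-shot": {"gpt4": 86.70, "gpt35": 58.78},
    },
}


def deltas_pp(table):
    """Return GPT-4 minus GPT-3.5 accuracy, in percentage points, per dataset and setting."""
    return {
        dataset: {
            setting: round(scores["gpt4"] - scores["gpt35"], 2)
            for setting, scores in settings.items()
        }
        for dataset, settings in table.items()
    }


for dataset, by_setting in deltas_pp(results).items():
    for setting, delta in by_setting.items():
        print(f"{dataset} ({setting}): +{delta:.2f} pp")
```

These are differences in percentage points, not relative improvements; the gap is wider on the Self Assessment than on the Sample Exam because GPT-3.5 scores lower there.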

What To Try In 7 Days

Run GPT-4 on your internal medical QA or training questions to gauge accuracy and calibration vs domain experts.

Prototype an education assistant that explains answers and offers counterfactuals, with faculty review.

Measure GPT-4 confidence calibration on your datasets and flag low-confidence outputs for human review.
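For the calibration step, one common way to quantify the gap between stated confidence and observed accuracy is expected calibration error (ECE). The sketch below uses ECE as a stand-in; it is not necessarily the measure used in the paper. It assumes you already have one confidence value in [0, 1] and one correctness label per question. The bin count, the 0.7 review threshold, and the sample values are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to a bin; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(mean_conf - accuracy)
    return ece


def flag_for_review(confidences, threshold=0.7):
    """Return indices of predictions whose confidence falls below the review threshold."""
    return [i for i, conf in enumerate(confidences) if conf < threshold]


# Hypothetical confidences and grading outcomes for six questions.
confidences = [0.95, 0.90, 0.85, 0.60, 0.55, 0.30]
correct = [True, True, False, True, False, False]

print(f"ECE: {expected_calibration_error(confidences, correct):.3f}")
print("Send to human review:", flag_for_review(confidences))
```

An ECE near zero means stated confidence tracks observed accuracy. The review threshold is a policy choice that should be tuned on your own data rather than taken from this sketch.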

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Data URLs

MedQA, PubMedQA, MedMCQA, MultiMedQA and MMLU are publicly available per Appendix A; USMLE materials are paid NBME content

Risks & Boundaries

Limitations

Focuses mainly on multiple-choice questions; does not evaluate full interactive USMLE case simulations quantitatively.

Evaluations use a text-only GPT-4; visual/multimodal performance not measured here.

When Not To Use

Do not use GPT-4 alone for high-stakes clinical decisions without expert verification.

Avoid relying on text-only GPT-4 for image-based diagnoses without a multimodal model or image inputs.

Failure Modes

Hallucinations: fluent but incorrect medical statements.

Overconfidence in some predictions despite improved calibration relative to GPT-3.5.

Core Entities

Models

GPT-4, GPT-4-base, GPT-3.5, ChatGPT, Flan-PaLM 540B, Med-PaLM, InstructGPT, Codex

Metrics

Accuracy, Calibration (predicted probability vs true rate), Levenshtein overlap (MELD)
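As a rough illustration of the Levenshtein overlap metric, the sketch below scores how closely a model's continuation matches a reference text. It assumes a memorization probe that prompts the model with part of an item and compares its output against the held-back remainder. That setup, the helper names, and the 0.95 threshold are assumptions made for illustration; this summary does not confirm those details of MELD.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def overlap_ratio(generated, reference):
    """Similarity in [0, 1]: 1.0 means identical strings, 0.0 means nothing aligns."""
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(generated, reference) / longest


def looks_memorized(generated, reference, threshold=0.95):
    """Flag a near-verbatim reproduction; the threshold is a hypothetical default."""
    return overlap_ratio(generated, reference) >= threshold


print(overlap_ratio("placeholder continuation text", "placeholder continuation test"))
```

A high overlap ratio on held-back text would suggest the model has seen the item before; a low ratio does not by itself prove the item was absent from training data.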

Datasets

USMLE Self Assessment, USMLE Sample Exam, MedQA, PubMedQA, MedMCQA, MMLU, MultiMedQA

Benchmarks

USMLE, MultiMedQA