GPT-4 exceeds USMLE pass threshold and outperforms prior models on medical benchmarks

Overview

Decision Snapshot: Needs Validation

The paper shows strong benchmark-level capabilities and better confidence calibration, but it focuses on multiple-choice tasks and does not replace validated clinical systems; expert oversight and further real-world evaluation are needed.

Citations: 497

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 30%

Novelty: 60%

Authors

Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, Eric Horvitz

Links

Abstract / PDF / Data

Why It Matters For Business

GPT-4 can reliably answer medical multiple-choice questions and gives better-calibrated confidence scores than earlier models. That makes it useful for education, drafting clinical notes, and decision-support prototypes, provided its outputs are validated and used under human oversight.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder, Engineering Lead

Summary TLDR

This paper measures GPT-4 (text-only) on medical multiple-choice exams and benchmarks. Zero-shot GPT-4 averages ~84% on USMLE sample and self-assessment materials and ~86% with 5-shot prompts, well above the typical passing score (~60%) and much higher than GPT-3.5 (~50–59%). GPT-4 is also better calibrated than GPT-3.5, handles many image-referencing questions without seeing the images, and shows no evidence of memorizing official USMLE items under the authors' tests. The study focuses on MCQs, uses simple prompting, and emphasizes that real-world clinical use requires expert oversight and further evaluation.

Problem Statement

Measure out-of-the-box capabilities of text-only GPT-4 on medical competency exams (USMLE Steps 1–3) and MultiMedQA benchmarks, compare to GPT-3.5 and reported PaLM-family results, and probe calibration, memorization, and image-dependence using simple zero-shot and 5-shot prompts.
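The difference between the zero-shot and 5-shot settings can be illustrated with a minimal prompt-building sketch. The template below (the `Question:` / `Answer:` layout, the `MCQ` class, and the `build_prompt` helper) is hypothetical and is not the paper's actual prompt; it only shows that a few-shot prompt prepends solved exemplars to the same unanswered target question.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class MCQ:
    """A multiple-choice item (hypothetical container, not from the paper)."""
    question: str
    options: Dict[str, str]  # e.g. {"A": "...", "B": "..."}
    answer: str = ""         # gold letter; left empty for the target question


def format_item(item: MCQ, include_answer: bool) -> str:
    """Render one question, optionally followed by its gold answer letter."""
    lines = [f"Question: {item.question}"]
    lines += [f"({key}) {text}" for key, text in sorted(item.options.items())]
    lines.append(f"Answer: {item.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target: MCQ, exemplars: Sequence[MCQ] = ()) -> str:
    """Zero-shot when `exemplars` is empty; k-shot when it holds k solved items."""
    parts = [format_item(ex, include_answer=True) for ex in exemplars]
    parts.append(format_item(target, include_answer=False))
    return "\n\n".join(parts)
```

Calling `build_prompt(target)` gives the zero-shot prompt, while `build_prompt(target, five_solved_items)` gives a 5-shot prompt that ends with the same unanswered question.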

Main Contribution

Comprehensive zero-shot and 5-shot evaluation of GPT-4 on USMLE Sample Exam and Self Assessments and on MultiMedQA components.

Quantified large gains over GPT-3.5: roughly 27 to 35 percentage points on the USMLE materials, depending on the exam set and prompting setup.

Key Findings

GPT-4 strongly outperforms GPT-3.5 on USMLE-style multiple-choice tests.

Numbers: USMLE Self Assessment overall: GPT-4 83.76% (zero-shot) vs GPT-3.5 49.1%

Practical Use: You can get much higher MCQ accuracy with GPT-4 than with GPT-3.5; use GPT-4 for prototyping medical QA but verify outputs with experts.

Evidence Ref: Table 1 (USMLE Self Assessment overall averages)

GPT-4 exceeds typical USMLE pass thresholds by a large margin.

Numbers: USMLE Sample Exam overall: GPT-4 84.31% (zero-shot); pass ~60%

Practical Use: GPT-4 can pass exam-style knowledge checks out-of-the-box; it may be useful for education and exam prep tools under supervision.

Evidence Ref: Table 2 and USMLE pass threshold statement

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-4 zero-shot 83.76%; GPT-4 5-shot 86.65%	GPT-3.5 zero-shot 49.1%; GPT-3.5 5-shot 53.61%	≈+33–35 percentage points vs GPT-3.5	USMLE Self Assessment (Steps 1–3)	Table 1 reports overall averages for GPT-4 and GPT-3.5	Table 1
Accuracy	GPT-4 zero-shot 84.31%; GPT-4 5-shot 86.7%	GPT-3.5 zero-shot 56.91%; GPT-3.5 5-shot 58.78%	≈+27–28 percentage points vs GPT-3.5	USMLE Sample Exam (Steps 1–3)	Table 2 reports sample-exam averages	Table 2
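The deltas can be recomputed directly from the accuracies in the Value and Baseline columns. The short sketch below does that arithmetic; the `results` dictionary and the `deltas` helper are illustrative names, and the numbers are copied from the two rows above.

```python
# Accuracies (%) copied from the Results table: (GPT-4, GPT-3.5) per setting.
results = {
    "USMLE Self Assessment": {
        "zero-shot": (83.76, 49.10),
        "5-shot": (86.65, 53.61),
    },
    "USMLE Sample Exam": {
        "zero-shot": (84.31, 56.91),
        "5-shot": (86.70, 58.78),
    },
}


def deltas(table):
    """Return GPT-4 minus GPT-3.5 accuracy, in percentage points, per row."""
    out = {}
    for dataset, settings in table.items():
        for setting, (gpt4, gpt35) in settings.items():
            out[(dataset, setting)] = round(gpt4 - gpt35, 2)
    return out


if __name__ == "__main__":
    for (dataset, setting), gap in deltas(results).items():
        print(f"{dataset} ({setting}): +{gap:.2f} pp")
```

This yields gaps of 34.66 and 33.04 percentage points on the Self Assessment (zero-shot and 5-shot) and 27.40 and 27.92 on the Sample Exam.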

What To Try In 7 Days

Run GPT-4 on your internal medical QA or training questions to gauge accuracy and calibration vs domain experts.

Prototype an education assistant that explains answers and offers counterfactuals, with faculty review.

Measure GPT-4 confidence calibration on your datasets and flag low-confidence outputs for human review.
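One way to start on the calibration step is sketched below. It assumes you already have, for each question, a confidence value in [0, 1] for the model's chosen answer and a flag saying whether that answer was correct; how you obtain the confidence is outside this sketch. Expected calibration error (ECE) is used here as a common summary statistic, not as the paper's stated method, and the 0.7 review threshold is an arbitrary placeholder to tune on your own data.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean gap between average confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, c))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(p for p, _ in items) / len(items)
        accuracy = sum(1 for _, c in items if c) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece


def flag_for_review(confidences: Sequence[float], threshold: float = 0.7) -> List[int]:
    """Indices of answers whose confidence falls below the review threshold."""
    return [i for i, p in enumerate(confidences) if p < threshold]
```

A lower ECE means stated confidence tracks observed accuracy more closely. The flagged indices are the outputs to route to a human reviewer first.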

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Data URLs

MedQA, PubMedQA, MedMCQA, MultiMedQA and MMLU are publicly available per Appendix A; USMLE materials are paid NBME content

Risks & Boundaries

Limitations

Focuses mainly on multiple-choice questions; does not evaluate full interactive USMLE case simulations quantitatively.

Evaluations use a text-only GPT-4; visual/multimodal performance is not measured here.

When Not To Use

Do not use GPT-4 alone for high-stakes clinical decisions without expert verification.

Avoid relying on text-only GPT-4 for image-based diagnoses without a multimodal model or image inputs.

Failure Modes

Hallucinations: fluent but incorrect medical statements.

Overconfidence in some predictions despite improved calibration relative to GPT-3.5.

Core Entities

Models

GPT-4, GPT-4-base, GPT-3.5, ChatGPT, Flan-PaLM 540B, Med-PaLM, InstructGPT, Codex

Metrics

Accuracy, Calibration (predicted prob vs true rate), Levenshtein overlap (MELD)
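For readers unfamiliar with the Levenshtein overlap metric, the sketch below computes a normalized edit-distance similarity between two strings. This summary does not say exactly which texts MELD compares or what cutoff it applies, so the `looks_memorized` helper and its 0.95 default are assumptions for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), row by row."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution, free on a match
            ))
        prev = cur
    return prev[-1]


def overlap_ratio(generated: str, reference: str) -> float:
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(generated, reference) / longest


def looks_memorized(generated: str, reference: str, threshold: float = 0.95) -> bool:
    """Hypothetical check: flag near-verbatim reproduction of the reference."""
    return overlap_ratio(generated, reference) >= threshold
```

In a memorization probe of this kind, `reference` would be held-out source text and `generated` the model's continuation; a high overlap ratio suggests the model may have seen the item before.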

Datasets

USMLE Self Assessment, USMLE Sample Exam, MedQA, PubMedQA, MedMCQA, MMLU, MultiMedQA

Benchmarks

USMLE, MultiMedQA
