Overview
Production Readiness
0.3
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
497
Why It Matters For Business
GPT-4 can reliably answer medical multiple-choice questions and gives better-calibrated confidence scores than earlier models. That makes it useful for education, drafting clinical notes, and decision-support prototypes, provided that humans oversee and validate its output.
Summary TLDR
This paper measures GPT-4 (text-only) on medical multiple-choice exams and benchmarks. Zero-shot GPT-4 averages ~84% on USMLE sample and self-assessment materials and ~86% with 5-shot prompts, well above the typical passing score (~60%) and much higher than GPT-3.5 (~50–59%). GPT-4 is also better calibrated than GPT-3.5, answers many image-referencing questions correctly without seeing the images, and shows no evidence of memorizing official USMLE items under the authors' tests. The study focuses on MCQs, uses simple prompting, and emphasizes that real-world clinical use requires expert oversight and further evaluation.
Problem Statement
Measure out-of-the-box capabilities of text-only GPT-4 on medical competency exams (USMLE Steps 1–3) and MultiMedQA benchmarks, compare to GPT-3.5 and reported PaLM-family results, and probe calibration, memorization, and image-dependence using simple zero-shot and 5-shot prompts.
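To make the zero-shot versus 5-shot distinction concrete, here is a minimal sketch of how such multiple-choice prompts could be assembled. The `MCQ` class, the `build_prompt` helper, and the exact "Question / Answer" template are hypothetical illustrations, not the prompt format used in the paper.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class MCQ:
    """One multiple-choice item (hypothetical container, not from the paper)."""
    question: str
    options: Dict[str, str]        # option label -> option text
    answer: Optional[str] = None   # gold label; only needed for exemplars


def format_item(item: MCQ, include_answer: bool) -> str:
    """Render one item; exemplars show the gold label, the target leaves it open."""
    lines = [f"Question: {item.question}"]
    lines += [f"({label}) {text}" for label, text in item.options.items()]
    lines.append(f"Answer: {item.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target: MCQ, exemplars: Sequence[MCQ] = ()) -> str:
    """Zero-shot when `exemplars` is empty; k-shot when it holds k solved items."""
    blocks = [format_item(ex, include_answer=True) for ex in exemplars]
    blocks.append(format_item(target, include_answer=False))
    return "\n\n".join(blocks)
```

Calling `build_prompt(item)` yields a zero-shot prompt, and passing five solved items as `exemplars` yields a 5-shot prompt with the same layout.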
Main Contribution
Comprehensive zero-shot and 5-shot evaluation of GPT-4 on USMLE Sample Exam and Self Assessments and on MultiMedQA components.
Quantified large gains over GPT-3.5: roughly 30 or more percentage points on USMLE materials.
Assessment of calibration showing GPT-4's probabilities align much better with actual correctness than GPT-3.5.
Probed image dependence: text-only GPT-4 still answers many media-referenced questions correctly using reasoning.
Developed and used MELD (Levenshtein-based) heuristic to probe memorization; found no strong evidence of memorization for tested USMLE samples.
Compared GPT-4-base (pre-alignment) to the public aligned GPT-4 and observed a 3–5% raw performance drop after alignment.
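The Levenshtein-based memorization probe listed above can be illustrated with a rough sketch. It assumes a simple design: show the model the first half of a text, ask it to continue, and compare the continuation with the held-out second half by edit distance. The `generate` callable, the half-split, and the 0.95 threshold are assumptions made for illustration and may differ from the authors' MELD implementation.

```python
from typing import Callable


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via two-row DP."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized overlap in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def looks_memorized(text: str,
                    generate: Callable[[str], str],
                    threshold: float = 0.95) -> bool:
    """Flag `text` if the model's continuation nearly reproduces its second half.

    `generate` is a hypothetical stand-in for a model call that takes a
    prefix and returns a continuation; the threshold is an assumed default.
    """
    mid = len(text) // 2
    prefix, held_out = text[:mid], text[mid:]
    continuation = generate(prefix)[:len(held_out)]
    return similarity(continuation, held_out) >= threshold
```

A probe like this can only confirm memorization when it fires. A negative result does not show that the text was absent from training data.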
Key Findings
GPT-4 strongly outperforms GPT-3.5 on USMLE-style multiple-choice tests.
GPT-4 exceeds typical USMLE pass thresholds by a large margin.
GPT-4 is notably better calibrated than GPT-3.5.
Text-only GPT-4 still answers many image-referencing questions well.
Authors found no clear memorization of official USMLE items using their heuristic.
Alignment/safety fine-tuning modestly reduced raw accuracy.
Results
Accuracy
Performance on MultiMedQA components (examples)
Calibration at high confidence
MELD memorization checks
Who Should Care
What To Try In 7 Days
Run GPT-4 on your internal medical QA or training questions to gauge accuracy and calibration vs domain experts.
Prototype an education assistant that explains answers and offers counterfactuals, with faculty review.
Measure GPT-4 confidence calibration on your datasets and flag low-confidence outputs for human review.
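The calibration check and the low-confidence flagging suggested above could be prototyped along these lines. This is a sketch under stated assumptions: each answer comes with a confidence in [0, 1] (for example, the probability the model assigns to its chosen option), and the function names, bin count, and 0.8 review threshold are illustrative choices rather than values from the paper.

```python
from typing import List, Sequence, Tuple


def reliability_bins(confidences: Sequence[float],
                     correct: Sequence[bool],
                     n_bins: int = 10) -> List[Tuple[float, float, int]]:
    """Return (mean confidence, accuracy, count) for each non-empty bin."""
    buckets: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for p, ok in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        buckets[idx].append((p, ok))
    rows = []
    for bucket in buckets:
        if bucket:
            mean_conf = sum(p for p, _ in bucket) / len(bucket)
            accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
            rows.append((mean_conf, accuracy, len(bucket)))
    return rows


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Size-weighted average gap between stated confidence and accuracy."""
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one prediction")
    return sum(count / total * abs(conf - acc)
               for conf, acc, count in reliability_bins(confidences, correct, n_bins))


def flag_for_review(confidences: Sequence[float],
                    threshold: float = 0.8) -> List[int]:
    """Indices of answers whose confidence falls below the assumed threshold."""
    return [i for i, p in enumerate(confidences) if p < threshold]
```

A well-calibrated model shows bin accuracies close to bin confidences and a small error value. The flagged indices are the items to route to a human reviewer first.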
Reproducibility
Data Urls
- MedQA, PubMedQA, MedMCQA, MultiMedQA and MMLU are publicly available per Appendix A; USMLE materials are paid NBME content
Open Source Status
- partial
Risks & Boundaries
Limitations
- Focuses mainly on multiple-choice questions; does not evaluate full interactive USMLE case simulations quantitatively.
- Evaluations use a text-only GPT-4; visual/multimodal performance not measured here.
- Official USMLE live-exam items and scoring were not accessible; datasets come from sample/self-assessment materials.
- MELD memorization detector has unknown recall; absence of detection is not proof of absence in training data.
- Chain-of-thought and advanced prompting were tested only preliminarily; better prompts may change results.
When Not To Use
- Do not use GPT-4 alone for high-stakes clinical decisions without expert verification.
- Avoid relying on text-only GPT-4 for image-based diagnoses without a multimodal model or image inputs.
- Not suitable as a drop-in replacement for clinicians or certified exam performance without regulatory and safety validation.
Failure Modes
- Hallucinations: fluent but incorrect medical statements.
- Overconfidence in some predictions despite improved calibration relative to GPT-3.5.
- Degraded accuracy for questions that depend on unseen images.
- Sensitivity to prompt wording; performance can vary with prompt design.
- Alignment/safety fine-tuning can slightly reduce raw accuracy.
Core Entities
Models
- GPT-4
- GPT-4-base
- GPT-3.5
- ChatGPT
- Flan-PaLM 540B
- Med-PaLM
- InstructGPT
- Codex
Metrics
- Accuracy
- Calibration (predicted probability vs. observed accuracy)
- Levenshtein overlap (MELD)
Datasets
- USMLE Self Assessment
- USMLE Sample Exam
- MedQA
- PubMedQA
- MedMCQA
- MMLU
- MultiMedQA
Benchmarks
- USMLE
- MultiMedQA

