Overview
The paper shows strong benchmark-level capabilities and better confidence calibration, but it focuses on multiple-choice tasks and does not replace validated clinical systems; expert oversight and further real-world evaluation are needed.
Citations: 497
Evidence Strength: 0.70
Confidence: 0.90
Risk Signals: 13
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 6/6
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 30%
Novelty: 60%
Why It Matters For Business
GPT-4 can reliably answer medical multiple-choice questions and gives better-calibrated confidence scores than earlier models, making it useful for education, drafting clinical notes, and decision-support prototypes, provided there is human oversight and validation.
Summary TLDR
This paper measures text-only GPT-4 on medical multiple-choice exams and benchmarks. On the USMLE Sample Exam and Self Assessment materials, GPT-4 averages about 84% zero-shot and about 87% with 5-shot prompts, well above the typical passing threshold (~60%) and far higher than GPT-3.5 (~49–59%). GPT-4 is also better calibrated than GPT-3.5, handles many image-referencing questions even without access to the images, and shows no evidence of having memorized official USMLE items under the authors' tests. The study focuses on MCQs, uses simple prompting, and emphasizes that real-world clinical use requires expert oversight and further evaluation.
Problem Statement
Measure out-of-the-box capabilities of text-only GPT-4 on medical competency exams (USMLE Steps 1–3) and MultiMedQA benchmarks, compare to GPT-3.5 and reported PaLM-family results, and probe calibration, memorization, and image-dependence using simple zero-shot and 5-shot prompts.
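The difference between the zero-shot and 5-shot settings can be illustrated with a minimal sketch. The prompt layout, the `MCQ` class, and the `build_prompt` helper below are hypothetical and are not the paper's actual template; the only point is that a 5-shot prompt prepends five solved exemplars to the target question, while a zero-shot prompt contains the target question alone.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class MCQ:
    """A multiple-choice question (hypothetical container, not from the paper)."""
    question: str
    options: dict                  # option letter -> option text
    answer: Optional[str] = None   # gold letter, only needed for exemplars


def format_item(item: MCQ, include_answer: bool) -> str:
    """Render one question; exemplars show the gold answer, the target does not."""
    lines = [f"Question: {item.question}"]
    lines += [f"({letter}) {text}" for letter, text in sorted(item.options.items())]
    lines.append(f"Answer: ({item.answer})" if include_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target: MCQ, exemplars: Sequence[MCQ] = ()) -> str:
    """Zero-shot when `exemplars` is empty; k-shot when it holds k solved items."""
    blocks = [format_item(ex, include_answer=True) for ex in exemplars]
    blocks.append(format_item(target, include_answer=False))
    return "\n\n".join(blocks)
```

Calling `build_prompt(target)` yields the zero-shot form, and `build_prompt(target, five_solved_items)` yields the 5-shot form; the returned string would then be sent to whichever model is being evaluated.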
Main Contribution
Comprehensive zero-shot and 5-shot evaluation of GPT-4 on USMLE Sample Exam and Self Assessments and on MultiMedQA components.
Quantified large gains over GPT-3.5: roughly 27 to 35 percentage points on the USMLE materials, depending on the dataset and prompt setting.
Key Findings
GPT-4 strongly outperforms GPT-3.5 on USMLE-style multiple-choice tests.
GPT-4 exceeds typical USMLE pass thresholds by a large margin.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | GPT-4 zero-shot 83.76%; GPT-4 5-shot 86.65% | GPT-3.5 zero-shot 49.1%; GPT-3.5 5-shot 53.61% | ≈+33–35 percentage points vs GPT-3.5 | USMLE Self Assessment (Steps 1–3) | Table 1 reports overall averages for GPT-4 and GPT-3.5 | Table 1 |
| Accuracy | GPT-4 zero-shot 84.31%; GPT-4 5-shot 86.7% | GPT-3.5 zero-shot 56.91%; GPT-3.5 5-shot 58.78% | ≈+27–28 percentage points vs GPT-3.5 | USMLE Sample Exam (Steps 1–3) | Table 2 reports sample-exam averages | Table 2 |
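The Delta column can be checked directly against the Value and Baseline columns. The short sketch below recomputes the GPT-4 minus GPT-3.5 difference for each dataset and prompt setting; the accuracy figures are copied from the table above, while the `RESULTS` layout and the `accuracy_deltas` helper are illustrative names only.

```python
# Accuracy (%) as (GPT-4, GPT-3.5) pairs, copied from the results table.
RESULTS = {
    "USMLE Self Assessment": {"zero-shot": (83.76, 49.10), "5-shot": (86.65, 53.61)},
    "USMLE Sample Exam": {"zero-shot": (84.31, 56.91), "5-shot": (86.70, 58.78)},
}


def accuracy_deltas(results):
    """Return GPT-4 minus GPT-3.5 accuracy, in percentage points, per setting."""
    return {
        dataset: {
            setting: round(gpt4 - gpt35, 2)
            for setting, (gpt4, gpt35) in settings.items()
        }
        for dataset, settings in results.items()
    }


if __name__ == "__main__":
    for dataset, settings in accuracy_deltas(RESULTS).items():
        for setting, delta in settings.items():
            print(f"{dataset:<22} {setting:<9} +{delta:.2f} pp")
```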
What To Try In 7 Days
Run GPT-4 on your internal medical QA or training questions to gauge accuracy and calibration vs domain experts.
Prototype an education assistant that explains answers and offers counterfactuals, with faculty review.
Measure GPT-4 confidence calibration on your datasets and flag low-confidence outputs for human review.
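For the calibration item above, one possible starting point is expected calibration error (ECE), a common binned measure of the gap between stated confidence and observed accuracy. This summary does not say which calibration metric the paper uses, so ECE is an assumption here, as are the function names and the 0.7 review threshold. The sketch also assumes you already have one confidence value in [0, 1] per answer; how that value is obtained from the model is left open.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |avg confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 goes into the last bin rather than overflowing.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


def flag_for_review(confidences, threshold=0.7):
    """Return indices of answers whose confidence falls below the (hypothetical) threshold."""
    return [i for i, conf in enumerate(confidences) if conf < threshold]
```

A lower ECE means confidence tracks accuracy more closely. The threshold for human review is a policy choice and would need to be tuned on your own data against the cost of missed errors.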
Risks & Boundaries
Limitations
Focuses mainly on multiple-choice questions; does not evaluate full interactive USMLE case simulations quantitatively.
Evaluations use a text-only GPT-4; visual/multimodal performance not measured here.
When Not To Use
Do not use GPT-4 alone for high-stakes clinical decisions without expert verification.
Avoid relying on text-only GPT-4 for image-based diagnoses without a multimodal model or image inputs.
Failure Modes
Hallucinations: fluent but incorrect medical statements.
Overconfidence in some predictions despite improved calibration relative to GPT-3.5.

