GPT-4 can pass Japan's medical licensing exam but shows costly localization and safety gaps

March 31, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

GPT-4's errors track human difficulty (it struggles on the questions students find hard), but it lacks reliable country-specific legal and safety judgment without extra checks.

Citations: 50

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 50%

Novelty: 40%

Authors

Jungo Kasai, Yuhei Kasai, Keisuke Sakaguchi, Yutaro Yamada, Dragomir Radev

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs can reach exam-level MCQ performance in specialized, non-English domains, but they need localization and safety filters because of country-specific legal differences, and a larger budget because non-English text tokenizes less efficiently.

Who Should Care

Summary TLDR

The authors build IGAKU QA, a benchmark of Japanese national medical licensing exam questions (2018–2023). They test GPT-3 (text-davinci-003), ChatGPT (gpt-3.5-turbo), and GPT-4 in a closed-book setting. GPT-4 passes all six years of the exam but lags behind the student-majority baseline and sometimes recommends prohibited medical actions. Japanese input uses roughly twice as many tokens as English, which raises API cost and shrinks the effective context window. The authors release the dataset, model outputs, and metadata.

Problem Statement

Can current black-box LLM APIs answer high-stakes, country-specific medical multiple-choice questions written in Japanese? The paper measures accuracy, prohibited-choice selection (illegal/unsafe actions), and practical costs (tokenization and API usage) on real national exam data.

Main Contribution

Created and released IGAKU QA: Japanese medical licensing exam questions and metadata (2018–2023).

Benchmarked GPT-3 (text-davinci-003), ChatGPT (gpt-3.5-turbo), and GPT-4 in closed-book settings on the dataset.

Key Findings

GPT-4 passes all six years of the Japanese medical licensing exam (2018–2023) in closed-book multiple-choice format.

Numbers: 2018: required 161, general 221 (passing thresholds 160/208); Table 1

Practical Use: GPT-4 clears the exam's passing thresholds on Japanese MCQs, so it can be used for educational QA prototypes, but it requires extra checks before clinical use.

Evidence Ref: Table 1, §3.2

GPT-4 substantially underperforms the student-majority baseline on the same exams.

Numbers: 2018 totals: GPT-4 ~382 vs student majority ~472 (Table 1)

Practical Use: Do not treat GPT-4 as a substitute for experienced students or clinicians; use it as an assistant with human review.

Evidence Ref: Table 1, §3.2
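To make the student-majority comparison concrete, the sketch below scores a baseline that always picks the choice most students selected. This is an assumption about how such a baseline works, not the authors' code: the field names (`selection_rates`, `answer`, `points`) and the three sample questions are hypothetical and are not taken from IGAKU QA.

```python
def majority_choice(selection_rates):
    """Return the choice label that the largest share of students selected."""
    return max(selection_rates, key=selection_rates.get)


def majority_baseline_score(questions):
    """Sum the points of every question where the majority choice is correct."""
    return sum(
        q["points"]
        for q in questions
        if majority_choice(q["selection_rates"]) == q["answer"]
    )


# Hypothetical questions; rates, answers, and point weights are made up.
sample_questions = [
    {"selection_rates": {"a": 0.1, "b": 0.8, "c": 0.1}, "answer": "b", "points": 1},
    {"selection_rates": {"a": 0.6, "b": 0.4}, "answer": "b", "points": 3},
    {"selection_rates": {"d": 0.9, "e": 0.1}, "answer": "d", "points": 3},
]

baseline_score = majority_baseline_score(sample_questions)
```

Scoring a model's answers with the same point weights gives a directly comparable total, which is how a gap such as 382 vs 472 can be read.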

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Passing exam (per year) | GPT-4 passed all years 2018–2023; GPT-3/ChatGPT did not | Student majority baseline | GPT-4 < student majority (e.g., 2018: 382 vs 472) | IGAKU QA (NMPQE 2018–2023) | Table 1: GPT-4 required/general scores meet passing thresholds across years | Table 1, §3.2 |
| Prohibited choices selected | ChatGPT sometimes exceeded 3 prohibited choices; GPT-4 selected 0–1 | Allowed ≤3 prohibited choices to pass | ChatGPT violated safety filter in multiple years; GPT-4 mostly complied | IGAKU QA (NMPQE 2018–2022) | Table 1 rows for P. (prohibited choices); Fig. 1 & Fig. 6 examples | Table 1, §3.3 |
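The two rows above combine into a single pass decision: both section scores must reach their thresholds, and no more than three prohibited choices may be selected. The sketch below encodes that rule with the 2018 figures quoted in this summary (required 161 vs 160, general 221 vs 208). It assumes that meeting a threshold exactly counts as passing. The prohibited-choice count of 0 is an illustrative value within the reported 0–1 range, and `ExamResult` and `passes_exam` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExamResult:
    required_score: int       # points on the required section
    general_score: int        # points on the general section
    prohibited_selected: int  # number of prohibited choices picked


def passes_exam(result, required_pass, general_pass, max_prohibited=3):
    """Pass only if both thresholds are met and prohibited picks stay within the limit."""
    return (
        result.required_score >= required_pass
        and result.general_score >= general_pass
        and result.prohibited_selected <= max_prohibited
    )


# 2018 scores from the summary; the prohibited count is an assumed value.
gpt4_2018 = ExamResult(required_score=161, general_score=221, prohibited_selected=0)
gpt4_2018_passes = passes_exam(gpt4_2018, required_pass=160, general_pass=208)

# Same scores with four prohibited picks would fail on the safety rule alone.
unsafe_variant = ExamResult(required_score=161, general_score=221, prohibited_selected=4)
unsafe_variant_passes = passes_exam(unsafe_variant, required_pass=160, general_pass=208)
```

Keeping the prohibited-choice limit as a separate condition makes it visible when a model clears the score thresholds but still fails on safety.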

What To Try In 7 Days

Run GPT-4 on a small subset of your country's domain-specific MCQs to compare to local experts.

Measure real token usage and API cost for your non-Latin text vs English translations.

Add a rules-based filter for prohibited/legal actions and test detection on model outputs.
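A rules-based filter for the third step can start as a simple set intersection between what the model chose and a per-question list of prohibited choices. The sketch below is one minimal way to do that, assuming answers and prohibited choices are both keyed by question ID. The IDs and choice labels are hypothetical, and a real deployment would need a prohibited list curated by local experts.

```python
def find_prohibited_selections(answers, prohibited):
    """Map each question ID to the prohibited choices the model selected.

    answers:    dict of question ID -> iterable of selected choice labels
    prohibited: dict of question ID -> set of prohibited choice labels
    Questions with no prohibited selection are omitted from the result.
    """
    flagged = {}
    for question_id, chosen in answers.items():
        hits = set(chosen) & prohibited.get(question_id, set())
        if hits:
            flagged[question_id] = sorted(hits)
    return flagged


# Hypothetical model outputs and expert-curated prohibited choices.
model_answers = {"q1": ["a"], "q2": ["c"], "q3": ["b", "e"]}
prohibited_choices = {"q2": {"c"}, "q3": {"d"}}

flagged = find_prohibited_selections(model_answers, prohibited_choices)
prohibited_count = sum(len(hits) for hits in flagged.values())
```

The resulting count can be compared against whatever limit applies locally, and the flagged question IDs give reviewers a short list to inspect by hand.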

Optimization Features

Token Efficiency
vocabulary swapping (suggested)
translate-then-run (ChatGPT-EN improved performance)
Infra Optimization
expect doubled token/bandwidth costs for Japanese inputs
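A back-of-the-envelope calculation shows what a doubled token count means for budget and context. The sketch below assumes a token ratio of 2.0, matching the rough figure in this summary. The price per 1,000 tokens and the 8,192-token context limit are placeholder values, not quoted rates, so substitute your provider's actual numbers.

```python
import math


def estimate_japanese_cost(english_tokens, token_ratio=2.0, price_per_1k=0.03):
    """Estimate token count and cost for the Japanese version of an English prompt.

    token_ratio and price_per_1k are assumptions; measure them for your own data.
    """
    japanese_tokens = math.ceil(english_tokens * token_ratio)
    cost = japanese_tokens / 1000 * price_per_1k
    return japanese_tokens, cost


def effective_context(context_limit, token_ratio=2.0):
    """Context size expressed in English-equivalent tokens."""
    return int(context_limit / token_ratio)


japanese_tokens, japanese_cost = estimate_japanese_cost(english_tokens=500)
english_equivalent_context = effective_context(context_limit=8192)
```

With these placeholder inputs, a 500-token English prompt becomes about 1,000 tokens in Japanese, and an 8,192-token window holds only about 4,096 tokens' worth of English-equivalent content.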

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Black-box APIs: results may change with model updates and are not fully reproducible.

Potential data leakage: the models' training data is unknown; the authors attempt to mitigate this by including the 2023 exam, but the risk remains.

When Not To Use

Do not use raw LLM outputs for clinical decisions or legal advice without expert review.

Avoid relying on closed-book LLMs where up-to-date local regulations or images are required.

Failure Modes

Selecting prohibited/illegal medical choices (safety/legal errors).

Tokenization causing shortened context and truncated prompts for long inputs.

Core Entities

Models

GPT-4, ChatGPT (gpt-3.5-turbo), GPT-3 (text-davinci-003)

Metrics

Accuracy, passing criteria (required/general sections), prohibited-choice count, token usage / token cost

Datasets

IGAKU QA (Japanese NMPQE 2018–2023)

Benchmarks

Japanese National Medical Practitioners Qualifying Examination (NMPQE)