GPT-4 can pass Japan's medical licensing exam but shows costly localization and safety gaps

Overview

Decision Snapshot: Ready For Pilot

GPT-4's errors track human difficulty (it struggles on the questions that students also find hard), but it lacks reliable country-specific legal and safety judgment unless extra checks are added.

Citations: 50

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 50%

Novelty: 40%

Authors

Jungo Kasai, Yuhei Kasai, Keisuke Sakaguchi, Yutaro Yamada, Dragomir Radev

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs can meet exam-level MCQ performance in non-English, specialized domains but need localization, safety filters, and higher budget due to tokenization and legal differences.

Who Should Care

Product Manager, CTO, ML Engineer, Data Scientist, Founder

Summary TLDR

The authors build IGAKU QA, a benchmark of Japanese national medical licensing exam questions (2018–2023). They test GPT-3 (text-davinci-003), ChatGPT (gpt-3.5-turbo), and GPT-4 in a closed-book setting. GPT-4 passes all six years of the exam but lags behind the student majority baseline and sometimes recommends prohibited medical actions. Japanese input uses ~2x tokens vs English, raising API cost and shrinking effective context. The authors release the dataset, model outputs, and metadata.

Problem Statement

Can current black-box LLM APIs answer high-stakes, country-specific medical multiple-choice questions written in Japanese? The paper measures accuracy, prohibited-choice selection (illegal/unsafe actions), and practical costs (tokenization and API usage) on real national exam data.

Main Contribution

Created and released IGAKU QA: Japanese medical licensing exam questions and metadata (2018–2023).

Benchmarked GPT-3 (text-davinci-003), ChatGPT (gpt-3.5-turbo), and GPT-4 in closed-book settings on the dataset.

Key Findings

GPT-4 passes all six years of the Japanese medical licensing exam (2018–2023) in closed-book multiple-choice format.

Numbers: 2018: required 161, general 221 (passing 160/208); Table 1

Practical Use: GPT-4 clears the exam's passing thresholds on Japanese MCQs, so it can be used for educational QA prototypes, but it requires extra checks before clinical use.

Evidence Ref: Table 1, §3.2
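The pass/fail rule behind these numbers can be sketched as a small check. This is a minimal sketch, assuming the 2018 thresholds quoted in this summary (160 required, 208 general) and the limit of 3 prohibited choices given in the Results table. The `PassingCriteria` class, the `passes_exam` function, and the prohibited count of 1 used in the example are illustrative, not taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassingCriteria:
    required_min: int    # minimum score on the required section
    general_min: int     # minimum score on the general section
    max_prohibited: int  # prohibited choices tolerated before failing


def passes_exam(required: int, general: int, prohibited: int,
                criteria: PassingCriteria) -> bool:
    """Return True only if all three criteria are met at once."""
    return (
        required >= criteria.required_min
        and general >= criteria.general_min
        and prohibited <= criteria.max_prohibited
    )


# 2018 thresholds as quoted in this summary; other years differ
# and are not listed here.
CRITERIA_2018 = PassingCriteria(required_min=160, general_min=208,
                                max_prohibited=3)

# GPT-4's 2018 section scores from the summary. The prohibited count
# of 1 is an assumed value within the reported 0-1 range.
print(passes_exam(required=161, general=221, prohibited=1,
                  criteria=CRITERIA_2018))
```

Note how narrow the margin is on the required section: one fewer point there would fail the check regardless of the general-section score.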

GPT-4 substantially underperforms the student-majority baseline on the same exams.

Numbers: 2018 totals: GPT-4 ~382 vs student majority ~472 (Table 1)

Practical Use: Do not treat GPT-4 as a substitute for experienced students or clinicians; use it as an assistant with human review.

Evidence Ref: Table 1, §3.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Passing exam (per year)	GPT-4 passed all years 2018–2023; GPT-3/ChatGPT did not	Student majority baseline	GPT-4 < student majority (e.g., 2018: 382 vs 472)	IGAKU QA (NMPQE 2018–2023)	Table 1: GPT-4 required/general scores meet passing thresholds across years	Table 1, §3.2
Prohibited choices selected	ChatGPT sometimes exceeded 3 prohibited choices; GPT-4 selected 0–1	Allowed ≤3 prohibited choices to pass	ChatGPT violated safety filter in multiple years; GPT-4 mostly complied	IGAKU QA (NMPQE 2018–2022)	Table 1 rows for P. (prohibited choices); Fig.1 & Fig.6 examples	Table 1, §3.3

What To Try In 7 Days

Run GPT-4 on a small subset of your country's domain-specific MCQs to compare to local experts.

Measure real token usage and API cost for your non-Latin text vs English translations.

Add a rules-based filter for prohibited/legal actions and test detection on model outputs.
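The third step, a rules-based filter plus a detection test, could start from something like the following. This is a rough sketch under stated assumptions: the patterns in `PROHIBITED_PATTERNS` are hypothetical placeholders (a real list would have to come from local clinical and legal experts), and `flag_prohibited` and `detection_rate` are illustrative helper names, not part of IGAKU QA.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical patterns for illustration only. A production list must be
# written and reviewed by local clinical and legal experts.
PROHIBITED_PATTERNS = [
    r"\beuthanasia\b",
    r"\bwithout (?:informed )?consent\b",
]


def flag_prohibited(text: str,
                    patterns: Iterable[str] = PROHIBITED_PATTERNS) -> List[str]:
    """Return every pattern that matches the model output (case-insensitive)."""
    return [p for p in patterns if re.search(p, text, flags=re.IGNORECASE)]


def detection_rate(labeled_outputs: List[Tuple[str, bool]]) -> float:
    """Recall of the filter: share of truly prohibited outputs it flags."""
    positives = [text for text, is_prohibited in labeled_outputs if is_prohibited]
    if not positives:
        return 0.0
    caught = sum(1 for text in positives if flag_prohibited(text))
    return caught / len(positives)


# Tiny hand-labeled sample: (model output, expert says it is prohibited).
samples = [
    ("Recommend euthanasia for the patient.", True),
    ("Proceed without consent.", True),
    ("Disclose the records to the employer.", True),  # not covered by any rule
    ("Administer IV fluids.", False),
]
print(flag_prohibited(samples[0][0]))
print(detection_rate(samples))
```

Keyword rules like these miss anything phrased differently from the pattern list, so the measured detection rate is best treated as a floor to improve on, with expert review remaining in the loop.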

Optimization Features

Token Efficiency

vocabulary swapping (suggested)

translate-then-run (ChatGPT-EN improved performance)

Infra Optimization

expect doubled token/bandwidth costs for Japanese inputs
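For budgeting, the doubling can be turned into a back-of-the-envelope estimate. The sketch below assumes the roughly 2x Japanese-to-English token ratio reported in the summary; the price per 1,000 tokens, the context window size, and the function names `estimate_cost_usd` and `effective_context` are hypothetical values and names chosen for illustration. Measured token counts from your own data should replace the default multiplier.

```python
def estimate_cost_usd(english_tokens: int,
                      price_per_1k_tokens: float,
                      token_multiplier: float = 2.0) -> float:
    """Cost of sending the same content in a language that needs
    `token_multiplier` times as many tokens as English."""
    return english_tokens * token_multiplier / 1000 * price_per_1k_tokens


def effective_context(context_window_tokens: int,
                      token_multiplier: float = 2.0) -> int:
    """Context window expressed in English-equivalent tokens."""
    return int(context_window_tokens / token_multiplier)


# Hypothetical workload: 1M English-equivalent tokens at an assumed
# price of $0.03 per 1K tokens, with an assumed 8,192-token window.
english_cost = estimate_cost_usd(1_000_000, 0.03, token_multiplier=1.0)
japanese_cost = estimate_cost_usd(1_000_000, 0.03)
print(f"English: ${english_cost:.2f}, Japanese: ${japanese_cost:.2f}")
print(effective_context(8192))
```

The same multiplier that raises cost also halves the amount of content that fits in a prompt, which is why long inputs are more likely to be truncated in Japanese.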

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/jungokasai/IgakuQA

Data URLs

https://github.com/jungokasai/IgakuQA

Risks & Boundaries

Limitations

Black-box APIs: results may change with model updates and are not fully reproducible.

Potential data leakage: the models' training data is unknown; the authors attempt mitigation with the 2023 exam, but the risk remains.

When Not To Use

Do not use raw LLM outputs for clinical decisions or legal advice without expert review.

Avoid relying on closed-book LLMs where up-to-date local regulations or images are required.

Failure Modes

Selecting prohibited/illegal medical choices (safety/legal errors).

Tokenization causing shortened context and truncated prompts for long inputs.

Core Entities

Models

GPT-4, ChatGPT (gpt-3.5-turbo), GPT-3 (text-davinci-003)

Metrics

Accuracy, passing criteria (required/general sections), prohibited-choice count, token usage / token cost

Datasets

IGAKU QA (Japanese NMPQE 2018–2023)

Benchmarks

Japanese National Medical Practitioners Qualifying Examination (NMPQE)
