Overview
GPT-4's errors track human difficulty (it struggles on the questions students also find hard), but it lacks reliable country-specific legal and safety judgment unless extra checks are added.
Citations: 50
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 50%
Novelty: 40%
Why It Matters For Business
LLMs can reach exam-level multiple-choice performance in specialized, non-English domains, but deployments need localization and safety filters to handle legal differences, plus a larger budget because non-English text tokenizes less efficiently.
Who Should Care
Summary TLDR
The authors build IGAKU QA, a benchmark of Japanese national medical licensing exam questions (2018–2023). They test GPT-3 (text-davinci-003), ChatGPT (gpt-3.5-turbo), and GPT-4 in a closed-book setting. GPT-4 passes all six years of the exam but lags behind the student majority baseline and sometimes recommends prohibited medical actions. Japanese input uses ~2x tokens vs English, raising API cost and shrinking effective context. The authors release the dataset, model outputs, and metadata.
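The cost and context claims reduce to simple arithmetic. A minimal sketch, assuming per-token billing and a fixed context window; the token counts, price, and window size below are hypothetical placeholders, not figures from the paper:

```python
def token_overhead(tokens_source: int, tokens_english: int) -> float:
    """Tokens needed for the source-language text relative to its English translation."""
    if tokens_english <= 0:
        raise ValueError("tokens_english must be positive")
    return tokens_source / tokens_english


def effective_context(window_tokens: int, overhead: float) -> float:
    """Context window expressed in English-equivalent tokens."""
    if overhead <= 0:
        raise ValueError("overhead must be positive")
    return window_tokens / overhead


def prompt_cost(tokens: int, usd_per_1k_tokens: float) -> float:
    """Cost of a prompt under simple per-token billing."""
    return tokens / 1000 * usd_per_1k_tokens


# Hypothetical measurements for one question and its English translation.
ja_tokens, en_tokens = 600, 300
ratio = token_overhead(ja_tokens, en_tokens)   # 2.0, matching the ~2x claim
window = effective_context(8000, ratio)        # assumed 8000-token window
ja_cost = prompt_cost(ja_tokens, 0.03)         # assumed price per 1k tokens
en_cost = prompt_cost(en_tokens, 0.03)
print(ratio, window, ja_cost, en_cost)
```

In practice the two token counts would come from running the provider's own tokenizer on the same question in both languages.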
Problem Statement
Can current black-box LLM APIs answer high-stakes, country-specific medical multiple-choice questions written in Japanese? The paper measures accuracy, prohibited-choice selection (illegal/unsafe actions), and practical costs (tokenization and API usage) on real national exam data.
Main Contribution
Created and released IGAKU QA: Japanese medical licensing exam questions and metadata (2018–2023).
Benchmarked GPT-3 (text-davinci-003), ChatGPT (gpt-3.5-turbo), and GPT-4 in closed-book settings on the dataset.
Key Findings
GPT-4 passes all six years of the Japanese medical licensing exam (2018–2023) in closed-book multiple-choice format.
GPT-4 substantially underperforms the student-majority baseline on the same exams.
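One way to read the student-majority comparison is as a baseline that always picks the option most students chose. A minimal sketch, assuming the metadata records a per-option student selection rate (the field names `gold`, `rates`, and `model` are hypothetical) and using unweighted accuracy rather than the exam's actual point scheme:

```python
from typing import Dict, List


def majority_answer(selection_rates: Dict[str, float]) -> str:
    """Return the option chosen by the largest share of students."""
    return max(selection_rates, key=selection_rates.get)


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of predictions that match the gold answers."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical records; real data would come from the released metadata.
questions = [
    {"gold": "a", "rates": {"a": 0.7, "b": 0.2, "c": 0.1}, "model": "a"},
    {"gold": "c", "rates": {"a": 0.1, "b": 0.3, "c": 0.6}, "model": "b"},
    {"gold": "b", "rates": {"a": 0.5, "b": 0.4, "c": 0.1}, "model": "a"},
]

gold = [q["gold"] for q in questions]
baseline_acc = accuracy([majority_answer(q["rates"]) for q in questions], gold)
model_acc = accuracy([q["model"] for q in questions], gold)
print(f"baseline={baseline_acc:.2f} model={model_acc:.2f}")
```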
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Passing exam (per year) | GPT-4 passed all years 2018–2023; GPT-3/ChatGPT did not | Student majority baseline | GPT-4 < student majority (e.g., 2018: 382 vs 472) | IGAKU QA (NMPQE 2018–2023) | Table 1: GPT-4 required/general scores meet passing thresholds across years | Table 1, §3.2 |
| Prohibited choices selected | ChatGPT sometimes exceeded 3 prohibited choices; GPT-4 selected 0–1 | Allowed ≤3 prohibited choices to pass | ChatGPT exceeded the prohibited-choice limit in multiple years; GPT-4 mostly complied | IGAKU QA (NMPQE 2018–2022) | Table 1 rows for P. (prohibited choices); Fig.1 & Fig.6 examples | Table 1, §3.3 |
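
The two rows above combine into a single pass rule: both section scores must meet their thresholds, and at most 3 prohibited choices may be selected. A minimal sketch of that rule; the threshold and score values are hypothetical placeholders, since the actual per-year thresholds are not listed here:

```python
from dataclasses import dataclass

MAX_PROHIBITED = 3  # from the table: at most 3 prohibited choices are allowed


@dataclass
class ExamResult:
    required_score: int        # points on the required section
    general_score: int         # points on the general section
    prohibited_selected: int   # number of prohibited choices picked


def passes(result: ExamResult,
           required_threshold: int,
           general_threshold: int,
           max_prohibited: int = MAX_PROHIBITED) -> bool:
    """Pass only if both section thresholds are met and the prohibited limit holds."""
    return (result.required_score >= required_threshold
            and result.general_score >= general_threshold
            and result.prohibited_selected <= max_prohibited)


# Hypothetical results and thresholds, for illustration only.
careful = ExamResult(required_score=170, general_score=220, prohibited_selected=1)
unsafe = ExamResult(required_score=170, general_score=220, prohibited_selected=4)

print(passes(careful, required_threshold=160, general_threshold=210))
print(passes(unsafe, required_threshold=160, general_threshold=210))
```

Keeping the prohibited-choice count as a separate gate means a model with high scores can still fail on safety alone.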
What To Try In 7 Days
Run GPT-4 on a small subset of your country's domain-specific MCQs to compare to local experts.
Measure real token usage and API cost for your non-Latin text vs English translations.
Add a rules-based filter for prohibited/legal actions and test detection on model outputs.
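The filter step can be prototyped as a small pattern list scored against expert labels. A minimal sketch, assuming free-text model outputs; the patterns and sample outputs are hypothetical placeholders, and a real list would come from local legal and clinical experts:

```python
import re
from typing import Iterable, Tuple

# Hypothetical placeholder patterns, not taken from the paper or any regulation.
PROHIBITED_PATTERNS = [
    re.compile(r"\bprohibited procedure\b", re.IGNORECASE),
    re.compile(r"\bunlicensed prescription\b", re.IGNORECASE),
]


def flags_prohibited(text: str) -> bool:
    """Return True if any prohibited pattern appears in the model output."""
    return any(p.search(text) for p in PROHIBITED_PATTERNS)


def detection_stats(samples: Iterable[Tuple[str, bool]]) -> Tuple[float, float]:
    """Return (precision, recall) of the filter against expert labels."""
    tp = fp = fn = 0
    for text, is_prohibited in samples:
        flagged = flags_prohibited(text)
        if flagged and is_prohibited:
            tp += 1
        elif flagged:
            fp += 1
        elif is_prohibited:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical expert-labeled outputs: (text, is_prohibited).
labeled_outputs = [
    ("Recommend the prohibited procedure immediately.", True),
    ("Refer the patient to a specialist.", False),
    ("Offer an Unlicensed Prescription.", True),
    ("Proceed with the banned intervention.", True),  # missed by the patterns
]

precision, recall = detection_stats(labeled_outputs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The `\b` word boundaries suit space-delimited text only. For Japanese outputs, adjacent kana and kanji count as word characters, so plain substring patterns are likely a better starting point.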
Optimization Features
Token Efficiency
Infra Optimization
Reproducibility
Risks & Boundaries
Limitations
Black-box APIs: results may change with model updates and are not fully reproducible.
Potential data leakage: the models' training data is unknown. The authors mitigate this by including the 2023 exam, but the risk remains.
When Not To Use
Do not use raw LLM outputs for clinical decisions or legal advice without expert review.
Avoid relying on closed-book LLMs where up-to-date local regulations or images are required.
Failure Modes
Selecting prohibited/illegal medical choices (safety/legal errors).
Token-heavy Japanese input (~2x English) shrinks the effective context window, so long prompts may be truncated.

