Overview
M3Exam provides real exam questions that expose gaps in an LLM's language and visual reasoning; use it to prioritize model fixes and domain-specific retraining.
Citations: 31
Evidence Strength: 0.90
Confidence: 0.90
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Yes
License: CC BY-NC-SA
At A Glance
Cost impact: 40%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
M3Exam reveals real-world gaps in multilingual and multimodal LLMs: expect failures on low-resource languages and complex images, so validate models on representative data before deployment.
Summary TLDR
M3Exam is a 12,317-question benchmark built from real, official school exams across 9 languages and three school levels. About 23% of items require images. The dataset is designed to test multilingual, multimodal, and multilevel abilities of LLMs. Evaluations show current top models (GPT-4, ChatGPT, Claude) still struggle on low-resource and non-Latin languages and on complex image reasoning; multimodal models often miss fine image details and cross-image reasoning. Data and code are on GitHub.
Problem Statement
Standard NLP benchmarks emphasize narrow tasks or English data. They miss cultural context, images, and level-structured difficulty found in real human exams. We need a benchmark with real multilingual exam questions, image-based items, and explicit school levels to more realistically test LLM general intelligence.
Main Contribution
Constructed M3Exam: 12,317 multiple-choice questions from official exams in 9 languages spanning primary, middle, and high school.
Included multimodal content: 2,816 questions (~23%) require one or more images; images are clipped and paired with placeholders for evaluation.
Key Findings
M3Exam totals 12,317 multiple-choice questions across 9 languages.
About 23% of questions require image information to answer.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | GPT-4 72.92%, ChatGPT 57.57%, Claude 51.80% | random ~25% | GPT-4 +47.9 pp vs random | M3Exam (all languages) | Table 2: per-language and average accuracies | Table 2 |
| Accuracy on low-scoring languages (GPT-4, examples) | Javanese 55.26%, Thai 56.04% | English 87.55% (GPT-4) | Thai -31.51 pp vs English | M3Exam per-language | Table 2 language columns | Table 2 |
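
The deltas in the table are plain differences in percentage points (pp). The sketch below recomputes them from the accuracies quoted above; the dictionary and function names are illustrative and are not part of the M3Exam code, and the 25% baseline is the table's approximate random-guess figure.

```python
# Accuracies (%) as quoted in the Results table (Table 2 of the paper).
overall = {"GPT-4": 72.92, "ChatGPT": 57.57, "Claude": 51.80}
gpt4_by_language = {"English": 87.55, "Thai": 56.04, "Javanese": 55.26}

# Approximate random-guess accuracy, taken from the table's "random ~25%".
RANDOM_BASELINE = 25.0


def delta_pp(value, baseline):
    """Difference between two accuracies, in percentage points."""
    return round(value - baseline, 2)


# Gain of each model over random guessing.
vs_random = {model: delta_pp(acc, RANDOM_BASELINE) for model, acc in overall.items()}

# GPT-4's gap on each non-English language relative to its English score.
vs_english = {
    lang: delta_pp(acc, gpt4_by_language["English"])
    for lang, acc in gpt4_by_language.items()
    if lang != "English"
}

if __name__ == "__main__":
    for model, gain in vs_random.items():
        print(f"{model}: {gain:+.2f} pp vs random")
    for lang, gap in vs_english.items():
        print(f"GPT-4 {lang}: {gap:+.2f} pp vs English")
```

Reporting gaps in percentage points rather than relative percent keeps them directly comparable across languages with very different baselines.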
What To Try In 7 Days
Run your model on M3Exam slices for target languages to find language-specific failure modes.
Audit multimodal pipelines with the 2,816 image questions to measure fine-detail and cross-image reasoning gaps.
Use the provided held-out few-shot dev examples to prototype prompt strategies and measure the resulting change in accuracy.
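
The slicing steps above can be sketched as a small scoring helper. This is a minimal sketch that assumes each evaluated question is a dict with hypothetical fields `language`, `needs_image`, `answer`, and `prediction`; the actual M3Exam file layout may differ, so adapt the field names to the released data.

```python
from collections import defaultdict


def accuracy_by_slice(records, key):
    """Return accuracy (%) per slice, where key(record) names the slice."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        slice_name = key(record)
        total[slice_name] += 1
        # Compare option labels after trimming whitespace.
        if record["prediction"].strip() == record["answer"].strip():
            correct[slice_name] += 1
    return {name: 100.0 * correct[name] / total[name] for name in total}


# Toy records with hypothetical field names, for illustration only.
toy_records = [
    {"language": "english", "needs_image": False, "answer": "A", "prediction": "A"},
    {"language": "english", "needs_image": True, "answer": "B", "prediction": "C"},
    {"language": "thai", "needs_image": False, "answer": "D", "prediction": "D "},
    {"language": "thai", "needs_image": True, "answer": "A", "prediction": "A"},
]

# Slice by language to surface language-specific failure modes.
per_language = accuracy_by_slice(toy_records, key=lambda r: r["language"])

# Slice by image requirement to audit the multimodal pipeline.
per_modality = accuracy_by_slice(toy_records, key=lambda r: r["needs_image"])

if __name__ == "__main__":
    print(per_language)
    print(per_modality)
```

Passing the slice as a `key` function lets the same helper report by language, by school level, or by image requirement; running it before and after a prompt change gives the accuracy delta per slice.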
Risks & Boundaries
Limitations
Only multiple-choice items; not suitable for generative or open-ended writing evaluation.
Collected exams focus on nine countries/languages; other languages and exam formats are absent.
When Not To Use
Evaluating free-form generation, essays, or long-form answers.
Measuring conversational safety or long-horizon planning abilities.
Failure Modes
Poor performance on low-resource languages (e.g., Javanese) and some non-Latin scripts (e.g., Thai).
Multimodal models miss fine image details (axis labels, small numbers) and fail cross-image reasoning.

