Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
31
Why It Matters For Business
M3Exam reveals real-world gaps in multilingual and multimodal LLMs: expect failures on low-resource languages and complex images, so validate models on representative data before deployment.
Summary TLDR
M3Exam is a 12,317-question benchmark built from real, official school exams across 9 languages and three school levels. About 23% of items require images. The dataset is designed to test multilingual, multimodal, and multilevel abilities of LLMs. Evaluations show current top models (GPT-4, ChatGPT, Claude) still struggle on low-resource and non-Latin languages and on complex image reasoning; multimodal models often miss fine image details and cross-image reasoning. Data and code are on GitHub.
Problem Statement
Standard NLP benchmarks emphasize narrow tasks or English-only data. They miss the cultural context, images, and level-structured difficulty found in real human exams. A benchmark built from real multilingual exam questions, with image-based items and explicit school levels, is needed to test the general capabilities of LLMs more realistically.
Main Contribution
Constructed M3Exam: 12,317 multiple-choice questions from official exams in 9 languages spanning primary, middle, and high school.
Included multimodal content: 2,816 questions (~23%) require one or more images; each image is clipped from the source paper and referenced by a placeholder in the question text for evaluation.
Provided structured metadata per question: language, level, subject, options, ground truth, and held-out few-shot dev samples.
Evaluated a range of top text and multimodal LLMs (GPT-4, ChatGPT, Claude, BLOOM, Vicuna, BLIP-2, InstructBLIP, Fromage, OpenFlamingo) and analyzed multilingual, multimodal, and multilevel performance patterns.
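The per-question metadata listed above could be represented roughly as follows. This is a minimal sketch: the field names and the `(image)[...]` placeholder syntax are assumptions made for illustration, not the schema of the released data.

```python
import re
from dataclasses import dataclass

# Hypothetical placeholder syntax; the released files may mark images differently.
PLACEHOLDER = re.compile(r"\(image\)\[([^\]]+)\]")


@dataclass
class ExamQuestion:
    language: str
    level: str           # e.g. "primary", "middle", "high"
    subject: str
    question_text: str
    options: list[str]
    answer: str          # ground-truth option label

    @property
    def image_ids(self) -> list[str]:
        """Collect image references from the question text and all options."""
        texts = [self.question_text, *self.options]
        return [name for text in texts for name in PLACEHOLDER.findall(text)]

    @property
    def needs_image(self) -> bool:
        return bool(self.image_ids)


# Made-up example item, not taken from the dataset.
example = ExamQuestion(
    language="thai",
    level="middle",
    subject="math",
    question_text="Which graph matches (image)[image-12.png]?",
    options=["(A) linear", "(B) quadratic", "(C) constant", "(D) none"],
    answer="B",
)
```

Keeping the image references derivable from the text makes it easy to split a file into text-only and image-dependent items before choosing which model to run.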
Key Findings
M3Exam totals 12,317 multiple-choice questions across 9 languages.
About 23% of questions require image information to answer.
GPT-4 is the best-performing model tested but still far from perfect.
Multimodal models struggle with exam images and cross-image reasoning.
Model accuracy does not decline monotonically as school level rises.
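The "about 23%" figure follows directly from the two counts reported in this summary. A quick check, using only those two numbers:

```python
TOTAL_QUESTIONS = 12_317  # total reported in this summary
IMAGE_QUESTIONS = 2_816   # image-dependent questions reported in this summary

image_share = IMAGE_QUESTIONS / TOTAL_QUESTIONS
text_only_questions = TOTAL_QUESTIONS - IMAGE_QUESTIONS

print(f"image share: {image_share:.1%}")
print(f"text-only questions: {text_only_questions}")
```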
Results
Accuracy: per-language low performance (examples)
Accuracy: few-shot vs zero-shot effect (ChatGPT)
Who Should Care
What To Try In 7 Days
Run your model on M3Exam slices for target languages to find language-specific failure modes.
Audit multimodal pipelines with the 2,816 image questions to measure fine-detail and cross-image reasoning gaps.
Use the provided held-out few-shot dev examples to prototype prompt strategies and measure the resulting change in accuracy.
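The three checks above all reduce to computing accuracy per slice and comparing runs. A minimal sketch, assuming each record is a dict with hypothetical keys `language`, `level`, `has_image`, and `answer`, and that predictions are option labels in the same order; the toy data below is made up and does not reflect any reported result:

```python
from collections import defaultdict


def accuracy_by_slice(records, predictions, key):
    """Return {slice_value: accuracy} with records grouped by `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec, pred in zip(records, predictions):
        total[rec[key]] += 1
        correct[rec[key]] += int(pred == rec["answer"])
    return {k: correct[k] / total[k] for k in total}


# Toy records (illustrative only).
records = [
    {"language": "english", "level": "high", "has_image": False, "answer": "A"},
    {"language": "english", "level": "middle", "has_image": True, "answer": "C"},
    {"language": "javanese", "level": "high", "has_image": False, "answer": "B"},
    {"language": "javanese", "level": "primary", "has_image": False, "answer": "D"},
]
zero_shot_preds = ["A", "C", "A", "D"]
few_shot_preds = ["A", "B", "B", "D"]

zero_by_lang = accuracy_by_slice(records, zero_shot_preds, "language")
few_by_lang = accuracy_by_slice(records, few_shot_preds, "language")

# Per-language change when moving from zero-shot to few-shot prompts.
delta_by_lang = {lang: few_by_lang[lang] - zero_by_lang[lang] for lang in zero_by_lang}

# The same helper slices by image dependence for the multimodal audit.
zero_by_image = accuracy_by_slice(records, zero_shot_preds, "has_image")
```

Reporting the per-slice delta rather than a single overall number makes it visible when a prompt change helps one language and hurts another.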
Reproducibility
License
- CC BY-NC-SA
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Only multiple-choice items; not suitable for generative or open-ended writing evaluation.
- Collected exams focus on nine countries/languages; other languages and exam formats are absent.
- Some scanned papers required OCR; residual OCR noise may remain despite checks.
When Not To Use
- Evaluating free-form generation, essays, or long-form answers.
- Measuring conversational safety or long-horizon planning abilities.
- Benchmarking language varieties not covered by the nine selected languages.
Failure Modes
- Poor performance on low-resource languages (e.g., Javanese) and some non-Latin scripts (e.g., Thai).
- Multimodal models miss fine image details (axis labels, small numbers) and fail cross-image reasoning.
- Few-shot prompts do not reliably improve performance; gains depend on the language and on which examples are chosen.
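Because few-shot gains depend on example choice, it helps to build zero-shot and few-shot prompts from the same template so that only the examples differ. A minimal sketch; the prompt layout and the dict keys (`question`, `options`, `answer`) are assumptions for illustration, not the prompt format used in the original evaluation:

```python
def format_question(question, options, answer=None):
    """Render one item; leave the answer slot empty for the target question."""
    answer_line = "Answer:" if answer is None else f"Answer: {answer}"
    return "\n".join([question, *options, answer_line])


def build_prompt(instruction, target, dev_examples=()):
    """Zero-shot when dev_examples is empty, few-shot otherwise."""
    parts = [instruction]
    parts += [format_question(**ex) for ex in dev_examples]
    parts.append(format_question(target["question"], target["options"]))
    return "\n\n".join(parts)


# Made-up items, not taken from the dataset.
dev = [
    {"question": "2 + 2 = ?", "options": ["(A) 3", "(B) 4"], "answer": "B"},
    {"question": "3 - 1 = ?", "options": ["(A) 2", "(B) 1"], "answer": "A"},
]
target = {"question": "5 + 1 = ?", "options": ["(A) 6", "(B) 7"]}
instruction = "Choose the correct option."

zero_shot_prompt = build_prompt(instruction, target)
few_shot_prompt = build_prompt(instruction, target, dev)
```

Swapping in different subsets of the dev examples and re-scoring gives a direct measure of how sensitive a model is to example selection.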
Core Entities
Models
- GPT-4
- ChatGPT (gpt-3.5-turbo)
- Claude
- BLOOM
- Vicuna
- BLIP-2
- InstructBLIP
- Fromage
- OpenFlamingo
- Flan-T5
Metrics
- Accuracy
Datasets
- M3Exam
Benchmarks
- MMLU
- AGIEval
- C-Eval
Context Entities
Models
- PaLM2
- LLaMA
- Vicuna (fine-tuned LLaMA)
Metrics
- random baseline
- passing score (country-specific)
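For multiple-choice items, the random baseline is the expected accuracy of uniform guessing, which is the mean of 1/k over questions with k options each. A minimal sketch; whether option counts vary across exams is an assumption here, and the function name is hypothetical:

```python
def random_baseline(option_counts):
    """Expected accuracy of uniform random guessing.

    option_counts: number of answer options for each question.
    """
    if not option_counts:
        raise ValueError("need at least one question")
    return sum(1 / k for k in option_counts) / len(option_counts)


# With four options everywhere the baseline is 25%.
four_option_baseline = random_baseline([4, 4, 4, 4])

# Mixing four- and five-option questions lowers it slightly.
mixed_baseline = random_baseline([4, 5])
```

A model's accuracy on a slice is only meaningful relative to this floor and to the relevant country-specific passing score.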
Datasets
- CommonCrawl (language ranking reference)
Benchmarks
- XTREME
- XTREME-R
- VQA

