M3Exam: 12k official exam questions in 9 languages (23% with images) to stress-test LLMs' multilingual and multimodal skills

Overview

Decision Snapshot: Ready For Pilot

Provides clear, real-exam items that reveal gaps in language and visual reasoning; use it to prioritize model fixes and domain-specific retraining.

Citations: 31

Evidence Strength: 0.90

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

License: CC BY-NC-SA

At A Glance

Cost impact: 40%

Production readiness: 70%

Novelty: 60%

Authors

Wenxuan Zhang, Sharifah Mahani Aljunied, Chang Gao, Yew Ken Chia, Lidong Bing

Links

Abstract / PDF / Code / Data

Why It Matters For Business

M3Exam reveals real-world gaps in multilingual and multimodal LLMs: expect failures on low-resource languages and complex images, so validate models on representative data before deployment.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

M3Exam is a 12,317-question benchmark built from real, official school exams across 9 languages and three school levels. About 23% of items require images. The dataset is designed to test multilingual, multimodal, and multilevel abilities of LLMs. Evaluations show current top models (GPT-4, ChatGPT, Claude) still struggle on low-resource and non-Latin languages and on complex image reasoning; multimodal models often miss fine image details and cross-image reasoning. Data and code are on GitHub.

Problem Statement

Standard NLP benchmarks emphasize narrow tasks or English data. They miss cultural context, images, and level-structured difficulty found in real human exams. We need a benchmark with real multilingual exam questions, image-based items, and explicit school levels to more realistically test LLM general intelligence.

Main Contribution

Constructed M3Exam: 12,317 multiple-choice questions from official exams in 9 languages spanning primary, middle, and high school.

Included multimodal content: 2,816 questions (~23%) require one or more images; images are clipped and paired with placeholders for evaluation.

Key Findings

M3Exam totals 12,317 multiple-choice questions across 9 languages.

Numbers: 12,317 total questions; 9 languages

Practical Use: Use M3Exam for broader evaluation beyond English; it gives scale and cross-language coverage for stress-testing LLMs.

Evidence Ref: Abstract, Section 2.4, Table 1

About 23% of questions require image information to answer.

Numbers: 2,816 image questions (~23%)

Practical Use: Include multimodal evaluation in your validation: text-only checks miss nearly a quarter of exam-style items.

Evidence Ref: Abstract, Section 2.4, Table 1
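The "~23%" figure follows directly from the two counts reported above. A minimal sketch of that arithmetic, which also shows how many questions a text-only evaluation would cover (the variable names are illustrative, not from the paper):

```python
# Counts taken from the Key Findings above.
TOTAL_QUESTIONS = 12_317
IMAGE_QUESTIONS = 2_816

# Share of questions that need image information, in percent.
image_share = round(100 * IMAGE_QUESTIONS / TOTAL_QUESTIONS, 2)

# Questions that remain if the evaluation is restricted to text-only items.
text_only_questions = TOTAL_QUESTIONS - IMAGE_QUESTIONS

print(f"image share: {image_share}%")
print(f"text-only questions: {text_only_questions}")
```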

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-4 72.92%, ChatGPT 57.57%, Claude 51.80%	random ~25%	GPT-4 +47.9 pp vs random	M3Exam (all languages)	Table 2: per-language and average accuracies	Table 2
Per-language low performance (examples)	Javanese GPT-4 55.26%, Thai GPT-4 56.04%	English GPT-4 87.55%	Thai -31.51 pp vs English	M3Exam per-language	Table 2 language columns	Table 2
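The Delta column is a plain percentage-point difference between a reported accuracy and its baseline. A small sketch that recomputes it from the figures in the table; the 25% random baseline is the table's approximate value and assumes four answer options per question:

```python
# Accuracy figures (percent) copied from the Results table.
overall = {"GPT-4": 72.92, "ChatGPT": 57.57, "Claude": 51.80}
RANDOM_BASELINE = 25.0  # approximate; assumes four answer options per question


def delta_pp(value, baseline):
    """Difference in percentage points, rounded to two decimals."""
    return round(value - baseline, 2)


# Gain of each model over random guessing.
vs_random = {model: delta_pp(acc, RANDOM_BASELINE) for model, acc in overall.items()}

# GPT-4 per-language gap relative to its English accuracy.
gpt4_by_language = {"English": 87.55, "Thai": 56.04, "Javanese": 55.26}
vs_english = {
    lang: delta_pp(acc, gpt4_by_language["English"])
    for lang, acc in gpt4_by_language.items()
    if lang != "English"
}

print(vs_random)
print(vs_english)
```

Negative values in `vs_english` mark languages where GPT-4 trails its own English score.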

What To Try In 7 Days

Run your model on M3Exam slices for target languages to find language-specific failure modes.

Audit multimodal pipelines with the 2,816 image questions to measure fine-detail and cross-image reasoning gaps.

Use the provided held-out few-shot dev examples to prototype prompt strategies and measure the resulting change in accuracy.
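The slicing steps above can be sketched as a small accuracy breakdown. The record layout here (`language`, `has_image`, `answer`, `prediction`) is a hypothetical format for your own model outputs, not the schema of the M3Exam repository, and the records are toy data rather than real results:

```python
from collections import defaultdict


def accuracy_by_slice(records, key="language"):
    """Return {slice_value: accuracy in percent} for a list of prediction records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec[key]] += 1
        if rec["prediction"] == rec["answer"]:
            correct[rec[key]] += 1
    return {k: 100.0 * correct[k] / total[k] for k in total}


# Toy records for illustration only; not real M3Exam data or results.
records = [
    {"language": "english", "has_image": False, "answer": "A", "prediction": "A"},
    {"language": "english", "has_image": True, "answer": "B", "prediction": "C"},
    {"language": "thai", "has_image": False, "answer": "D", "prediction": "D"},
    {"language": "thai", "has_image": False, "answer": "A", "prediction": "B"},
    {"language": "thai", "has_image": True, "answer": "C", "prediction": "B"},
]

by_language = accuracy_by_slice(records)
by_modality = accuracy_by_slice(records, key="has_image")

print(by_language)
print(by_modality)
```

Running the same function before and after a prompt change gives a per-slice view of what the change actually moved.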

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: CC BY-NC-SA

Code URLs

https://github.com/DAMO-NLP-SG/M3Exam

Data URLs

https://github.com/DAMO-NLP-SG/M3Exam

Risks & Boundaries

Limitations

Only multiple-choice items; not suitable for generative or open-ended writing evaluation.

Collected exams focus on nine countries/languages; other languages and exam formats are absent.

When Not To Use

Evaluating free-form generation, essays, or long-form answers.

Measuring conversational safety or long-horizon planning abilities.

Failure Modes

Poor performance on low-resource languages (e.g., Javanese) and some non-Latin scripts (e.g., Thai).

Multimodal models miss fine image details (axis labels, small numbers) and fail cross-image reasoning.

Core Entities

Models

GPT-4, ChatGPT (gpt-3.5-turbo), Claude, BLOOM, Vicuna, BLIP-2, InstructBLIP, Fromage, OpenFlamingo, Flan-T5

Metrics

Accuracy

Datasets

M3Exam

Benchmarks

MMLU, AGIEval, C-Eval

Context Entities

Models

PaLM2, LLaMA, Vicuna (fine-tuned LLaMA)

Metrics

random baseline, passing score (country-specific)

Datasets

CommonCrawl (language ranking reference)

Benchmarks

XTREME, XTREME-R, VQA

