M3Exam: 12k official exam questions in 9 languages (23% with images) to stress-test LLMs' multilingual and multimodal skills

June 8, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Provides clear, real-exam items that reveal gaps in language and visual reasoning; use it to prioritize model fixes and domain-specific retraining.

Citations: 31

Evidence Strength: 0.90

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

License: CC BY-NC-SA

At A Glance

Cost impact: 40%

Production readiness: 70%

Novelty: 60%

Authors

Wenxuan Zhang, Sharifah Mahani Aljunied, Chang Gao, Yew Ken Chia, Lidong Bing

Links

Abstract / PDF / Code / Data

Why It Matters For Business

M3Exam reveals real-world gaps in multilingual and multimodal LLMs: expect failures on low-resource languages and complex images, so validate models on representative data before deployment.

Who Should Care

Summary TLDR

M3Exam is a 12,317-question benchmark built from real, official school exams across 9 languages and three school levels. About 23% of items require images. The dataset is designed to test multilingual, multimodal, and multilevel abilities of LLMs. Evaluations show current top models (GPT-4, ChatGPT, Claude) still struggle on low-resource and non-Latin languages and on complex image reasoning; multimodal models often miss fine image details and cross-image reasoning. Data and code are on GitHub.

Problem Statement

Standard NLP benchmarks emphasize narrow tasks or English data. They miss cultural context, images, and level-structured difficulty found in real human exams. We need a benchmark with real multilingual exam questions, image-based items, and explicit school levels to more realistically test LLM general intelligence.

Main Contribution

Constructed M3Exam: 12,317 multiple-choice questions from official exams in 9 languages spanning primary, middle, and high school.

Included multimodal content: 2,816 questions (~23%) require one or more images; images are clipped and paired with placeholders for evaluation.

Key Findings

M3Exam totals 12,317 multiple-choice questions across 9 languages.

Numbers: 12,317 total questions; 9 languages

Practical Use: Use M3Exam for broader evaluation beyond English; it gives scale and cross-language coverage for stress-testing LLMs.

Evidence Ref: Abstract, Section 2.4, Table 1

About 23% of questions require image information to answer.

Numbers: 2,816 image questions (~23%)

Practical Use: Include multimodal evaluation in your validation: text-only checks miss nearly a quarter of exam-style items.

Evidence Ref: Abstract, Section 2.4, Table 1
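The ~23% share follows directly from the two counts reported above. A quick arithmetic check, with illustrative variable names that do not come from the paper or its code:

```python
# Counts as reported in the findings above.
TOTAL_QUESTIONS = 12_317
IMAGE_QUESTIONS = 2_816

# Share of items that need image information, and the remainder
# that a text-only evaluation would actually cover.
image_share = IMAGE_QUESTIONS / TOTAL_QUESTIONS
text_only_questions = TOTAL_QUESTIONS - IMAGE_QUESTIONS

print(f"image share: {image_share:.1%}")
print(f"text-only questions: {text_only_questions}")
```

A text-only pipeline therefore covers only the remaining questions, so any accuracy it reports says nothing about the image-dependent items.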

Results

Metric: Accuracy
Value: GPT-4 72.92%, ChatGPT 57.57%, Claude 51.80%
Baseline: random ~25%
Delta: GPT-4 +47.9 pp vs random
Split / Dataset: M3Exam (all languages)
Evidence: Table 2: per-language and average accuracies
Evidence Ref: Table 2

Metric: Per-language low performance (examples)
Value: Javanese GPT-4 55.26%, Thai GPT-4 56.04%
Baseline: English GPT-4 87.55%
Delta: Thai -31.51 pp vs English
Split / Dataset: M3Exam per-language
Evidence: Table 2 language columns
Evidence Ref: Table 2
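The deltas in the Results rows above are plain differences in percentage points (pp). The sketch below recomputes them from the listed accuracies. The helper name `delta_pp` is illustrative, the random baseline is taken as exactly 25% although the summary only says "~25%", and the Javanese delta is derived here rather than quoted from the source:

```python
# Accuracies in percent, as listed in the Results rows above.
gpt4_overall = 72.92
gpt4_english = 87.55
gpt4_thai = 56.04
gpt4_javanese = 55.26
random_baseline = 25.0  # assumption: the summary gives "~25%"


def delta_pp(value, baseline):
    """Difference in percentage points, rounded to two decimals."""
    return round(value - baseline, 2)


overall_vs_random = delta_pp(gpt4_overall, random_baseline)
thai_vs_english = delta_pp(gpt4_thai, gpt4_english)
javanese_vs_english = delta_pp(gpt4_javanese, gpt4_english)  # derived, not in the source

print(overall_vs_random, thai_vs_english, javanese_vs_english)
```

Reporting gaps in percentage points rather than relative percent keeps them directly comparable across languages with different baselines.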

What To Try In 7 Days

Run your model on M3Exam slices for target languages to find language-specific failure modes.

Audit multimodal pipelines with the 2,816 image questions to measure fine-detail and cross-image reasoning gaps.

Use the provided held-out dev examples as few-shot demonstrations to prototype prompt strategies and measure the resulting change in accuracy.
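The slicing steps above can be sketched as a small accuracy breakdown. This is a minimal sketch under stated assumptions: the record fields (`language`, `has_image`, `answer`, `prediction`) and the function `accuracy_by_slice` are hypothetical and do not reflect the actual schema or code of the M3Exam repository, and the records are toy data, not benchmark results.

```python
from collections import defaultdict


def accuracy_by_slice(records, key):
    """Return {slice_value: accuracy} for records grouped by `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec[key]] += 1
        if rec["prediction"] == rec["answer"]:
            correct[rec[key]] += 1
    return {k: correct[k] / total[k] for k in total}


# Toy records with a hypothetical layout (not real M3Exam data).
records = [
    {"language": "english", "has_image": False, "answer": "A", "prediction": "A"},
    {"language": "english", "has_image": True, "answer": "B", "prediction": "C"},
    {"language": "thai", "has_image": False, "answer": "D", "prediction": "D"},
    {"language": "thai", "has_image": False, "answer": "A", "prediction": "B"},
    {"language": "thai", "has_image": True, "answer": "C", "prediction": "A"},
    {"language": "javanese", "has_image": False, "answer": "B", "prediction": "B"},
]

per_language = accuracy_by_slice(records, "language")
per_modality = accuracy_by_slice(records, "has_image")
print(per_language)
print(per_modality)
```

Running the same function once per prompt strategy and comparing the per-slice numbers gives the before/after change for each language and for image versus text-only items.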

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: CC BY-NC-SA

Risks & Boundaries

Limitations

Only multiple-choice items; not suitable for generative or open-ended writing evaluation.

Collected exams focus on nine countries/languages; other languages and exam formats are absent.

When Not To Use

Evaluating free-form generation, essays, or long-form answers.

Measuring conversational safety or long-horizon planning abilities.

Failure Modes

Poor performance on low-resource languages (e.g., Javanese) and some non-Latin scripts (e.g., Thai).

Multimodal models miss fine image details (axis labels, small numbers) and fail cross-image reasoning.

Core Entities

Models

GPT-4, ChatGPT (gpt-3.5-turbo), Claude, BLOOM, Vicuna, BLIP-2, InstructBLIP, Fromage, OpenFlamingo, Flan-T5

Metrics

Accuracy

Datasets

M3Exam

Benchmarks

MMLU, AGIEval, C-Eval

Context Entities

Models

PaLM2, LLaMA, Vicuna (fine-tuned LLaMA)

Metrics

random baseline, passing score (country-specific)

Datasets

CommonCrawl (language ranking reference)

Benchmarks

XTREME, XTREME-R, VQA