M3Exam: 12k official exam questions in 9 languages (23% with images) to stress-test LLMs' multilingual and multimodal skills

June 8, 2023 | 7 min read

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

31

Authors

Wenxuan Zhang, Sharifah Mahani Aljunied, Chang Gao, Yew Ken Chia, Lidong Bing

Links

Abstract / PDF

Why It Matters For Business

M3Exam reveals real-world gaps in multilingual and multimodal LLMs: expect failures on low-resource languages and complex images, so validate models on representative data before deployment.

Summary TLDR

M3Exam is a 12,317-question benchmark built from real, official school exams across 9 languages and three school levels. About 23% of items require images. The dataset is designed to test multilingual, multimodal, and multilevel abilities of LLMs. Evaluations show current top models (GPT-4, ChatGPT, Claude) still struggle on low-resource and non-Latin languages and on complex image reasoning; multimodal models often miss fine image details and cross-image reasoning. Data and code are on GitHub.

Problem Statement

Standard NLP benchmarks emphasize narrow tasks or English data. They miss cultural context, images, and level-structured difficulty found in real human exams. We need a benchmark with real multilingual exam questions, image-based items, and explicit school levels to more realistically test LLM general intelligence.

Main Contribution

Constructed M3Exam: 12,317 multiple-choice questions from official exams in 9 languages spanning primary, middle, and high school.

Included multimodal content: 2,816 questions (~23%) require one or more images; each image is cropped out and referenced by a placeholder in the question text for evaluation.

Provided structured metadata per question: language, level, subject, options, ground truth, and held-out few-shot dev samples.

Evaluated a range of top text and multimodal LLMs (GPT-4, ChatGPT, Claude, BLOOM, Vicuna, BLIP-2, InstructBLIP, Fromage, OpenFlamingo) and analyzed multilingual, multimodal, and multilevel performance patterns.
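To make the per-question metadata concrete, here is a minimal sketch of how one item could be represented and turned into a multiple-choice prompt. The class name, field names, and prompt layout are assumptions for illustration only; the released M3Exam files may use different keys and the paper's prompts may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExamQuestion:
    # Hypothetical schema: field names are illustrative, not the dataset's actual keys.
    language: str
    level: str            # e.g. "low", "mid", "high"
    subject: str
    question_text: str
    options: List[str]
    answer: str           # ground-truth option label, e.g. "B"
    image_ids: List[str] = field(default_factory=list)

    @property
    def needs_image(self) -> bool:
        # An item counts as multimodal if it references at least one image.
        return len(self.image_ids) > 0


def format_prompt(q: ExamQuestion) -> str:
    # Question, then one option per line, then an answer cue for the model.
    lines = [q.question_text, *q.options, "Answer:"]
    return "\n".join(lines)


sample = ExamQuestion(
    language="english",
    level="mid",
    subject="math",
    question_text="What is 2 + 3?",
    options=["(A) 4", "(B) 5", "(C) 6", "(D) 7"],
    answer="B",
)
print(format_prompt(sample))
```

Keeping `needs_image` as a derived property makes it easy to split text-only and image-dependent items without storing a separate flag.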

Key Findings

M3Exam totals 12,317 multiple-choice questions across 9 languages.

Numbers: 12,317 total questions; 9 languages

About 23% of questions require image information to answer.

Numbers: 2,816 image questions (~23%)

GPT-4 is the best-performing model tested but still far from perfect.

Numbers: GPT-4 average accuracy 72.92% across languages

Multimodal models struggle with exam images and cross-image reasoning.

Numbers: BLIP-2 overall 49.06%; InstructBLIP 46.62%; Fromage 22.77%

Model accuracy does not monotonically decline with higher school levels.

Numbers: Figure 4 shows no clear decreasing trend across low/mid/high levels

Results

Accuracy (text LLMs, average across languages)

Value: GPT-4 72.92%, ChatGPT 57.57%, Claude 51.80%

Baseline: random ~25%

Per-language low performance (examples)

Value: Javanese GPT-4 55.26%, Thai GPT-4 56.04%

Baseline: English GPT-4 87.55%

Accuracy (multimodal models)

Value: BLIP-2 49.06%, InstructBLIP 46.62%, Fromage 22.77%

Baseline: Flan-T5 (text-only) 48.30%, ChatGPT (text-only) 55.6%

Few-shot vs zero-shot effect (ChatGPT)

Value: Mixed. Few-shot prompting produces small changes, some up and some down (e.g., English 75.46% few-shot vs 75.98% zero-shot)

Baseline: zero-shot monolingual prompts
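The reported figures can be sanity-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this summary; the variable names are illustrative, and the 25% random baseline is the approximate value given above.

```python
# Dataset composition
TOTAL_QUESTIONS = 12_317
IMAGE_QUESTIONS = 2_816
image_share_pct = round(100 * IMAGE_QUESTIONS / TOTAL_QUESTIONS, 1)  # about 22.9, i.e. "~23%"

# GPT-4: gap between English and Javanese accuracy, in percentage points
gpt4_en_vs_jv_gap = round(87.55 - 55.26, 2)  # 32.29

# GPT-4: margin of average accuracy over the ~25% random baseline
gpt4_over_random = round(72.92 - 25.0, 2)  # 47.92

# BLIP-2 vs the text-only Flan-T5 baseline, in percentage points
blip2_over_flan_t5 = round(49.06 - 48.30, 2)  # 0.76

print(image_share_pct, gpt4_en_vs_jv_gap, gpt4_over_random, blip2_over_flan_t5)
```

Read this way, GPT-4 loses roughly 32 points moving from English to Javanese, and BLIP-2 lands less than one point above the text-only Flan-T5 baseline.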

Who Should Care

What To Try In 7 Days

Run your model on M3Exam slices for target languages to find language-specific failure modes.

Audit multimodal pipelines with the 2,816 image questions to measure fine-detail and cross-image reasoning gaps.

Use the provided held-out few-shot dev examples to prototype prompt strategies and measure the change in accuracy.
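A minimal sketch of the slice analysis suggested above, assuming you have already collected model predictions into plain dictionaries. The keys `language`, `has_image`, `answer`, and `prediction` are hypothetical and would need to be mapped from whatever format your evaluation harness produces.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return accuracy per group, where groups are the distinct values of `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = r[key]
        total[group] += 1
        # Compare option labels case-insensitively, ignoring stray whitespace.
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}


# Toy records for illustration only; these are not M3Exam results.
records = [
    {"language": "english", "has_image": False, "answer": "A", "prediction": "A"},
    {"language": "english", "has_image": True, "answer": "B", "prediction": "b "},
    {"language": "english", "has_image": True, "answer": "C", "prediction": "D"},
    {"language": "thai", "has_image": False, "answer": "A", "prediction": "A"},
    {"language": "thai", "has_image": True, "answer": "D", "prediction": "B"},
]

by_language = accuracy_by(records, "language")
by_modality = accuracy_by(records, "has_image")
print(by_language)
print(by_modality)
```

Slicing by both language and image dependence separates the two gaps the benchmark is designed to expose, which a single overall accuracy would hide.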

Reproducibility

License

  • CC BY-NC-SA

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Only multiple-choice items; not suitable for generative or open-ended writing evaluation.
  • Collected exams focus on nine countries/languages; other languages and exam formats are absent.
  • Some scanned papers required OCR; residual OCR noise may remain despite checks.

When Not To Use

  • Evaluating free-form generation, essays, or long-form answers.
  • Measuring conversational safety or long-horizon planning abilities.
  • Benchmarking language varieties not covered by the nine selected languages.

Failure Modes

  • Poor performance on low-resource languages (e.g., Javanese) and some non-Latin scripts (e.g., Thai).
  • Multimodal models miss fine image details (axis labels, small numbers) and fail cross-image reasoning.
  • Few-shot prompts do not reliably improve performance; model gains depend on language and example choice.
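Because few-shot gains depend on the language and on which examples are chosen, it helps to build zero-shot and few-shot prompts from the same code path and compare them directly. The sketch below is one possible way to do that; the function name, dictionary keys, and prompt layout are assumptions, not the paper's exact prompt format.

```python
def build_prompt(dev_examples, question, options, k=0):
    """Build a multiple-choice prompt with k solved dev examples in front.

    With k=0 this yields a zero-shot prompt. The layout is illustrative only.
    """
    blocks = []
    for ex in dev_examples[:k]:
        # Each demonstration shows the question, its options, and the gold answer.
        blocks.append("\n".join([ex["question"], *ex["options"], f"Answer: {ex['answer']}"]))
    # The target question ends with an open answer cue for the model to complete.
    blocks.append("\n".join([question, *options, "Answer:"]))
    return "\n\n".join(blocks)


# Toy dev examples for illustration only.
dev = [
    {"question": "What is 1 + 1?", "options": ["(A) 1", "(B) 2"], "answer": "B"},
    {"question": "What is 3 - 1?", "options": ["(A) 2", "(B) 3"], "answer": "A"},
]

zero_shot = build_prompt(dev, "What is 2 + 2?", ["(A) 4", "(B) 5"], k=0)
few_shot = build_prompt(dev, "What is 2 + 2?", ["(A) 4", "(B) 5"], k=2)
print(few_shot)
```

Running both variants per language, and varying which dev examples are selected, gives a direct read on whether few-shot prompting helps or hurts for a given slice.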

Core Entities

Models

  • GPT-4
  • ChatGPT (gpt-3.5-turbo)
  • Claude
  • BLOOM
  • Vicuna
  • BLIP-2
  • InstructBLIP
  • Fromage
  • OpenFlamingo
  • Flan-T5

Metrics

  • Accuracy

Datasets

  • M3Exam

Benchmarks

  • MMLU
  • AGIEval
  • C-Eval

Context Entities

Models

  • PaLM2
  • LLaMA
  • Vicuna (fine-tuned LLaMA)

Metrics

  • random baseline
  • passing score (country-specific)

Datasets

  • CommonCrawl (language ranking reference)

Benchmarks

  • XTREME
  • XTREME-R
  • VQA