MATHVISTA: a 6k-example multimodal math benchmark showing GPT-4V is strongest but still about 10 percentage points behind humans

October 3, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

11

Authors

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, Jianfeng Gao

Why It Matters For Business

MATHVISTA highlights where vision+math systems fail (OCR, shape detection, hallucination). Use it to benchmark assistants that read charts, analyze reports, or grade math-in-image tasks before deployment.

Summary TLDR

MATHVISTA is a 6,141-example benchmark that measures math reasoning when problems include images (charts, geometry, plots, diagrams, natural scenes). The authors combine 28 prior multimodal datasets and add 3 new sets (IQTest, FunctionQA, PaperQA). They evaluate 12 leading models. GPT-4V leads at 49.9% accuracy, Multimodal Bard scores 34.8%, and human annotators score 60.3%. The benchmark exposes common failure modes: poor OCR/caption quality, hallucinations, shape detection gaps, and calculation errors. The paper also documents emergent behaviors in GPT-4V like self-verification and benefits from self-consistency.

Problem Statement

Current benchmarks test math mostly in text. Real math problems often need visual perception (plots, diagrams, tables) plus multi-step math. There was no consolidated, fine-grained benchmark to measure how foundation models combine vision and math reasoning. MATHVISTA fills that gap.

Main Contribution

A unified multimodal math benchmark (MATHVISTA) with 6,141 examples drawn from 31 source datasets: 28 existing ones plus 3 new ones (IQTest, FunctionQA, PaperQA).

Fine-grained metadata per example: task type, visual context, reasoning types, grade level, and answer formats.

A large evaluation of 12 foundation models, including manual GPT-4V evaluation plus analyses of hallucination, self-verification, and self-consistency.
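
The per-example metadata is what makes fine-grained breakdowns possible. A minimal sketch of how such a record and a per-category accuracy report might look follows; the class name `Example`, its field names, and the helper `accuracy_by` are illustrative assumptions, not the benchmark's official schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List


@dataclass
class Example:
    """One benchmark item. Field names are hypothetical, not the official schema."""
    question: str
    answer: str
    task: str             # task type
    visual_context: str   # kind of image, e.g. a chart or a diagram
    grade_level: str
    answer_format: str
    reasoning_types: List[str] = field(default_factory=list)


def accuracy_by(examples, predictions, key):
    """Return {metadata value: accuracy} for one metadata field."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex, pred in zip(examples, predictions):
        group = getattr(ex, key)
        total[group] += 1
        correct[group] += int(pred.strip() == ex.answer.strip())
    return {group: correct[group] / total[group] for group in total}
```

Calling `accuracy_by(examples, predictions, "visual_context")` would then show, for instance, whether a model is weaker on one kind of image than another.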

Key Findings

GPT-4V is the best model but still below humans.

Numbers: GPT-4V 49.9% vs. human 60.3% (gap of 10.4 percentage points)

GPT-4V substantially outperforms other multimodal models.

Numbers: GPT-4V 49.9% vs. Multimodal Bard 34.8% (+15.1 percentage points)

Text-only LLMs and simple visual augmentations fall far short.

Numbers: 2-shot CoT GPT-4 (text-only) 29.2%; PoT GPT-4 with Bard captions + OCR 33.9%

Dataset size and composition.

Numbers: 6,141 examples total; 736 newly annotated examples

Hallucination is a major failure mode for generative LMMs.

Numbers: Human evaluation: 49.6% of Bard's wrong explanations contain hallucinations

Emergent verification behaviors in GPT-4V can help.

Numbers: Self-verification and self-consistency reduce perception and calculation errors in sampled examples

Results

Accuracy (GPT-4V)

Value: 49.9%

Baseline: Human 60.3%

Model gap vs next best

Value: 15.1 pp

Baseline: Multimodal Bard 34.8%

Text-only best (CoT GPT-4)

Value: 29.2%

Baseline: Random chance 17.9%

Augmented LLM (PoT GPT-4 + captions + OCR)

Value: 33.9%

Baseline: 2-shot PoT GPT-4 (Q, I_c, I_t)

Dataset size

Value: 6,141 examples

Baseline: 736 newly annotated
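
The gaps quoted in this summary are differences between accuracies, so they are percentage points rather than relative percentages. The short sketch below recomputes them from the reported scores; the dictionary keys and helper names are made up for illustration.

```python
# Accuracies (%) as reported in this summary.
scores = {
    "human": 60.3,
    "gpt4v": 49.9,
    "bard": 34.8,
    "cot_gpt4_text_only": 29.2,
    "random_chance": 17.9,
}


def gap_pp(a, b):
    """Absolute difference a - b in percentage points, rounded to one decimal."""
    return round(a - b, 1)


def relative_shortfall(score, reference):
    """How far `score` falls below `reference`, as a percentage of `reference`."""
    return round(100 * (reference - score) / reference, 1)


print(gap_pp(scores["human"], scores["gpt4v"]))              # 10.4
print(gap_pp(scores["gpt4v"], scores["bard"]))               # 15.1
print(relative_shortfall(scores["gpt4v"], scores["human"]))  # 17.2
```

In relative terms, GPT-4V's score is about 17% below the human score, which is why "10%" and "10.4 percentage points" should not be used interchangeably.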

Who Should Care

What To Try In 7 Days

Run MATHVISTA testmini on your best multimodal model to get a baseline.

Compare direct LMM output against an LLM augmented with captions and OCR text to find perception bottlenecks.

Measure hallucination rate in explanations; add a simple verifier or ensemble check for critical outputs.
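
A baseline run and the pipeline comparison can both be reduced to a small scoring harness. The sketch below is one way to do it under stated assumptions: examples are plain dictionaries with hypothetical `pid` and `answer` keys, and each pipeline is any callable that maps an example to an answer string. It uses simple exact matching after normalization, which may differ from the benchmark's official answer extraction.

```python
def normalize(answer):
    """Trim and lowercase; canonicalize numerals so "3" and "3.0" compare equal."""
    text = str(answer).strip().lower()
    try:
        return str(float(text))
    except ValueError:
        return text


def accuracy(examples, predict):
    """Fraction of examples whose normalized prediction equals the gold answer."""
    hits = sum(normalize(predict(ex)) == normalize(ex["answer"]) for ex in examples)
    return hits / len(examples)


def perception_suspects(examples, lmm_predict, augmented_predict):
    """IDs the direct LMM gets wrong but the caption/OCR pipeline gets right."""
    suspects = []
    for ex in examples:
        gold = normalize(ex["answer"])
        lmm_ok = normalize(lmm_predict(ex)) == gold
        augmented_ok = normalize(augmented_predict(ex)) == gold
        if not lmm_ok and augmented_ok:
            suspects.append(ex["pid"])
    return suspects


# Toy data and stand-in predictors; replace with real model calls.
toy = [{"pid": "1", "answer": "3"}, {"pid": "2", "answer": "B"}]
lmm_out = {"1": "3.0", "2": "A"}
augmented_out = {"1": "4", "2": "b"}

print(accuracy(toy, lambda ex: lmm_out[ex["pid"]]))  # 0.5
print(perception_suspects(toy, lambda ex: lmm_out[ex["pid"]],
                          lambda ex: augmented_out[ex["pid"]]))  # ['2']
```

Items flagged by `perception_suspects` are candidates for manual review, since the text pipeline succeeding where the LMM fails hints that the LMM misread the image.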

Agent Features

Planning

  • self-verification (model inspects steps)
  • self-consistency (sample reasoning trajectories)
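
Self-consistency is commonly implemented as a majority vote over several sampled answers. The following is a minimal sketch of that idea, not the paper's exact procedure; `sample_fn` is a hypothetical stand-in for one stochastic model call that returns a final answer string.

```python
from collections import Counter


def self_consistent_answer(sample_fn, question, n_samples=5):
    """Sample several answers and return (majority answer, vote share)."""
    answers = [sample_fn(question) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


# Stand-in sampler that replays fixed outputs; a real one would call the model
# with a nonzero temperature so reasoning paths differ between samples.
replayed = iter(["12", "15", "12", "12", "9"])
print(self_consistent_answer(lambda q: next(replayed), "What is x?"))  # ('12', 0.6)
```

The vote share doubles as a rough confidence signal: a low share suggests the sampled reasoning paths disagree and the item may deserve a second look.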

Tool Use

  • OCR (EasyOCR)
  • image captioning (Bard)
  • program-execution (PoT generates code)
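
These three tools combine into the augmented-LLM setup: caption and OCR text stand in for the image, and the model writes a program whose printed output is taken as the answer. The sketch below illustrates that flow under assumptions: the prompt wording and both function names are invented here, and the reading of (Q, I_c, I_t) as question, caption, and OCR text is inferred from this summary.

```python
import contextlib
import io


def build_pot_prompt(question, caption, ocr_text):
    """Assemble a text-only prompt from question (Q), caption (I_c), OCR text (I_t)."""
    return (
        "Write a Python program that prints the final answer.\n"
        f"Image caption: {caption}\n"
        f"Text detected in image: {ocr_text}\n"
        f"Question: {question}\n"
    )


def run_program(code):
    """Execute generated code in a fresh namespace and return what it printed.

    exec() on model output is unsafe; a real pipeline should sandbox this step.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()


prompt = build_pot_prompt("What is the total?", "A bar chart with two bars.", "12 7")
generated = "total = 12 + 7\nprint(total)"  # stand-in for the model's reply to `prompt`
print(run_program(generated))  # 19
```

Because the LLM never sees the image, any error in the caption or OCR text propagates directly into the generated program.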

Frameworks

  • Chain-of-Thought (CoT)
  • Program-of-Thought (PoT)
  • self-consistency

Architectures

  • LLM
  • LMM (vision+language)
  • vision encoder + LLM pipeline

Collaboration

  • human-in-the-loop evaluation (AMT)

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Dataset annotations are heterogeneous across sources; only 85.6% of examples include original annotations (§B).
  • Test labels are withheld and evaluation is via an online platform, limiting full reproducibility of raw test scores (§2.4).
  • Coverage gaps remain: some visual/math problem types may be underrepresented (§B).

When Not To Use

  • Not suitable when your task is non-mathematical visual QA (MATHVISTA focuses on math reasoning).
  • Not for use as training data: training on benchmark answers leaks the benchmark (test labels are withheld).
  • Not the right benchmark for assessing pure vision tasks like object detection without math reasoning.

Failure Modes

  • Hallucination: models invent facts not present in image or question (§3.5).
  • Poor OCR or caption quality leading to wrong inputs for LLM pipelines (§G.5).
  • Weak shape or symbol detection for geometry tasks.
  • Calculation errors in generated explanations despite correct perceived inputs.
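
Of these failure modes, calculation errors are the easiest to check automatically. The sketch below scans an explanation for simple claims of the form `a op b = c` and flags the ones that do not hold. It is an illustrative heuristic rather than anything from the paper, and it handles only two-operand arithmetic on plain numbers.

```python
import operator
import re

_NUMBER = r"-?\d+(?:\.\d+)?"
_CLAIM = re.compile(rf"({_NUMBER})\s*([+\-*/])\s*({_NUMBER})\s*=\s*({_NUMBER})")
_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def find_calculation_errors(explanation, tol=1e-6):
    """Return the 'a op b = c' claims in the text whose arithmetic is wrong."""
    errors = []
    for match in _CLAIM.finditer(explanation):
        a, op, b, claimed = float(match[1]), match[2], float(match[3]), float(match[4])
        if op == "/" and b == 0:
            continue  # skip undefined divisions instead of raising
        if abs(_OPS[op](a, b) - claimed) > tol:
            errors.append(match.group(0))
    return errors


text = "The bars read 12 and 7, so 12 + 7 = 20. Doubling gives 20 * 2 = 40."
print(find_calculation_errors(text))  # ['12 + 7 = 20']
```

A check like this catches only arithmetic slips; it cannot tell whether the numbers were read correctly from the image, so it complements rather than replaces a hallucination review.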

Core Entities

Models

  • GPT-4V
  • Multimodal Bard
  • GPT-4
  • ChatGPT
  • Claude-2
  • LLaVA-LLaMA-2-13B
  • LLaMA-Adapter-V2-7B
  • InstructBLIP-Vicuna-7B
  • miniGPT-4-LLaMA-2-7B
  • mPLUG-Owl-LLaMA-7B
  • IDEFICS-9B-Instruct
  • LLaVAR

Metrics

  • Accuracy

Datasets

  • MATHVISTA
  • IQTest (new)
  • FunctionQA (new)
  • PaperQA (new)
  • Geometry3K
  • CLEVR-Math
  • IconQA
  • ChartQA
  • FigureQA
  • DVQA
  • PlotQA
  • TabMWP
  • SciBench
  • TheoremQA

Benchmarks

  • MATHVISTA