MATHVISTA: a 6k multimodal benchmark showing GPT-4V is strongest but still ~10 points behind humans

Overview

Decision Snapshot: Ready For Pilot

The paper gives thorough quantitative results and human evaluations showing consistent gaps and failure modes; evidence is strongest for accuracy metrics but limited by manual GPT-4V evaluation and withheld test labels.

Citations: 11

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, Jianfeng Gao

Links

Abstract / PDF / Data

Why It Matters For Business

MATHVISTA highlights where vision+math systems fail (OCR, shape detection, hallucination). Use it to benchmark assistants that read charts, analyze reports, or grade math-in-image tasks before deployment.

Who Should Care

ML Engineer, Data Scientist, CTO, Product Manager

Summary TLDR

MATHVISTA is a 6,141-example benchmark that measures math reasoning when problems include images (charts, geometry, plots, diagrams, natural scenes). The authors combine 28 prior multimodal datasets and add 3 new sets (IQTest, FunctionQA, PaperQA). They evaluate 12 leading models. GPT-4V leads at 49.9% accuracy, Multimodal Bard scores 34.8%, and human annotators score 60.3%. The benchmark exposes common failure modes: poor OCR/caption quality, hallucinations, shape detection gaps, and calculation errors. The paper also documents emergent behaviors in GPT-4V like self-verification and benefits from self-consistency.

Problem Statement

Current benchmarks test math mostly in text. Real math problems often need visual perception (plots, diagrams, tables) plus multi-step math. There was no consolidated, fine-grained benchmark to measure how foundation models combine vision and math reasoning. MATHVISTA fills that gap.

Main Contribution

A unified multimodal math benchmark (MATHVISTA) with 6,141 examples from 31 source datasets: 28 existing ones and 3 new ones (IQTest, FunctionQA, PaperQA).

Fine-grained metadata per example: task type, visual context, reasoning types, grade level, and answer formats.

Key Findings

GPT-4V is the best model but still below humans.

Numbers: GPT-4V 49.9% vs human 60.3% (gap 10.4 pp)

Practical Use: GPT-4V is the strongest off-the-shelf option today for multimodal math tasks, but its accuracy is still about 10 percentage points below that of human annotators; verify critical outputs.

Evidence Ref: Table 2; §3.3

GPT-4V substantially outperforms other multimodal models.

Numbers: GPT-4V 49.9% vs Multimodal Bard 34.8% (+15.1 pp)

Practical Use: If you must pick one model for visual math problems, GPT-4V gives a large accuracy boost over Bard and open-source LMMs.

Evidence Ref: Table 2; §3.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy (GPT-4V)	49.9%	Human 60.3%	-10.4pp vs human	MATHVISTA (testmini)	Table 2; §3.3	Table 2
Model gap vs next best	15.1pp	Multimodal Bard 34.8%	GPT-4V +15.1pp	MATHVISTA (testmini)	Abstract; Table 2	Table 2
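
The deltas in the table are differences in percentage points, not relative percentages. A minimal sketch of the arithmetic, using only the three accuracies reported above (the function and variable names are illustrative):

```python
# Accuracies on MATHVISTA testmini as reported in the table above (percent).
ACCURACY = {"GPT-4V": 49.9, "Multimodal Bard": 34.8, "Human": 60.3}


def gap_pp(a: str, b: str) -> float:
    """Absolute gap a - b in percentage points."""
    return round(ACCURACY[a] - ACCURACY[b], 1)


def gap_relative(a: str, b: str) -> float:
    """Gap a - b expressed as a percentage of b's accuracy."""
    return round(100 * (ACCURACY[a] - ACCURACY[b]) / ACCURACY[b], 1)


print(gap_pp("GPT-4V", "Human"))            # -10.4
print(gap_pp("GPT-4V", "Multimodal Bard"))  # 15.1
print(gap_relative("GPT-4V", "Human"))      # -17.2
```

Measured relative to the human score, GPT-4V is roughly 17% lower, so the "10" figure should be read as points of accuracy.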

What To Try In 7 Days

Run MATHVISTA testmini on your best multimodal model to get a baseline.

Compare raw LMM output vs LLM + caption/OCR augmented pipeline to find perception bottlenecks.

Measure hallucination rate in explanations; add a simple verifier or ensemble check for critical outputs.
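
The baseline and bottleneck steps above come down to scoring predictions against gold answers and slicing accuracy by category. A minimal sketch, assuming each example is a dict with the hypothetical keys `context`, `answer`, and `prediction` (these names and the sample records are illustrative, not the benchmark's actual schema or data):

```python
from collections import defaultdict


def normalize(ans: str) -> str:
    """Loose normalization so '12.0' and ' 12 ' compare equal."""
    text = ans.strip().lower()
    try:
        return str(float(text))
    except ValueError:
        return text


def is_correct(record: dict) -> bool:
    return normalize(record["prediction"]) == normalize(record["answer"])


def accuracy(records: list) -> float:
    """Overall fraction of correct predictions."""
    return sum(is_correct(r) for r in records) / len(records)


def accuracy_by(records: list, key: str) -> dict:
    """Accuracy per group, grouped by the value stored under `key`."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += is_correct(r)
    return {group: hits[group] / totals[group] for group in totals}


# Made-up records standing in for model outputs on a few examples.
records = [
    {"context": "chart", "answer": "12", "prediction": "12.0"},
    {"context": "chart", "answer": "0.5", "prediction": "0.4"},
    {"context": "geometry", "answer": "B", "prediction": "b"},
    {"context": "geometry", "answer": "45", "prediction": " 45 "},
]

print(accuracy(records))                # 0.75
print(accuracy_by(records, "context"))  # {'chart': 0.5, 'geometry': 1.0}
```

Running the same scorer over the raw LMM outputs and over the caption/OCR-augmented outputs, then comparing the per-group numbers, is one way to see where perception rather than reasoning is the limiting factor.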

Agent Features

Planning

self-verification (model inspects steps), self-consistency (sample reasoning trajectories)
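
Self-consistency is usually implemented as a majority vote over the final answers of several sampled reasoning trajectories. A minimal sketch, assuming a hypothetical `sample` callable that returns one final answer per call (a canned stub stands in for the model here):

```python
from collections import Counter
from typing import Callable, List


def self_consistent_answer(sample: Callable[[], str], n: int = 5) -> str:
    """Draw n answers and return the most frequent one."""
    answers: List[str] = [sample().strip() for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Stub sampler: replays fixed answers instead of calling a real model.
canned = iter(["14", "12", "14", "14", "9"])
result = self_consistent_answer(lambda: next(canned), n=5)
print(result)  # 14
```

With a real model, each call to `sample` would run one reasoning trajectory at non-zero temperature and extract its final answer; ties go to whichever answer was sampled first.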

Tool Use

OCR (EasyOCR), image captioning (Bard), program execution (PoT generates code)
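
In the augmented-LLM setup, a text-only model receives the image only through a caption and OCR output. A minimal sketch of how such a prompt could be assembled; the template wording and the function name are assumptions for illustration, not the paper's exact prompt, and the caption and OCR strings are passed in rather than produced by a real captioner or OCR engine:

```python
from typing import Sequence


def build_augmented_prompt(
    question: str, caption: str = "", ocr_text: Sequence[str] = ()
) -> str:
    """Combine optional caption and OCR strings with the question."""
    parts = []
    if caption:
        parts.append(f"Image caption: {caption}")
    if ocr_text:
        parts.append("Text detected in image: " + "; ".join(ocr_text))
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)


prompt = build_augmented_prompt(
    question="What is the highest bar value?",
    caption="A bar chart with three bars.",
    ocr_text=["Q1", "Q2", "Q3", "40"],
)
print(prompt)
```

Because the LLM never sees the pixels, any error in the caption or OCR strings flows directly into the answer, which is why this pipeline is useful for isolating perception failures.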

Frameworks

Chain-of-Thought (CoT), Program-of-Thought (PoT), self-consistency

Architectures

LLM, LMM (vision+language), vision encoder + LLM pipeline

Collaboration

human-in-the-loop evaluation (AMT)

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://mathvista.github.io

Risks & Boundaries

Limitations

Dataset annotations are heterogeneous across sources; only 85.6% of examples include original annotations (§B).

Test labels are withheld and evaluation is via an online platform, limiting full reproducibility of raw test scores (§2.4).

When Not To Use

Not suitable when your task is non-mathematical visual QA (MATHVISTA focuses on math reasoning).

Not suitable as a source of training data: test labels are withheld, and training on benchmark examples would leak the benchmark.

Failure Modes

Hallucination: models invent facts not present in image or question (§3.5).

Poor OCR or caption quality leading to wrong inputs for LLM pipelines (§G.5).

Core Entities

Models

GPT-4V, Multimodal Bard, GPT-4, ChatGPT, Claude-2, LLaVA-LLaMA-2-13B, LLaMA-Adapter-V2-7B, InstructBLIP-Vicuna-7B, miniGPT-4-LLaMA-2-7B, mPLUG-Owl-LLaMA-7B, IDEFICS-9B-Instruct, LLaVAR

Metrics

Accuracy

Datasets

MATHVISTA, IQTest (new), FunctionQA (new), PaperQA (new), Geometry3K, CLEVR-Math, IconQA, ChartQA, FigureQA, DVQA, PlotQA, TabMWP, SciBench, TheoremQA

Benchmarks

MATHVISTA
