MATHVISTA: a 6k multimodal benchmark showing GPT-4V is strongest but still ~10 points behind humans

October 3, 2023 | 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper gives thorough quantitative results and human evaluations showing consistent gaps and failure modes; evidence is strongest for accuracy metrics but limited by manual GPT-4V evaluation and withheld test labels.

Citations: 11

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, Jianfeng Gao

Links

Abstract / PDF / Data

Why It Matters For Business

MATHVISTA highlights where vision+math systems fail (OCR, shape detection, hallucination). Use it to benchmark assistants that read charts, analyze reports, or grade math-in-image tasks before deployment.

Who Should Care

Summary TLDR

MATHVISTA is a 6,141-example benchmark that measures math reasoning when problems include images (charts, geometry, plots, diagrams, natural scenes). The authors combine 28 prior multimodal datasets and add 3 new sets (IQTest, FunctionQA, PaperQA). They evaluate 12 leading models. GPT-4V leads at 49.9% accuracy, Multimodal Bard scores 34.8%, and human annotators score 60.3%. The benchmark exposes common failure modes: poor OCR/caption quality, hallucinations, shape detection gaps, and calculation errors. The paper also documents emergent behaviors in GPT-4V like self-verification and benefits from self-consistency.

Problem Statement

Current benchmarks test math mostly in text. Real math problems often need visual perception (plots, diagrams, tables) plus multi-step math. There was no consolidated, fine-grained benchmark to measure how foundation models combine vision and math reasoning. MATHVISTA fills that gap.

Main Contribution

A unified multimodal math benchmark (MATHVISTA) with 6,141 examples drawn from 31 datasets: 28 existing ones plus 3 new ones (IQTest, FunctionQA, PaperQA).

Fine-grained metadata per example: task type, visual context, reasoning types, grade level, and answer formats.
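The per-example metadata is what makes a breakdown by category possible. The following is a minimal sketch of how such a breakdown could be computed, not the authors' evaluation code. The `Example` field names, the sample values, and the exact-match scoring rule are all assumptions made for illustration; the released data may use a different schema and answer-matching logic.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Example:
    # Hypothetical field names mirroring the metadata categories listed above.
    task: str
    visual_context: str
    reasoning_types: Tuple[str, ...]
    grade_level: str
    answer_format: str
    answer: str


def accuracy_by(
    examples: List[Example],
    predictions: List[str],
    key: Callable[[Example], Iterable[str]],
) -> Dict[str, float]:
    """Return {group: accuracy}; key(example) yields one or more group labels."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for ex, pred in zip(examples, predictions):
        # Assumed scoring rule: case-insensitive exact match on the final answer.
        correct = pred.strip().lower() == ex.answer.strip().lower()
        for group in key(ex):
            totals[group] += 1
            hits[group] += int(correct)
    return {group: hits[group] / totals[group] for group in totals}


# Made-up records, only to show the shape of the computation.
examples = [
    Example("geometry problem solving", "geometry diagram",
            ("geometry", "algebraic"), "high school", "multiple choice", "B"),
    Example("figure QA", "bar chart",
            ("statistical",), "not applicable", "integer", "12"),
    Example("figure QA", "bar chart",
            ("statistical", "arithmetic"), "not applicable", "integer", "7"),
]
predictions = ["b", "12", "9"]

by_reasoning = accuracy_by(examples, predictions, lambda ex: ex.reasoning_types)
by_context = accuracy_by(examples, predictions, lambda ex: (ex.visual_context,))
```

An example with several reasoning types counts toward each of them, so the per-group totals can sum to more than the number of examples.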

Key Findings

GPT-4V is the best model but still below humans.

Numbers: GPT-4V 49.9% vs human 60.3% (gap 10.4 pp)

Practical Use: GPT-4V is the strongest off-the-shelf option today for multimodal math tasks, but expect accuracy about 10 percentage points below careful human answers; verify critical outputs.

Evidence Ref: Table 2; §3.3

GPT-4V substantially outperforms other multimodal models.

Numbers: GPT-4V 49.9% vs Multimodal Bard 34.8% (+15.1 pp)

Practical Use: If you must pick one model for visual math problems, GPT-4V gives a large accuracy boost over Bard and open-source LMMs.

Evidence Ref: Table 2; §3.3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 49.9% | Human 60.3% | -10.4 pp vs human | MATHVISTA (testmini) | Table 2; §3.3 | Table 2
Model gap vs next best | 15.1 pp | Multimodal Bard 34.8% | GPT-4V +15.1 pp | MATHVISTA (testmini) | Abstract; Table 2 | Table 2
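The deltas in the table are differences in percentage points, which is not the same as a relative change. A short sketch of the arithmetic, using only the three accuracy figures reported above (the helper names are illustrative):

```python
def gap_pp(score: float, baseline: float) -> float:
    """Absolute difference in percentage points."""
    return round(score - baseline, 1)


def relative_gap(score: float, baseline: float) -> float:
    """Difference expressed as a percentage of the baseline."""
    return round(100 * (score - baseline) / baseline, 1)


GPT4V, BARD, HUMAN = 49.9, 34.8, 60.3  # accuracies in percent, from the table

print(gap_pp(GPT4V, HUMAN))        # -10.4
print(gap_pp(GPT4V, BARD))         # 15.1
print(relative_gap(GPT4V, HUMAN))  # -17.2
print(relative_gap(GPT4V, BARD))   # 43.4
```

So the 10.4-point shortfall against humans corresponds to roughly 17% lower accuracy in relative terms.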

What To Try In 7 Days

Run MATHVISTA testmini on your best multimodal model to get a baseline.

Compare raw LMM output vs LLM + caption/OCR augmented pipeline to find perception bottlenecks.

Measure hallucination rate in explanations; add a simple verifier or ensemble check for critical outputs.
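One simple form of the ensemble check in the last item is self-consistency: sample several reasoning trajectories for the same question, take the majority final answer, and route low-agreement cases to a human. The sketch below assumes the final answers have already been extracted from the sampled responses; the function names and the 0.6 agreement threshold are hypothetical choices, not values from the paper.

```python
from collections import Counter
from typing import List, Tuple


def self_consistent_answer(samples: List[str]) -> Tuple[str, float]:
    """Majority vote over sampled final answers.

    Returns (winning answer, fraction of samples that agree with it).
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    normalized = [s.strip().lower() for s in samples]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


def needs_review(samples: List[str], min_agreement: float = 0.6) -> bool:
    """Flag a question for human review when the samples disagree too much.

    min_agreement is an assumed threshold; tune it on your own data.
    """
    _, agreement = self_consistent_answer(samples)
    return agreement < min_agreement


answer, agreement = self_consistent_answer(["12", "12 ", "14", "12", "9"])
print(answer, agreement)               # 12 0.6
print(needs_review(["a", "b", "c"]))   # True
```

Agreement is only a proxy: a model can be consistently wrong, so this catches unstable answers rather than all errors.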

Agent Features

Planning
self-verification (model inspects steps); self-consistency (sample reasoning trajectories)
Tool Use
OCR (EasyOCR); image captioning (Bard); program-execution (PoT generates code)
Frameworks
Chain-of-Thought (CoT); Program-of-Thought (PoT); self-consistency
Architectures
LLM; LMM (vision+language); vision encoder + LLM pipeline
Collaboration
human-in-the-loop evaluation (AMT)

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Dataset annotations are heterogeneous across sources; only 85.6% of examples include original annotations (§B).

Test labels are withheld and evaluation is via an online platform, limiting full reproducibility of raw test scores (§2.4).

When Not To Use

Not suitable when your task is non-mathematical visual QA (MATHVISTA focuses on math reasoning).

Avoid using test set answers for model training to prevent benchmark leakage (test labels withheld).

Failure Modes

Hallucination: models invent facts not present in image or question (§3.5).

Poor OCR or caption quality leading to wrong inputs for LLM pipelines (§G.5).
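To see how often caption or OCR inputs help versus hurt, per-example results from a raw LMM run and a caption/OCR-augmented LLM run can be compared pairwise. This is a minimal sketch under the assumption that both runs cover the same examples in the same order and have already been scored as correct or incorrect; the function and bucket names are illustrative.

```python
from collections import Counter
from typing import Dict, List


def compare_pipelines(
    raw_correct: List[bool], augmented_correct: List[bool]
) -> Dict[str, int]:
    """Bucket examples by whether each pipeline answered correctly."""
    if len(raw_correct) != len(augmented_correct):
        raise ValueError("both runs must cover the same examples")
    labels = {
        (True, True): "both_right",
        (False, True): "fixed_by_augmentation",
        (True, False): "broken_by_augmentation",
        (False, False): "both_wrong",
    }
    counts = Counter(
        labels[(bool(r), bool(a))] for r, a in zip(raw_correct, augmented_correct)
    )
    # Report every bucket, including empty ones, in a fixed order.
    return {name: counts.get(name, 0) for name in labels.values()}


buckets = compare_pipelines(
    [True, False, True, False],
    [True, True, False, False],
)
print(buckets)
```

Examples in the `broken_by_augmentation` bucket are good candidates for inspecting the OCR text or caption that was fed to the LLM.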

Core Entities

Models

GPT-4V, Multimodal Bard, GPT-4, ChatGPT, Claude-2, LLaVA-LLaMA-2-13B, LLaMA-Adapter-V2-7B, InstructBLIP-Vicuna-7B, miniGPT-4-LLaMA-2-7B, mPLUG-Owl-LLaMA-7B, IDEFICS-9B-Instruct, LLaVAR

Metrics

Accuracy

Datasets

MATHVISTA, IQTest (new), FunctionQA (new), PaperQA (new), Geometry3K, CLEVR-Math, IconQA, ChartQA, FigureQA, DVQA, PlotQA, TabMWP, SciBench, TheoremQA

Benchmarks

MATHVISTA