Overview
The paper gives thorough quantitative results and human evaluations showing consistent gaps and failure modes; evidence is strongest for accuracy metrics but limited by manual GPT-4V evaluation and withheld test labels.
Citations: 11
Evidence Strength: 0.85
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
MATHVISTA highlights where vision+math systems fail (OCR, shape detection, hallucination). Use it to benchmark assistants that read charts, analyze reports, or grade math-in-image tasks before deployment.
Summary TLDR
MATHVISTA is a 6,141-example benchmark that measures math reasoning when problems include images (charts, geometry, plots, diagrams, natural scenes). The authors combine 28 prior multimodal datasets and add 3 new sets (IQTest, FunctionQA, PaperQA). They evaluate 12 leading models. GPT-4V leads at 49.9% accuracy, Multimodal Bard scores 34.8%, and human annotators score 60.3%. The benchmark exposes common failure modes: poor OCR/caption quality, hallucinations, shape detection gaps, and calculation errors. The paper also documents emergent behaviors in GPT-4V like self-verification and benefits from self-consistency.
Problem Statement
Current benchmarks test math mostly in text. Real math problems often need visual perception (plots, diagrams, tables) plus multi-step math. There was no consolidated, fine-grained benchmark to measure how foundation models combine vision and math reasoning. MATHVISTA fills that gap.
Main Contribution
A unified multimodal math benchmark (MATHVISTA) with 6,141 examples drawn from 31 source datasets: 28 existing multimodal datasets and 3 newly created ones (IQTest, FunctionQA, PaperQA).
Fine-grained metadata per example: task type, visual context, reasoning types, grade level, and answer formats.
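To illustrate how per-example metadata supports fine-grained analysis, here is a minimal Python sketch that groups correctness by one metadata field. The `Example` class, its field names, and `accuracy_by` are hypothetical and only mirror the metadata categories listed above; they are not the benchmark's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Example:
    # Hypothetical field names that mirror the metadata categories above.
    task_type: str
    visual_context: str
    reasoning_types: List[str]
    grade_level: str
    answer_format: str
    correct: bool  # whether a model's answer matched the gold answer


def accuracy_by(examples: List[Example], field: str) -> Dict[str, float]:
    """Return {field value: accuracy} for a single-valued metadata field."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ex in examples:
        key = getattr(ex, field)
        totals[key] += 1
        hits[key] += int(ex.correct)
    return {key: hits[key] / totals[key] for key in totals}
```

Slicing accuracy this way (for instance by `visual_context`) is one way to see whether a model's errors cluster on charts, geometry diagrams, or another image type.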
Key Findings
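GPT-4V is the best model at 49.9% accuracy but still trails human performance (60.3%) by 10.4 percentage points.
GPT-4V substantially outperforms other multimodal models, leading the next best, Multimodal Bard (34.8%), by 15.1 percentage points.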
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 49.9% | Human 60.3% | -10.4pp vs human | MATHVISTA (testmini) | Table 2; §3.3 | Table 2 |
| Model gap vs next best | 15.1pp | Multimodal Bard 34.8% | GPT-4V +15.1pp | MATHVISTA (testmini) | Abstract; Table 2 | Table 2 |
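The deltas in the table are plain differences in percentage points. A short Python check using the three accuracies reported above (the helper name `delta_pp` is illustrative):

```python
def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Difference between two accuracies, in percentage points."""
    # Round to one decimal to match the table and hide float noise.
    return round(value_pct - baseline_pct, 1)


gpt4v, human, bard = 49.9, 60.3, 34.8

print(delta_pp(gpt4v, human))  # -10.4 (GPT-4V vs human)
print(delta_pp(gpt4v, bard))   # 15.1 (GPT-4V vs Multimodal Bard)
```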
What To Try In 7 Days
Run MATHVISTA testmini on your best multimodal model to get a baseline.
Compare raw LMM output against an LLM pipeline augmented with captions and OCR text to find perception bottlenecks.
Measure hallucination rate in explanations; add a simple verifier or ensemble check for critical outputs.
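The three steps above can be sketched in Python as follows. This is a rough outline, not the paper's evaluation code: it assumes you already have gold answers and model predictions as dictionaries keyed by example ID, with answers normalized to strings. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def accuracy(preds: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of gold examples answered exactly right (missing = wrong)."""
    correct = sum(preds.get(pid) == ans for pid, ans in gold.items())
    return correct / len(gold)


def perception_suspects(
    raw: Dict[str, str], augmented: Dict[str, str], gold: Dict[str, str]
) -> List[str]:
    """IDs the raw LMM misses but the caption/OCR-augmented pipeline gets right."""
    return sorted(
        pid
        for pid, ans in gold.items()
        if raw.get(pid) != ans and augmented.get(pid) == ans
    )


def majority_vote(samples: List[str]) -> Tuple[str, float]:
    """Most common sampled answer and the share of samples that agree with it."""
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)
```

Examples returned by `perception_suspects` are candidates for a perception bottleneck, since better visual input alone fixed them. `majority_vote` is only a cheap agreement check over repeated samples; low agreement can flag outputs for review, but it does not by itself detect hallucinated explanations.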
Risks & Boundaries
Limitations
Dataset annotations are heterogeneous across sources; only 85.6% of examples include original annotations (§B).
Test labels are withheld and evaluation is via an online platform, limiting full reproducibility of raw test scores (§2.4).
When Not To Use
Not suitable when your task is non-mathematical visual QA (MATHVISTA focuses on math reasoning).
Not suitable as a source of training data: test labels are withheld specifically to prevent benchmark leakage.
Failure Modes
Hallucination: models invent facts not present in image or question (§3.5).
Poor OCR or caption quality feeds wrong inputs to LLM pipelines (§G.5).

