Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
11
Why It Matters For Business
MATHVISTA highlights where vision+math systems fail (OCR, shape detection, hallucination). Use it to benchmark assistants that read charts, analyze reports, or grade math-in-image tasks before deployment.
Summary TLDR
MATHVISTA is a 6,141-example benchmark that measures math reasoning when problems include images (charts, geometry, plots, diagrams, natural scenes). The authors combine 28 prior multimodal datasets and add 3 new sets (IQTest, FunctionQA, PaperQA). They evaluate 12 leading models. GPT-4V leads at 49.9% accuracy, Multimodal Bard scores 34.8%, and human annotators score 60.3%. The benchmark exposes common failure modes: poor OCR/caption quality, hallucinations, shape detection gaps, and calculation errors. The paper also documents emergent behaviors in GPT-4V like self-verification and benefits from self-consistency.
Problem Statement
Current benchmarks test math mostly in text. Real math problems often need visual perception (plots, diagrams, tables) plus multi-step math. There was no consolidated, fine-grained benchmark to measure how foundation models combine vision and math reasoning. MATHVISTA fills that gap.
Main Contribution
A unified multimodal math benchmark (MATHVISTA) with 6,141 examples from 31 sources: 28 existing multimodal datasets plus 3 new datasets (IQTest, FunctionQA, PaperQA).
Fine-grained metadata per example: task type, visual context, reasoning types, grade level, and answer formats.
A large evaluation of 12 foundation models, including manual GPT-4V evaluation plus analyses of hallucination, self-verification, and self-consistency.
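The per-example metadata is what makes fine-grained breakdowns possible. A minimal sketch of how such a record could be represented and sliced follows; the class name, field names, and example values are hypothetical and are not the benchmark's official schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Example:
    """One benchmark item. Field names are illustrative, not the official schema."""
    question: str
    answer: str
    task: str                      # e.g. "figure question answering"
    visual_context: str            # e.g. "bar chart", "geometry diagram"
    reasoning: List[str] = field(default_factory=list)  # e.g. ["arithmetic"]
    grade: str = "unknown"         # e.g. "high school"
    answer_format: str = "text"    # e.g. "integer", "multiple choice"


def count_by(examples, attribute):
    """Count examples per value of one metadata attribute."""
    return Counter(getattr(ex, attribute) for ex in examples)


if __name__ == "__main__":
    toy = [
        Example("q1", "1", "figure QA", "bar chart"),
        Example("q2", "2", "figure QA", "bar chart"),
        Example("q3", "3", "geometry", "geometry diagram"),
    ]
    print(count_by(toy, "visual_context"))
```

Grouping on a single attribute like this is enough to report accuracy per visual context or per grade level once model outputs are attached to each record.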
Key Findings
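GPT-4V is the best model at 49.9% accuracy but still trails human annotators (60.3%) by 10.4 points.
GPT-4V substantially outperforms other multimodal models: it leads the runner-up, Multimodal Bard (34.8%), by 15.1 points.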
Text-only LLMs and simple visual augmentations fall far short.
Dataset size and composition: 6,141 examples drawn from 28 existing datasets and 3 new ones.
Hallucination is a major failure mode for generative LMMs.
Emergent verification behaviors in GPT-4V can help.
Results
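Accuracy: GPT-4V 49.9% (Multimodal Bard 34.8%, humans 60.3%)
Model gap vs next best: 15.1 percentage points (GPT-4V 49.9% vs Multimodal Bard 34.8%)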
Text-only best (CoT GPT-4)
Augmented LLM (PoT GPT-4 + captions+OCR)
Dataset size: 6,141 examples
Who Should Care
What To Try In 7 Days
Run MATHVISTA testmini on your best multimodal model to get a baseline.
Compare raw LMM output vs LLM + caption/OCR augmented pipeline to find perception bottlenecks.
Measure hallucination rate in explanations; add a simple verifier or ensemble check for critical outputs.
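For the second step, one way to locate perception bottlenecks is to compare accuracy per visual context between the two pipelines. The sketch below assumes each result is a dict with a boolean `correct` field and a `visual_context` field; both names are hypothetical placeholders for whatever your evaluation harness emits.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key):
    """Return {group: accuracy} for records carrying 'correct' and group_key."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        hits[group] += int(rec["correct"])
    return {group: hits[group] / totals[group] for group in totals}


def accuracy_gaps(lmm_records, augmented_records, group_key="visual_context"):
    """Per-group accuracy of the augmented pipeline minus the raw LMM.

    Only groups present in both result sets are reported.
    """
    lmm = accuracy_by_group(lmm_records, group_key)
    aug = accuracy_by_group(augmented_records, group_key)
    return {group: aug[group] - lmm[group] for group in lmm if group in aug}


if __name__ == "__main__":
    lmm = [
        {"visual_context": "table", "correct": False},
        {"visual_context": "table", "correct": True},
        {"visual_context": "geometry", "correct": True},
    ]
    aug = [
        {"visual_context": "table", "correct": True},
        {"visual_context": "table", "correct": True},
        {"visual_context": "geometry", "correct": False},
    ]
    print(accuracy_gaps(lmm, aug))
```

A clearly positive gap for a context suggests the external caption/OCR step is supplying information the LMM fails to read on its own; a negative gap suggests the text pipeline is losing visual detail. With small groups the gaps are noisy, so treat them as pointers for manual inspection rather than conclusions.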
Agent Features
Planning
- self-verification (model inspects steps)
- self-consistency (sample reasoning trajectories)
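Self-consistency can be sketched as a majority vote over the final answers of several sampled reasoning paths. The function below is a minimal illustration that assumes the final answers have already been extracted as strings; sampling from the model and answer extraction are out of scope, and the function name is hypothetical.

```python
from collections import Counter


def self_consistent_answer(sample_answers):
    """Majority vote over final answers from sampled reasoning paths.

    Returns (answer, agreement), where agreement is the fraction of
    non-empty samples that voted for the winning answer.
    """
    cleaned = [a.strip().lower() for a in sample_answers if a and a.strip()]
    if not cleaned:
        return None, 0.0
    answer, votes = Counter(cleaned).most_common(1)[0]
    return answer, votes / len(cleaned)


if __name__ == "__main__":
    print(self_consistent_answer(["12", " 12 ", "15", "12", ""]))
```

The agreement fraction doubles as a rough confidence signal: low agreement across samples is a reasonable trigger for routing an item to a verifier or a human reviewer. Ties are broken by whichever answer was seen first.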
Tool Use
- OCR (EasyOCR)
- image captioning (Bard)
- program-execution (PoT generates code)
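The program-execution step in a PoT pipeline can be sketched as follows: the model emits a short Python program, which is run in a separate interpreter process and whose printed output is taken as the answer. This is an assumption-laden illustration, not the paper's harness; the function name and timeout are hypothetical, and a subprocess is not a security sandbox, so untrusted code needs stronger isolation.

```python
import subprocess
import sys


def run_generated_program(code, timeout_s=10.0):
    """Run model-generated Python in a child process and return its stdout.

    Returns None if the program exits with an error or exceeds the timeout.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    generated = "width, height = 4, 5\nprint(width * height)"
    print(run_generated_program(generated))
```

Returning None on failure lets the caller fall back to a different strategy, for example a plain chain-of-thought answer, when the generated program does not run.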
Frameworks
- Chain-of-Thought (CoT)
- Program-of-Thought (PoT)
- self-consistency
Architectures
- LLM
- LMM (vision+language)
- vision encoder + LLM pipeline
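In the augmented-LLM setup, a text-only model never sees the image; it receives a caption and OCR text instead. A minimal sketch of assembling such a prompt is shown below. The template wording, function name, and confidence threshold are illustrative assumptions, and the OCR input is assumed to be a list of (bounding box, text, confidence) tuples, which is the shape EasyOCR's `readtext` returns by default.

```python
def build_augmented_prompt(question, caption, ocr_results, min_confidence=0.5):
    """Build a text-only prompt from a caption and OCR detections.

    ocr_results: iterable of (bounding_box, text, confidence) tuples.
    Detections below min_confidence are dropped to limit noisy inputs.
    """
    kept = [text for _box, text, conf in ocr_results if conf >= min_confidence]
    ocr_line = "; ".join(kept) if kept else "(none detected)"
    return (
        f"Image caption: {caption}\n"
        f"Text detected in image: {ocr_line}\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    detections = [(None, "12", 0.9), (None, "??", 0.2), (None, "30", 0.8)]
    print(build_augmented_prompt(
        "What is the total?", "A bar chart with two bars.", detections
    ))
```

Because the LLM can only reason over what the caption and OCR provide, any error at this stage propagates directly into the answer, which is why caption and OCR quality appears among the failure modes.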
Collaboration
- human-in-the-loop evaluation (AMT)
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Dataset annotations are heterogeneous across sources; only 85.6% of examples include original annotations (§B).
- Test labels are withheld and evaluation is via an online platform, limiting full reproducibility of raw test scores (§2.4).
- Coverage gaps remain: some visual/math problem types may be underrepresented (§B).
When Not To Use
- Not suitable when your task is non-mathematical visual QA (MATHVISTA focuses on math reasoning).
- Not for use as training data: training on benchmark examples leaks the benchmark and invalidates scores (test labels are withheld).
- Not the right benchmark for assessing pure vision tasks like object detection without math reasoning.
Failure Modes
- Hallucination: models invent facts not present in image or question (§3.5).
- Poor OCR or caption quality leading to wrong inputs for LLM pipelines (§G.5).
- Weak shape or symbol detection for geometry tasks.
- Calculation errors in generated explanations despite correct perceived inputs.
Core Entities
Models
- GPT-4V
- Multimodal Bard
- GPT-4
- ChatGPT
- Claude-2
- LLaVA-LLaMA-2-13B
- LLaMA-Adapter-V2-7B
- InstructBLIP-Vicuna-7B
- miniGPT-4-LLaMA-2-7B
- mPLUG-Owl-LLaMA-7B
- IDEFICS-9B-Instruct
- LLaVAR
Metrics
- Accuracy
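Accuracy here means the fraction of examples whose predicted answer matches the reference. The sketch below shows one simplified way to compute it with light answer normalization so that "1,200" and "1200" or "B" and "b" compare equal; the normalization rules are an assumption and may differ from the benchmark's own scoring.

```python
def normalize(answer):
    """Canonicalize an answer; numeric strings are compared by value."""
    text = str(answer).strip().lower()
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return text


def accuracy(predictions, references):
    """Fraction of predictions that match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


if __name__ == "__main__":
    preds = ["1,200", "B", "3.50", "7"]
    refs = ["1200", "b", "3.5", "8"]
    print(accuracy(preds, refs))  # 0.75
```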
Datasets
- MATHVISTA
- IQTest (new)
- FunctionQA (new)
- PaperQA (new)
- Geometry3K
- CLEVR-Math
- IconQA
- ChartQA
- FigureQA
- DVQA
- PlotQA
- TabMWP
- SciBench
- TheoremQA
Benchmarks
- MATHVISTA

