Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.3
Citation Count
1
Why It Matters For Business
Plot2Code measures a practical skill: turning chart images into runnable plotting code. Use it to stress-test models intended to automate report or dashboard generation before a production rollout.
Summary TLDR
Plot2Code is a targeted benchmark that measures how well multi-modal LLMs (MLLMs) can read a plot image and produce executable matplotlib code that recreates it. The authors release 132 curated plot examples (293 subplots) with reference code and GPT-4-written plain-language instructions. They propose three automatic metrics (code pass rate, text-match ratio, and a GPT-4V visual rating) and evaluate 14 MLLMs. Results show that even top models (GPT-4V, Claude-3, Gemini-Pro) leave clear gaps: visual fidelity, text accuracy, and handling of text-dense plots remain hard. The dataset and evaluation scripts are intended to help improve vision+code capabilities.
Problem Statement
Current multimodal LLM benchmarks measure image understanding but not the ability to convert plot images into executable plotting code. We need a fair, automatic benchmark that tests image+text inputs and measures both code correctness and visual fidelity.
Main Contribution
A curated Plot2Code test set: 132 high-quality matplotlib examples (293 total subplots) with reference code and GPT-4–summarized instructions.
Two evaluation settings: Direct Asking (image only) and Conditional Asking (image plus a plain-language instruction, or the instruction alone for text-only LLMs).
A metric suite: code pass rate (executable), text-match ratio (text fidelity), and GPT-4V overall visual rating (1–10).
A benchmark study over 14 MLLMs (closed- and open-source) and ablations on prompts, OCR tokens, and image resolution.
Public release of the dataset on HuggingFace for reproducibility and follow-up work.
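The code pass rate in the metric suite above can be approximated with a small harness. The following is a minimal sketch, not the authors' evaluation script: it assumes each model output is a standalone Python snippet, and the helper names (runs_cleanly, code_pass_rate) and the 30-second timeout are illustrative choices. Each snippet runs in a fresh interpreter so that one crash cannot affect the others.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def runs_cleanly(code: str, timeout_s: float = 30.0) -> bool:
    """Return True if the snippet exits with status 0 in a fresh interpreter."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        # MPLBACKEND=Agg lets matplotlib render without a display (headless runs).
        env = {**os.environ, "MPLBACKEND": "Agg"}
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                env=env,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # assumption: a hung snippet counts as a failure
        return result.returncode == 0


def code_pass_rate(snippets: list[str]) -> float:
    """Fraction of snippets that execute without error (0.0 for an empty list)."""
    if not snippets:
        return 0.0
    return sum(runs_cleanly(s) for s in snippets) / len(snippets)
```

Running untrusted model output this way is only reasonable inside a sandbox or container. A passing snippet says nothing about whether the rendered plot is correct, which is why the pass rate is paired with the other two metrics.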
Key Findings
Dataset size and complexity are modest but varied
Top closed-source MLLMs still score below human-level visual code reproduction
High code pass rates can coexist with imperfect visual/text fidelity
Open-source MLLMs lag behind closed-source counterparts
GPT-4V judgement correlates with human evaluation
Traditional pixel metrics (MSE, SSIM) fail to capture meaningful visual differences between plots
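The last finding can be illustrated with a toy case. The sketch below is not the paper's experiment: it builds three synthetic 100x100 grayscale "plots" from plain Python lists and compares them with a hand-written MSE. The helper names and the image contents are assumptions made for illustration only.

```python
def mse(a, b):
    """Mean squared error between two equally sized grayscale images (lists of rows)."""
    n = len(a) * len(a[0])
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)) / n


def line_image(size, mirrored=False, background=1.0, ink=0.0):
    """Draw a one-pixel-wide diagonal line on a flat background."""
    img = [[background] * size for _ in range(size)]
    for x in range(size):
        y = size - 1 - x if mirrored else x
        img[y][x] = ink
    return img


SIZE = 100
reference = line_image(SIZE)
wrong_trend = line_image(SIZE, mirrored=True)    # opposite slope, same styling
dimmed_copy = line_image(SIZE, background=0.8)   # same line, slightly gray background

mse_wrong = mse(reference, wrong_trend)
mse_dimmed = mse(reference, dimmed_copy)
print(f"wrong trend: {mse_wrong:.4f}, dimmed copy: {mse_dimmed:.4f}")
```

In this toy setup the plot with the opposite trend scores a lower MSE (about 0.02) than the copy that shows the same data on a grayer background (about 0.04), because most pixels in a plot are background. A pixel metric therefore ranks the wrong plot as the closer match.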
Results
GPT-4V overall rating (conditional asking)
Code pass rate (GPT-4V, conditional asking)
Text-match ratio (GPT-4V, conditional asking)
Best open-source model (Mini-Gemini-8x7B-HD) rating (conditional)
Who Should Care
What To Try In 7 Days
Run Plot2Code on your model to measure code pass rate and text-match separately.
Add OCR tokens or higher-resolution images to your model pipeline and compare pass rate improvements.
Use GPT-4V judgement to triage outputs that need human review for visual/text errors.
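To report text-match separately from pass rate, as the first step above suggests, something like the following could be used. This is a sketch under an assumed definition: matched text elements divided by the larger of the two element counts, so that both missing and extra text lower the score. The paper's exact formula may differ, and the function names are hypothetical. The inputs are lists of strings already extracted from the reference and generated figures (titles, axis labels, tick labels, legend entries).

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())


def text_match_ratio(reference_texts, generated_texts):
    """Share of text elements that match between reference and generated plots.

    Assumed definition (not taken from the paper): matched elements divided by
    the larger of the two element counts.
    """
    ref = Counter(normalize(t) for t in reference_texts if t.strip())
    gen = Counter(normalize(t) for t in generated_texts if t.strip())
    total = max(sum(ref.values()), sum(gen.values()))
    if total == 0:
        return 1.0  # neither plot contains text, so there is nothing to mismatch
    matched = sum((ref & gen).values())  # multiset intersection keeps duplicates honest
    return matched / total
```

For example, a reference with the texts "Sales by Quarter", "Quarter", "Revenue (USD)", "Q1" against a generated plot with "sales by  quarter", "Quarter", "Revenue", "Q1", "Legend" matches three elements out of five, giving 0.6.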
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Small curated test set (132 examples) limits coverage of all plot styles.
- Plots are limited to matplotlib examples from the official gallery; other plotting libraries are not covered.
- Instructions and the automatic judge rely on GPT-4/GPT-4V, which may introduce judge bias or leakage.
- Some evaluation components depend on OCR and image resolution; results are sensitive to vision encoder quality.
When Not To Use
- Do not use as the only criterion for production model selection; pair with manual checks.
- Not suitable for interactive, multi-turn code refinement workflows without extension.
- Not meant for non-matplotlib plotting ecosystems or highly domain-specific visualizations.
Failure Modes
- Automatic judge bias: GPT-4V may favor certain styles or colors.
- OCR failures can lower text-match scores even when plots look correct.
- Models may produce syntactically runnable code that renders visually incorrect plots.
- Image resolution limits can hide fine-grained errors.
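Because runnable code can still render a wrong plot, the three signals are more useful combined than alone. The sketch below shows one way to flag outputs for human review. The PlotResult record, the review_reasons helper, and both thresholds (0.8 text match, rating 7) are hypothetical values chosen for illustration, not recommendations from the paper.

```python
from dataclasses import dataclass


@dataclass
class PlotResult:
    name: str
    executed: bool      # did the generated code run without error?
    text_match: float   # text-match ratio in [0.0, 1.0]
    judge_rating: int   # visual judge rating on the 1-10 scale


def review_reasons(result, min_text_match=0.8, min_rating=7):
    """Return the reasons a result should go to a human reviewer (empty list if none)."""
    if not result.executed:
        return ["code failed to run"]  # nothing was rendered, other scores are moot
    reasons = []
    if result.text_match < min_text_match:
        reasons.append("low text match")
    if result.judge_rating < min_rating:
        reasons.append("low judge rating")
    return reasons
```

Given the judge-bias and OCR failure modes listed above, flagged items are candidates for manual inspection, not automatic rejections.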
Core Entities
Models
- GPT-4V
- GPT-4
- Gemini-Pro
- Claude-3-Opus
- Claude-3-Sonnet
- Mini-Gemini-8x7B-HD
- Mini-Gemini-34B-HD
- Mini-Gemini-8x7B
- Mini-Gemini-2B
- DeepSeek-VL-7B
- LLaVA-1.6-Mistral-7B
- LLaVA-1.6-34B
Metrics
- code pass rate
- text-match ratio
- GPT-4V overall rating
- CLIP-Score
- MSE
- SSIM
Datasets
- Plot2Code
Benchmarks
- HumanEval
- MBPP
- MMCode
- Design2Code

