Overview
The benchmark provides clear, reproducible tests and metrics for plot-to-code tasks but needs broader scale and community adoption before replacing human evaluation.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 30%
Production readiness: 40%
Novelty: 60%
Why It Matters For Business
Plot2Code measures real-world value: turning charts into runnable plotting code. Use it to stress-test models that will automate report or dashboard generation before production rollout.
Who Should Care
Summary TLDR
Plot2Code is a targeted benchmark that measures how well multimodal LLMs (MLLMs) can read a plot image and produce executable matplotlib code that recreates it. The authors release 132 curated plot examples (293 subplots) with reference code and plain-language instructions written by GPT-4. They propose three automatic metrics (code pass rate, text-match ratio, and a GPT-4V visual rating) and evaluate 14 MLLMs. Results show that even the top models (GPT-4V, Claude-3, Gemini-Pro) leave clear gaps: visual fidelity, text accuracy, and handling of text-dense plots remain hard. The dataset and evaluation scripts aim to help improve vision+code capabilities.
Problem Statement
Current multimodal LLM benchmarks measure image understanding but not the ability to convert plot images into executable plotting code. We need a fair, automatic benchmark that tests image+text inputs and measures both code correctness and visual fidelity.
Main Contribution
A curated Plot2Code test set: 132 high-quality matplotlib examples (293 total subplots) with reference code and GPT-4–summarized instructions.
Two evaluation settings: Direct Asking (the model receives only the plot image) and Conditional Asking (the model receives the image plus a plain-language instruction; text-only LLMs receive the instruction alone).
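The two settings listed above differ only in what is handed to the model. The following is a minimal sketch of that difference, not the benchmark's own harness: the `Request` type, the `build_request` helper, and the wording of `BASE_PROMPT` are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """What gets sent to a model in one evaluation call (hypothetical type)."""
    setting: str
    image_path: Optional[str]
    prompt: str


# Hypothetical prompt wording; the benchmark's actual prompts may differ.
BASE_PROMPT = "Write matplotlib code that reproduces the target plot."


def build_request(setting: str,
                  image_path: Optional[str] = None,
                  instruction: Optional[str] = None) -> Request:
    """Assemble the model input for Direct Asking or Conditional Asking."""
    if setting == "direct":
        # Direct Asking: the image is the only description of the target plot.
        if image_path is None:
            raise ValueError("Direct Asking requires an image")
        return Request("direct", image_path, BASE_PROMPT)
    if setting == "conditional":
        # Conditional Asking: an instruction is required; the image is
        # optional so that text-only LLMs can be evaluated as well.
        if instruction is None:
            raise ValueError("Conditional Asking requires an instruction")
        prompt = f"{BASE_PROMPT}\n\nInstruction:\n{instruction}"
        return Request("conditional", image_path, prompt)
    raise ValueError(f"unknown setting: {setting!r}")
```

Keeping the image optional in the conditional branch lets the same code path serve both MLLMs and text-only LLMs, which makes scores from the two model families easier to compare.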
Key Findings
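Dataset size and complexity are modest but varied: 132 examples containing 293 subplots.
Top closed-source MLLMs (GPT-4V, Claude-3, Gemini-Pro) still score below human-level visual code reproduction, with gaps in visual fidelity, text accuracy, and text-dense plots.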
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| GPT-4V overall rating (conditional asking) | 7.68/10 | — | — | Plot2Code (132 samples) | GPT-4V overall rating 7.68 in Table 3 | Table 3 |
| Code pass rate (GPT-4V, conditional asking) | 81.8% | — | — | Plot2Code (132 samples) | GPT-4V pass rate 81.8% in Table 3 | Table 3 |
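To put the pass rate in the table into absolute terms, the short calculation below converts the percentage into an implied number of samples. It assumes the rate is computed over all 132 samples, which the table's Split / Dataset column suggests but does not state outright.

```python
# Values taken from the results table above.
total_samples = 132
reported_pass_rate = 0.818  # 81.8%

# Assumption: the pass rate is (samples whose code ran) / (all 132 samples).
implied_passes = round(total_samples * reported_pass_rate)
implied_failures = total_samples - implied_passes

# Sanity check: recompute the percentage from the implied count.
recomputed_percent = round(100 * implied_passes / total_samples, 1)

print(f"implied passing samples: {implied_passes}")
print(f"implied failing samples: {implied_failures}")
print(f"recomputed pass rate: {recomputed_percent}%")
```

Under that assumption, roughly 108 samples produce runnable code and about 24 do not, so close to one in five generations would fail before any visual comparison can take place.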
What To Try In 7 Days
Run Plot2Code on your model to measure code pass rate and text-match separately.
Add OCR tokens or higher-resolution images to your model pipeline and compare pass rate improvements.
Use GPT-4V judgement to triage outputs that need human review for visual/text errors.
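For the first item above, a rough local approximation of the two code-level metrics can be sketched as follows. This is not the benchmark's released evaluation script: `runs_ok`, `code_pass_rate`, and `text_match_ratio` are hypothetical helpers, and the text-match definition used here (the share of reference text strings that also appear in the generated plot) is a simplifying assumption that may differ from the paper's exact formula.

```python
import os
import subprocess
import sys
from collections import Counter
from typing import Iterable, Sequence


def runs_ok(code: str, timeout_s: float = 30.0) -> bool:
    """Return True if the snippet exits cleanly in a fresh Python process."""
    # A non-interactive backend keeps matplotlib from trying to open windows.
    env = {**os.environ, "MPLBACKEND": "Agg"}
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
            env=env,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def code_pass_rate(snippets: Sequence[str]) -> float:
    """Fraction of generated snippets that execute without error."""
    if not snippets:
        return 0.0
    return sum(runs_ok(s) for s in snippets) / len(snippets)


def text_match_ratio(reference_texts: Iterable[str],
                     generated_texts: Iterable[str]) -> float:
    """Share of reference text strings also present in the generated plot.

    Assumed definition for illustration: strings are compared exactly after
    stripping whitespace, and duplicates are counted as a multiset.
    """
    ref = Counter(t.strip() for t in reference_texts if t.strip())
    gen = Counter(t.strip() for t in generated_texts if t.strip())
    if not ref:
        return 1.0  # nothing to match, so nothing was missed
    matched = sum((ref & gen).values())
    return matched / sum(ref.values())
```

Running each snippet in a separate process keeps a crash or an infinite loop in generated code from taking down the evaluation itself. Model-generated code should still be treated as untrusted, so a sandbox or container is advisable in practice.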
Risks & Boundaries
Limitations
Small curated test set (132 examples) limits coverage of all plot styles.
Plots are limited to matplotlib examples from the official gallery; other plotting libraries not covered.
When Not To Use
Do not use as the only criterion for production model selection; pair with manual checks.
Not suitable for interactive, multi-turn code refinement workflows without extension.
Failure Modes
Automatic judge bias: GPT-4V may favor certain styles or colors.
OCR failures can lower text-match scores even when plots look correct.

