Plot2Code: a focused benchmark that asks multimodal LLMs to generate matplotlib code from scientific plots

Overview

Decision Snapshot: Needs Validation

The benchmark provides clear, reproducible tests and metrics for plot-to-code tasks but needs broader scale and community adoption before replacing human evaluation.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 60%

Authors

Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo

Links

Abstract / PDF / Data

Why It Matters For Business

Plot2Code measures real-world value: turning charts into runnable plotting code. Use it to stress-test models that will automate report or dashboard generation before production rollout.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

Plot2Code is a targeted benchmark that measures how well multimodal LLMs (MLLMs) can read a plot image and produce executable matplotlib code that recreates it. The authors release 132 curated plot examples (293 subplots) with reference code and GPT-4–written plain-language instructions. They propose three automatic metrics (code pass rate, text-match ratio, and a GPT-4V visual rating) and evaluate 14 MLLMs. Results show that even top models (GPT-4V, Claude-3, Gemini-Pro) leave clear gaps: visual fidelity, text accuracy, and handling of text-dense plots remain hard. The dataset and evaluation scripts aim to help improve vision+code capabilities.

Problem Statement

Current multimodal LLM benchmarks measure image understanding but not the ability to convert plot images into executable plotting code. We need a fair, automatic benchmark that tests image+text inputs and measures both code correctness and visual fidelity.

Main Contribution

A curated Plot2Code test set: 132 high-quality matplotlib examples (293 total subplots) with reference code and GPT-4–summarized instructions.

Two evaluation settings: Direct Asking (image in) and Conditional Asking (image + plain instruction or instruction-only for LLMs).
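The two settings differ only in what the model receives. As a rough sketch (the `Request` type, the `build_request` helper, and the prompt wording are all hypothetical and not taken from the paper), the inputs might be assembled like this:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """One benchmark query. Field names are illustrative, not from the paper."""
    setting: str
    prompt: str
    image_path: Optional[str]


def build_request(setting: str,
                  image_path: Optional[str] = None,
                  instruction: Optional[str] = None) -> Request:
    # Hypothetical base prompt; the benchmark's real wording may differ.
    base = "Write matplotlib code that reproduces the plot."
    if setting == "direct":
        # Direct Asking: the image is the only description of the target.
        if image_path is None:
            raise ValueError("direct asking needs an image")
        return Request("direct", base, image_path)
    if setting == "conditional":
        # Conditional Asking: an instruction is always present; the image
        # is optional so that text-only LLMs can be evaluated as well.
        if instruction is None:
            raise ValueError("conditional asking needs an instruction")
        prompt = f"{base}\nInstruction: {instruction}"
        return Request("conditional", prompt, image_path)
    raise ValueError(f"unknown setting: {setting}")
```

Keeping the image optional in the conditional branch mirrors the instruction-only case that the summary mentions for text-only LLMs.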

Key Findings

Dataset size and complexity are modest but varied

Numbers: 132 samples; 293 subplots; code tokens 401±281; avg text 23±13

Practical Use: Use Plot2Code to benchmark medium-scale visual coding tasks; expect a mix of easy and hard plots rather than massive coverage.

Evidence Ref: Table 1, Sec. 3.3

Top closed-source MLLMs still score below human-level visual code reproduction

Numbers: GPT-4V / Claude-3-Opus overall rating ≈ 7.68/10 (conditional)

Practical Use: Don't assume production readiness for automated plot-to-code; expect manual checking and fixes after model output.

Evidence Ref: Table 3, Sec. 4.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GPT-4V overall rating (conditional asking)	7.68/10	—	—	Plot2Code (132 samples)	GPT-4V overall rating 7.68 in Table 3	Table 3
Code pass rate (GPT-4V, conditional asking)	81.8%	—	—	Plot2Code (132 samples)	GPT-4V pass rate 81.8% in Table 3	Table 3
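A code pass rate like the one in the table can be read as the share of generated scripts that run without error. The sketch below is one way to compute such a number, assuming each model output is a standalone Python script; the helper names `runs_cleanly` and `code_pass_rate` and the 30-second timeout are illustrative choices, not the authors' evaluation script.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def runs_cleanly(code: str, timeout_s: float = 30.0) -> bool:
    """Return True if the snippet exits with status 0 within the timeout."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        # MPLBACKEND=Agg asks matplotlib for a non-interactive backend,
        # so plt.show() does not block waiting for a window.
        env = {**os.environ, "MPLBACKEND": "Agg"}
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                env=env,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def code_pass_rate(snippets: list[str]) -> float:
    """Fraction of snippets that execute without error (0.0 if empty)."""
    if not snippets:
        return 0.0
    passed = sum(runs_cleanly(s) for s in snippets)
    return passed / len(snippets)
```

Model-generated code is untrusted, so in practice this would belong inside a container or similar sandbox rather than on a developer machine.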

What To Try In 7 Days

Run Plot2Code on your model to measure code pass rate and text-match separately.

Add OCR tokens or higher-resolution images to your model pipeline and compare pass rate improvements.

Use GPT-4V judgement to triage outputs that need human review for visual/text errors.
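For the triage step, a minimal sketch is shown here. It assumes judge ratings on the 10-point scale used in the results have already been collected per sample; the `triage` helper and the cutoff of 7.0 are hypothetical and would need tuning against human review.

```python
def triage(ratings: dict[str, float], threshold: float = 7.0) -> list[str]:
    """Return ids of samples rated below the threshold, worst first."""
    flagged = [(score, sample_id)
               for sample_id, score in ratings.items()
               if score < threshold]
    # Sorting the (score, id) pairs puts the lowest-rated outputs first.
    return [sample_id for _, sample_id in sorted(flagged)]


# Example with made-up ratings for three outputs.
review_queue = triage({"plot_a": 9.0, "plot_b": 4.5, "plot_c": 6.0})
```

With these made-up ratings, `review_queue` holds `plot_b` followed by `plot_c`, while `plot_a` skips review.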

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://huggingface.co/datasets/TencentARC/Plot2Code

Risks & Boundaries

Limitations

Small curated test set (132 examples) limits coverage of all plot styles.

Plots are limited to matplotlib examples from the official gallery; other plotting libraries are not covered.

When Not To Use

Do not use as the only criterion for production model selection; pair with manual checks.

Not suitable for interactive, multi-turn code refinement workflows without extension.

Failure Modes

Automatic judge bias: GPT-4V may favor certain styles or colors.

OCR failures can lower text-match scores even when plots look correct.
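The OCR failure mode can be illustrated with a toy score. The function below is an assumed, simplified definition (the fraction of reference text strings that also appear in the generated plot, counted with multiplicity); the paper's exact text-match formula may differ.

```python
from collections import Counter


def text_match_ratio(reference: list[str], generated: list[str]) -> float:
    """Share of reference strings matched exactly in the generated plot."""
    ref = Counter(s.strip() for s in reference if s.strip())
    gen = Counter(s.strip() for s in generated if s.strip())
    total = sum(ref.values())
    if total == 0:
        return 1.0  # nothing to match, treated here as a perfect score
    # Counter intersection keeps the minimum count of each shared string.
    matched = sum((ref & gen).values())
    return matched / total


# Made-up labels: the plot may look right, but one label was misread.
reference_text = ["Time (s)", "Voltage (V)", "Step response", "0.5"]
extracted_text = ["Tirne (s)", "Voltage (V)", "Step response", "0.5"]
score = text_match_ratio(reference_text, extracted_text)
```

A single misread label costs a quarter of the score in this example, which is why exact-match text metrics are best read alongside a visual rating.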

Core Entities

Models

GPT-4V, GPT-4, Gemini-Pro, Claude-3-Opus, Claude-3-Sonnet, Mini-Gemini-8x7B-HD, Mini-Gemini-34B-HD, Mini-Gemini-8x7B, Mini-Gemini-2B, DeepSeek-VL-7B, LLaVA-1.6-Mistral-7B, LLaVA-1.6-34B

Metrics

code pass rate, text-match ratio, GPT-4V overall rating, CLIP-Score, MSE, SSIM

Datasets

Plot2Code

Benchmarks

HumanEval, MBPP, MMCode, Design2Code
