Plot2Code: a focused benchmark that asks multimodal LLMs to generate matplotlib code from scientific plots

May 13, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark provides clear, reproducible tests and metrics for plot-to-code tasks but needs broader scale and community adoption before replacing human evaluation.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 60%

Authors

Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo

Links

Abstract / PDF / Data

Why It Matters For Business

Plot2Code measures real-world value: turning charts into runnable plotting code. Use it to stress-test models that will automate report or dashboard generation before production rollout.

Who Should Care

Summary TLDR

Plot2Code is a targeted benchmark that measures how well multimodal LLMs (MLLMs) can read a plot image and produce executable matplotlib code that recreates it. The authors release 132 curated plot examples (293 subplots) with reference code and GPT-4–written plain-language instructions. They propose three automatic metrics (code pass rate, text-match ratio, and a GPT-4V visual rating) and evaluate 14 MLLMs. Even the top models (GPT-4V, Claude-3, Gemini-Pro) leave clear gaps: visual fidelity, text accuracy, and handling of text-dense plots remain hard. The dataset and evaluation scripts are intended to help improve vision+code capabilities.
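The code pass rate is the easiest of the three metrics to reproduce locally. The sketch below is one possible way to compute it, not the authors' evaluation script: it assumes a snippet "passes" when it exits with status 0 in a fresh interpreter. The helper names (`runs_cleanly`, `code_pass_rate`), the 30-second timeout, and the toy snippets are all illustrative assumptions.

```python
import os
import subprocess
import sys


def runs_cleanly(code: str, timeout_s: float = 30.0) -> bool:
    """Return True if `code` exits with status 0 in a fresh interpreter."""
    # MPLBACKEND=Agg asks matplotlib to render off-screen, so plt.show()
    # does not block on a machine without a display.
    env = dict(os.environ, MPLBACKEND="Agg")
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            env=env,
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def code_pass_rate(snippets: list[str]) -> float:
    """Fraction of snippets that execute without error (0.0 for no input)."""
    if not snippets:
        return 0.0
    passed = sum(runs_cleanly(s) for s in snippets)
    return passed / len(snippets)


# Toy stand-ins for model outputs (assumption: real outputs would be
# matplotlib scripts extracted from the model's response).
samples = [
    "x = [1, 2, 3]\nprint(sum(x))",
    "import not_a_real_module_xyz",
    "raise ValueError('bad plot')",
    "y = 2 ** 10",
]
rate = code_pass_rate(samples)
print(f"pass rate: {rate:.1%}")
```

Model-generated code is untrusted, so a real harness would likely run it inside a container or other sandbox rather than directly on the host.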

Problem Statement

Current multimodal LLM benchmarks measure image understanding but not the ability to convert plot images into executable plotting code. We need a fair, automatic benchmark that tests image+text inputs and measures both code correctness and visual fidelity.

Main Contribution

A curated Plot2Code test set: 132 high-quality matplotlib examples (293 total subplots) with reference code and GPT-4–summarized instructions.

Two evaluation settings: Direct Asking (image in) and Conditional Asking (image + plain instruction or instruction-only for LLMs).
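The difference between the two settings comes down to what is sent to the model. The following is a minimal sketch of how the requests could be assembled; the prompt wording, the `Request` type, and both function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    prompt: str
    image_path: Optional[str]  # None means a text-only request


# Hypothetical base prompt; the benchmark's actual wording may differ.
BASE = "Write matplotlib code that reproduces the plot as closely as possible."


def direct_asking(image_path: str) -> Request:
    """Direct Asking: the model sees only the plot image."""
    return Request(prompt=BASE, image_path=image_path)


def conditional_asking(instruction: str,
                       image_path: Optional[str] = None) -> Request:
    """Conditional Asking: a plain-language instruction is added.

    Pass image_path=None for text-only LLMs, which receive the
    instruction without the image.
    """
    prompt = f"{BASE}\n\nInstruction:\n{instruction}"
    return Request(prompt=prompt, image_path=image_path)


direct = direct_asking("plot_001.png")
with_image = conditional_asking("Draw a two-panel line chart.", "plot_001.png")
text_only = conditional_asking("Draw a two-panel line chart.")
```

Keeping the base prompt identical across settings makes it easier to attribute score differences to the added instruction rather than to prompt wording.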

Key Findings

Dataset size and complexity are modest but varied

Numbers: 132 samples; 293 subplots; code tokens 401±281; avg text 23±13

Practical Use: Use Plot2Code to benchmark medium-scale visual coding tasks; expect a mix of easy and hard plots rather than massive coverage.

Evidence Ref: Table 1, Sec. 3.3
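To compare your own plotting scripts against the "code tokens 401±281" figure, a mean and standard deviation of per-script token counts can be computed as sketched below. This summary does not state which tokenizer or which standard deviation (sample or population) produced that figure, so Python's built-in `tokenize` module and the population standard deviation are stand-in assumptions, and the resulting counts are not directly comparable to LLM tokenizer counts.

```python
import io
import statistics
import tokenize

# Token types that carry no code content and are skipped when counting.
SKIP = {
    tokenize.COMMENT,
    tokenize.NL,
    tokenize.NEWLINE,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}


def count_tokens(source: str) -> int:
    """Count lexical tokens in Python source, ignoring layout and comments."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type not in SKIP)


def mean_and_std(values):
    """Return (mean, population standard deviation) of a list of numbers."""
    return statistics.mean(values), statistics.pstdev(values)


# Two tiny stand-in scripts; real inputs would be full plotting scripts.
scripts = ["x = 1\n", "print(x)\n"]
counts = [count_tokens(s) for s in scripts]
mean, std = mean_and_std(counts)
print(f"code tokens {mean:.1f}±{std:.1f}")
```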

Top closed-source MLLMs still score below human-level visual code reproduction

Numbers: GPT-4V / Claude-3-Opus overall rating ≈ 7.68/10 (conditional)

Practical Use: Don't assume production readiness for automated plot-to-code; expect manual checking and fixes after model output.

Evidence Ref: Table 3, Sec. 4.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GPT-4V overall rating (conditional asking) | 7.68/10 | - | - | Plot2Code (132 samples) | GPT-4V overall rating 7.68 in Table 3 | Table 3
Code pass rate (GPT-4V, conditional asking) | 81.8% | - | - | Plot2Code (132 samples) | GPT-4V pass rate 81.8% in Table 3 | Table 3

What To Try In 7 Days

Run Plot2Code on your model to measure code pass rate and text-match separately.

Add OCR tokens or higher-resolution images to your model pipeline and compare pass rate improvements.

Use GPT-4V judgement to triage outputs that need human review for visual/text errors.
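The triage step in the last item can be sketched as a simple bucketing rule. The example assumes each output already carries an execution flag and a judge score out of 10; the `Output` type, the bucket names, and the 7.0 threshold are illustrative choices, not values from the paper.

```python
from dataclasses import dataclass


@dataclass
class Output:
    sample_id: str
    executed: bool   # did the generated code run without error?
    rating: float    # judge score out of 10; ignored if executed is False


def triage(outputs, min_rating: float = 7.0):
    """Split outputs into failed / review / accept buckets.

    min_rating is a hypothetical cut-off; tune it against a small
    human-labelled sample before relying on it.
    """
    buckets = {"failed": [], "review": [], "accept": []}
    for out in outputs:
        if not out.executed:
            buckets["failed"].append(out.sample_id)
        elif out.rating < min_rating:
            buckets["review"].append(out.sample_id)
        else:
            buckets["accept"].append(out.sample_id)
    return buckets


outputs = [
    Output("bar_01", executed=True, rating=8.5),
    Output("pie_02", executed=True, rating=5.0),
    Output("hist_03", executed=False, rating=0.0),
]
result = triage(outputs)
print(result)
```

Because the judge itself can be biased, it may be worth spot-checking a random slice of the "accept" bucket as well, not only the "review" bucket.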

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Small curated test set (132 examples) limits coverage of all plot styles.

Plots are limited to matplotlib examples from the official gallery; other plotting libraries not covered.

When Not To Use

Do not use as the only criterion for production model selection; pair with manual checks.

Not suitable for interactive, multi-turn code refinement workflows without extension.

Failure Modes

Automatic judge bias: GPT-4V may favor certain styles or colors.

OCR failures can lower text-match scores even when plots look correct.
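The OCR failure mode is easy to see with a toy text-match computation. The sketch below assumes one plausible reading of the metric, namely the fraction of reference text strings that also appear in the generated plot, compared after case and whitespace normalization. The paper's exact definition may differ, and the function names and sample strings are illustrative.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not count."""
    return " ".join(text.split()).lower()


def text_match_ratio(reference, generated) -> float:
    """Fraction of reference strings matched in the generated plot.

    Uses multiset intersection, so a label that appears twice in the
    reference must also appear twice in the generated plot.
    """
    ref = Counter(normalize(t) for t in reference if t.strip())
    gen = Counter(normalize(t) for t in generated if t.strip())
    total = sum(ref.values())
    if total == 0:
        return 1.0  # nothing to match
    matched = sum((ref & gen).values())
    return matched / total


reference = ["Sales by Region", "Quarter", "Revenue (USD)", "North"]
correct = ["sales by region", "Quarter", "Revenue (USD)", "North"]
ocr_misread = ["Sales by Region", "Quarter", "Revenue (USD)", "Nortn"]

print(text_match_ratio(reference, correct))      # 1.0
print(text_match_ratio(reference, ocr_misread))  # 0.75
```

A single misread character costs a full label here, which is why a plot that looks correct can still lose text-match score.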

Core Entities

Models

GPT-4V, GPT-4, Gemini-Pro, Claude-3-Opus, Claude-3-Sonnet, Mini-Gemini-8x7B-HD, Mini-Gemini-34B-HD, Mini-Gemini-8x7B, Mini-Gemini-2B, DeepSeek-VL-7B, LLaVA-1.6-Mistral-7B, LLaVA-1.6-34B

Metrics

code pass rate, text-match ratio, GPT-4V overall rating, CLIP-Score, MSE, SSIM
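For readers unfamiliar with the two pixel-level metrics in this list, here is a minimal pure-Python sketch of MSE and a simplified SSIM over flattened grayscale images. The SSIM shown is a global, single-window variant used only for illustration; standard SSIM averages the same formula over local windows, as library implementations such as scikit-image's `structural_similarity` do. The function names and pixel values are assumptions.

```python
from statistics import fmean


def mse(a, b) -> float:
    """Mean squared error between two equal-length pixel sequences."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return fmean([(x - y) * (x - y) for x, y in zip(a, b)])


def global_ssim(a, b, dynamic_range: float = 255.0) -> float:
    """SSIM computed once over the whole image (no sliding window)."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    c1 = (0.01 * dynamic_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilizer for the contrast term
    mu_a, mu_b = fmean(a), fmean(b)
    var_a = fmean([(x - mu_a) * (x - mu_a) for x in a])
    var_b = fmean([(y - mu_b) * (y - mu_b) for y in b])
    cov = fmean([(x - mu_a) * (y - mu_b) for x, y in zip(a, b)])
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a * mu_a + mu_b * mu_b + c1) * (var_a + var_b + c2)
    return numerator / denominator


# Flattened 2x2 grayscale "images" with values in 0..255.
img_ref = [0, 50, 100, 150]
img_gen = [0, 50, 100, 158]

print(mse(img_ref, img_gen))
print(global_ssim(img_ref, img_gen))
```

Both metrics compare pixels directly, so a plot with the right content but a different figure size or font can score poorly.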

Datasets

Plot2Code

Benchmarks

HumanEval, MBPP, MMCode, Design2Code