Plot2Code: a focused benchmark that asks multimodal LLMs to generate matplotlib code from scientific plots

May 13, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.3

Citation Count

1

Authors

Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo

Links

Abstract / PDF

Why It Matters For Business

Plot2Code measures a capability with direct practical use: turning chart images into runnable plotting code. Use it to stress-test models that will automate report or dashboard generation before production rollout.

Summary TLDR

Plot2Code is a targeted benchmark that measures how well multimodal LLMs (MLLMs) can read a plot image and produce executable matplotlib code that recreates it. The authors release 132 curated plot examples (293 subplots) with reference code and GPT-4–written plain-language instructions. They propose three automatic metrics (code pass rate, text-match ratio, and a GPT-4V visual rating) and evaluate 14 MLLMs. Even the top models (GPT-4V, Claude-3, Gemini-Pro) leave clear gaps: visual fidelity, text accuracy, and handling of text-dense plots remain hard. The dataset and evaluation scripts are intended to help improve vision+code capabilities.

Problem Statement

Current multimodal LLM benchmarks measure image understanding but not the ability to convert plot images into executable plotting code. We need a fair, automatic benchmark that tests image+text inputs and measures both code correctness and visual fidelity.

Main Contribution

A curated Plot2Code test set: 132 high-quality matplotlib examples (293 total subplots) with reference code and GPT-4–summarized instructions.

Two evaluation settings: Direct Asking (image only) and Conditional Asking (image plus a plain-language instruction, or the instruction alone for text-only LLMs).

A metric suite: code pass rate (executable), text-match ratio (text fidelity), and GPT-4V overall visual rating (1–10).

A benchmark study over 14 MLLMs (closed- and open-source) and ablations on prompts, OCR tokens, and image resolution.

Public release of the dataset on HuggingFace for reproducibility and follow-up work.
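The code pass rate in the metric suite above is, in essence, the fraction of generated snippets that execute without error. The following is a minimal sketch of that idea, not the authors' evaluation script: the helper names `runs_cleanly` and `code_pass_rate`, the 30-second timeout, and the choice to run each snippet in a fresh interpreter are all assumptions made for illustration. The released harness may apply further checks, such as confirming that an image was actually rendered.

```python
import os
import subprocess
import sys


def runs_cleanly(code: str, timeout_s: float = 30.0) -> bool:
    """Return True if the snippet exits with status 0 in a fresh interpreter.

    Hypothetical helper: a real harness may also verify that a figure was saved.
    """
    # Force matplotlib's non-interactive Agg backend so plt.show() cannot block.
    env = dict(os.environ, MPLBACKEND="Agg")
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            env=env,
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Treat snippets that hang as failures.
        return False
    return result.returncode == 0


def code_pass_rate(snippets: list[str]) -> float:
    """Fraction of snippets that execute without raising or timing out."""
    if not snippets:
        raise ValueError("need at least one snippet")
    return sum(runs_cleanly(s) for s in snippets) / len(snippets)


# Illustrative stand-ins for model outputs (no matplotlib required here).
samples = ["x = 1 + 1", "undefined_name"]
print(code_pass_rate(samples))  # prints 0.5
```

Running each snippet in a subprocess keeps a crash or infinite loop in generated code from taking down the evaluation loop, at the cost of interpreter start-up time per sample.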

Key Findings

Dataset size and complexity are modest but varied

Numbers: 132 samples; 293 subplots; code tokens 401±281; avg text 23±13

Top closed-source MLLMs still score below human-level visual code reproduction

Numbers: GPT-4V / Claude-3-Opus overall rating ≈ 7.68/10 (conditional)

High code pass rates can coexist with imperfect visual/text fidelity

Numbers: GPT-4V pass rate 81.8% but text-match 70.7% (conditional)

Open-source MLLMs lag behind closed-source counterparts

Numbers: Best open-source (Mini-Gemini-8x7B-HD) rating 6.08, pass rate 58.4% (conditional)

GPT-4V judgement correlates moderately with human evaluation

Numbers: Kendall τ=0.437; Pearson r=0.479; Spearman ρ=0.469 (n=920)

Traditional pixel metrics (MSE, SSIM) fail to detect visual differences for plots

Numbers: MSE p=0.22, SSIM p=0.21 (not significant) vs GPT-4V p=1.22×10^-11
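The agreement between the GPT-4V judge and human raters in the findings above is reported with Kendall's τ, among other coefficients. As a rough illustration of what that statistic measures, here is a pure-Python sketch of the tie-aware τ-b variant. Which variant the authors used is not stated in this summary, so τ-b is an assumption, and the score lists are made-up example data rather than values from the paper.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score sequences (handles ties)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = tied_x = tied_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0:
            tied_x += 1  # pair tied in x (possibly in y as well)
        if dy == 0:
            tied_y += 1  # pair tied in y (possibly in x as well)
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1  # both raters order the pair the same way
            else:
                discordant += 1  # raters disagree on the ordering
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one sequence is constant")
    return (concordant - discordant) / denom


# Hypothetical ratings on a 1-10 scale, for illustration only.
judge_scores = [8, 6, 7, 3, 9, 5]
human_scores = [7, 6, 8, 2, 9, 4]
print(round(kendall_tau_b(judge_scores, human_scores), 3))
```

A value of 1 means the two raters rank every pair of outputs identically, 0 means no rank association, and -1 means fully reversed rankings. For real evaluations a tested library routine such as `scipy.stats.kendalltau` is usually the better choice.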

Results

GPT-4V overall rating (conditional asking)

Value: 7.68/10

Code pass rate (GPT-4V, conditional asking)

Value: 81.8%

Text-match ratio (GPT-4V, conditional asking)

Value: 70.7%

Best open-source model (Mini-Gemini-8x7B-HD) rating (conditional)

Value: 6.08/10
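The text-match ratio reported above compares the text elements (titles, axis labels, legend entries, tick labels) of the generated plot with those of the reference plot. The exact formula is not given in this summary, so the sketch below is an assumed simplification: it counts the share of reference strings that also appear in the generated plot, after whitespace and case normalisation. The function names and the normalisation rule are hypothetical.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase (an assumed normalisation rule)."""
    return " ".join(text.split()).lower()


def text_match_ratio(reference_texts, generated_texts) -> float:
    """Share of reference text elements that also occur in the generated plot.

    Uses multiset matching, so a label that appears twice in the reference
    must appear twice in the generated plot to count twice.
    """
    ref = Counter(normalize(t) for t in reference_texts if t.strip())
    gen = Counter(normalize(t) for t in generated_texts if t.strip())
    if not ref:
        raise ValueError("reference plot has no text elements")
    matched = sum((ref & gen).values())  # multiset intersection
    return matched / sum(ref.values())


# Illustrative text elements, not taken from the dataset.
reference = ["Time (s)", "Voltage (V)", "Sensor A"]
generated = ["time (s)", "Voltage (V)", "Sensor B", "Extra note"]
print(round(text_match_ratio(reference, generated), 3))  # prints 0.667
```

This version ignores extra text in the generated plot. A stricter variant could also penalise such extras, for example by combining this recall-style score with the matching share of generated strings.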

Who Should Care

What To Try In 7 Days

Run Plot2Code on your model to measure code pass rate and text-match separately.

Add OCR tokens or higher-resolution images to your model pipeline and compare the pass rate before and after.

Use GPT-4V judgement to triage outputs that need human review for visual/text errors.
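The triage step in the list above can be sketched as a small rule over per-sample metrics. Everything here is illustrative: the `Result` record, its field names, and the thresholds (rating below 7 or text-match below 0.8) are hypothetical choices, not values recommended by the paper.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Per-sample evaluation record (hypothetical structure)."""
    sample_id: str
    passed: bool        # did the generated code execute?
    text_match: float   # 0.0 to 1.0
    rating: float       # judge rating on the 1-10 scale


def triage(r: Result, min_rating: float = 7.0, min_text_match: float = 0.8) -> str:
    """Sort an output into 'reject', 'review', or 'accept'."""
    if not r.passed:
        return "reject"   # code that does not run needs regeneration, not review
    if r.rating < min_rating or r.text_match < min_text_match:
        return "review"   # runs, but the judge or the text check flags a problem
    return "accept"


# Made-up results for illustration.
results = [
    Result("plot-001", passed=True, text_match=0.95, rating=8.5),
    Result("plot-002", passed=True, text_match=0.60, rating=8.0),
    Result("plot-003", passed=False, text_match=0.0, rating=1.0),
]
for r in results:
    print(r.sample_id, triage(r))
# prints:
# plot-001 accept
# plot-002 review
# plot-003 reject
```

Keeping the pass check separate from the fidelity checks mirrors the finding that code can run cleanly while still rendering the wrong text or layout.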

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Small curated test set (132 examples) limits coverage of all plot styles.
  • Plots are limited to matplotlib examples from the official gallery; other plotting libraries not covered.
  • Instructions and the automatic judge rely on GPT-4/GPT-4V, which may introduce judge bias or leakage.
  • Some evaluation components depend on OCR and image resolution; results are sensitive to vision encoder quality.

When Not To Use

  • Do not use as the only criterion for production model selection; pair with manual checks.
  • Not suitable for interactive, multi-turn code refinement workflows without extension.
  • Not meant for non-matplotlib plotting ecosystems or highly domain-specific visualizations.

Failure Modes

  • Automatic judge bias: GPT-4V may favor certain styles or colors.
  • OCR failures can lower text-match scores even when plots look correct.
  • Models may produce syntactically runnable code that renders visually incorrect plots.
  • Image resolution limits can hide fine-grained errors.

Core Entities

Models

  • GPT-4V
  • GPT-4
  • Gemini-Pro
  • Claude-3-Opus
  • Claude-3-Sonnet
  • Mini-Gemini-8x7B-HD
  • Mini-Gemini-34B-HD
  • Mini-Gemini-8x7B
  • Mini-Gemini-2B
  • DeepSeek-VL-7B
  • LLaVA-1.6-Mistral-7B
  • LLaVA-1.6-34B

Metrics

  • code pass rate
  • text-match ratio
  • GPT-4V overall rating
  • CLIP-Score
  • MSE
  • SSIM

Datasets

  • Plot2Code

Benchmarks

  • HumanEval
  • MBPP
  • MMCode
  • Design2Code