600k-chart instruction data + a human benchmark to improve multimodal chart QA

November 15, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.45

Citation Count

4

Authors

Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, Dong Yu

Links

Abstract / PDF

Why It Matters For Business

Fine-tuning multimodal LLMs on domain-specific chart instructions can automate chart reading and QA; expect better classification and reasoning, but not reliable numeric table extraction.

Summary TLDR

This paper releases MMC-Instruction, a 600k-instance instruction-tuning dataset for chart understanding, plus MMC-Benchmark, a 2k-item human-annotated benchmark covering nine chart tasks. The authors fine-tune a large multimodal model (LMM), MMCA, with a two-stage recipe (chart-text alignment, then LoRA-based instruction tuning) and show that it outperforms prior open-source LMMs on chart QA and related tasks. Large gaps remain: even GPT-4V struggles with precise chart-to-table/JSON extraction, and many models fail at OCR, layout reasoning, and instruction following.

Problem Statement

Current large multimodal models (LMMs) lack chart-specific skills such as reading text layout, extracting numeric values, and reasoning over charts. The paper aims to supply large, diverse training data and an evaluation benchmark to teach and measure chart understanding in LMMs.

Main Contribution

MMC-Instruction: a 600k-instance chart instruction-tuning corpus combining 210k chart-caption pairs, ~190k filtered public pairs, and 200k GPT-4-generated instruction examples.

MMC-Benchmark: a human-annotated benchmark (~2k questions/images) covering nine chart-focused tasks and two evaluation protocols (GPT-4 generation scoring and MQA multiple-choice).

MMCA: an instruction-tuned multimodal assistant (based on mPLUG-Owl + LoRA) trained with a two-stage regimen that outperforms prior open-source LMMs on chart tasks.
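To make the LoRA part of the recipe concrete, here is a minimal sketch in pure Python of what a low-rank adapter does to one linear layer: the frozen weight W is left untouched and only two small matrices A and B are trained. The matrix sizes, rank, and scaling factor below are illustrative assumptions, not values reported for MMCA, and the helper names are hypothetical.

```python
# Minimal LoRA sketch in pure Python (no ML framework).
# Shapes, rank and alpha are illustrative assumptions, not values from the paper.

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x). W stays frozen; only A and B train."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

def trainable_params(d_in, d_out, r):
    """Return (full fine-tuning params, LoRA params) for one linear layer."""
    return d_in * d_out, r * (d_in + d_out)

# Tiny example: a 2x3 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, 0.5, 0.5]]   # shape r x d_in
B = [[0.0], [0.0]]      # shape d_out x r, zero-initialized as in the LoRA paper
x = [1.0, 2.0, 3.0]

y_base = matvec(W, x)
y_lora = lora_forward(W, A, B, x, alpha=2.0, r=1)

# Parameter budget for a hypothetical 4096 x 4096 projection at rank 8.
full, lora = trainable_params(d_in=4096, d_out=4096, r=8)
print(y_base, y_lora, full, lora)
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to learn the small correction. For the hypothetical 4096 x 4096 layer, the adapter trains 65,536 parameters instead of 16,777,216.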

Key Findings

Large instruction corpus improves open-source LMMs on chart tasks.

Numbers: MMCA overall free-form 0.26 vs prior open-source best 0.24 (Table 4)

MMCA raises multiple-choice (MQA) accuracy over baselines.

Numbers: MMCA MQA overall 0.56 vs LLaVA1.5 0.51 (Table 5)

State-of-the-art GPT-4V still fails on precise numeric extraction tasks.

Numbers: GPT-4V free-form Chart to Datatable 0.05, Chart to Json 0.04 (Table 4)

MMCA outperforms prior methods on public chart/document benchmarks.

Numbers: MMCA ChartQA 57.4 vs Pix2Struct 56.0; DocVQA 72.5 vs Pix2Struct 72.1 (Table 6)

Vision encoder fine-tuning helps chart performance.

Numbers: MMCA w/o FT vision: ChartQA 54.2 vs MMCA 57.4 (Table 7)

Common error modes: perception, language bias, and instruction-following failures.

Numbers: GPT-4V errors: perception 39%, language bias 35%; open-source Not-Follow-Instructions 27%, weak vision 29.6% (Fig.4, Fig

Results

MMC-Benchmark overall (free-form, GPT-4 judged)

Value: MMCA 0.26, GPT-4V 0.51

Baseline: open-source LMMs ~0.17-0.24

MMC-Benchmark overall (MQA multiple-choice)

Value: MMCA 0.56, GPT-4V 0.76

Baseline: LLaVA1.5 0.51

Chart to Datatable (free-form)

Value: MMCA 0.08, GPT-4V 0.05

Baseline: others 0.00-0.05

Chart to Json (free-form)

Value: MMCA 0.05, GPT-4V 0.04

Baseline: others 0.00-0.01

ChartQA (public benchmark)

Value: MMCA 57.4

Baseline: Pix2Struct 56.0

Vision encoder fine-tuning ablation

Value: w/o FT vision ChartQA 54.2 vs MMCA 57.4

Baseline: MMCA with FT 57.4
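The comparisons above reduce to simple arithmetic, and it can help to see absolute and relative gains side by side when judging how large each improvement really is. The sketch below uses only the figures quoted in this summary; the `gain` helper and the dictionary labels are illustrative names, not anything defined by the paper. The MMC-Benchmark scores appear to be on a 0-1 scale and the ChartQA scores on a 0-100 scale, so absolute gains are only comparable within a scale, while relative gains are scale-free.

```python
def gain(new, old):
    """Return (absolute gain, relative gain) of `new` over `old`."""
    abs_gain = new - old
    rel_gain = abs_gain / old
    return abs_gain, rel_gain

# (new score, reference score) pairs taken from the results listed above.
comparisons = {
    "MMC-Benchmark free-form, MMCA vs best prior open-source": (0.26, 0.24),
    "MMC-Benchmark MQA, MMCA vs LLaVA1.5": (0.56, 0.51),
    "ChartQA, MMCA vs Pix2Struct": (57.4, 56.0),
    "ChartQA, MMCA vs MMCA w/o FT vision": (57.4, 54.2),
}

for name, (new, old) in comparisons.items():
    abs_gain, rel_gain = gain(new, old)
    print(f"{name}: +{abs_gain:.2f} absolute, {rel_gain:.1%} relative")
```

Read this way, the gains over prior open-source models are single-digit percentages in relative terms, which fits the summary's caution that large gaps to GPT-4V remain.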

Who Should Care

What To Try In 7 Days

Run MMCA (or fine-tune an LMM with MMC-Instruction) on a small set of your company charts to measure gains on classification and reasoning.

Add a verification OCR stage for numeric extraction before trusting model outputs for BI dashboards.

Use the MMC-Benchmark tasks and MQA protocol to baseline current tools on your chart types.
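For the baselining step, a multiple-choice protocol like MQA can be scored with plain accuracy, overall and per task. The following is a minimal sketch under assumptions: the record layout (`task`, `gold`, `pred`), the task labels, and the `mqa_accuracy` helper are hypothetical and do not reflect the benchmark's actual file format or official scoring script.

```python
from collections import defaultdict

def mqa_accuracy(records):
    """Score multiple-choice answers.

    records: iterable of dicts with keys 'task', 'gold', 'pred' (option letters).
    Returns (overall accuracy, {task: accuracy}).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["task"]] += 1
        # Normalize so that 'b', ' B ' and 'B' count as the same option.
        if rec["pred"].strip().upper() == rec["gold"].strip().upper():
            correct[rec["task"]] += 1
    if not total:
        return 0.0, {}
    per_task = {task: correct[task] / total[task] for task in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_task

# Hypothetical predictions from whatever tool is being baselined.
sample = [
    {"task": "chart_reasoning", "gold": "A", "pred": "a"},
    {"task": "chart_reasoning", "gold": "C", "pred": "B"},
    {"task": "chart_type_classification", "gold": "B", "pred": "B"},
    {"task": "chart_type_classification", "gold": "D", "pred": " D "},
]

overall, per_task = mqa_accuracy(sample)
print(overall, per_task)
```

Reporting per-task accuracy alongside the overall number makes it easier to see whether a tool is weak on a specific chart type or skill rather than uniformly.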

Optimization Features

Training Optimization

  • LoRA

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Instruction data is partly generated by GPT-4 and can contain errors or hallucinations (authors report ~85% of outputs acceptable).
  • Chart-to-datatable and chart-to-json extraction remain low-accuracy tasks even for top models.
  • Experiments use a 7B-parameter backbone; results may change with larger models or different compute.

When Not To Use

  • When you need exact, lossless extraction of all numeric values from charts.
  • When legal or privacy rules forbid sharing chart images with third-party models.
  • If you need a turnkey solution for production-grade OCR without verification.

Failure Modes

  • Vision perception error: the model misreads plot elements or values.
  • Language bias: the model answers from prior knowledge rather than chart evidence.
  • Not following instructions: open-source LMMs sometimes ignore prompts.
  • OCR/missing-value failure: a single missing numeric value makes the whole table extraction incorrect.
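Since one missing or misread value can invalidate an extracted table, a simple guard is to cross-check the model's output against an independent OCR pass before trusting it. The sketch below assumes both sources have already been parsed into label-to-number dictionaries; the `verify_extraction` helper, the tolerance, and the sample values are hypothetical and only illustrate the idea.

```python
def verify_extraction(model_values, ocr_values, rel_tol=0.01):
    """Compare model-extracted chart values with an independent OCR pass.

    Both arguments map a label (e.g. an axis tick) to a float.
    Returns a list of (label, issue) tuples; an empty list means the two agree.
    """
    issues = []
    for label, ocr_val in ocr_values.items():
        if label not in model_values:
            issues.append((label, "missing in model output"))
            continue
        model_val = model_values[label]
        # Relative tolerance, with a tiny absolute floor for values near zero.
        if abs(model_val - ocr_val) > max(rel_tol * abs(ocr_val), 1e-9):
            issues.append((label, f"mismatch: model={model_val}, ocr={ocr_val}"))
    for label in model_values:
        if label not in ocr_values:
            issues.append((label, "not confirmed by OCR"))
    return issues

# Hypothetical values: the model dropped 2021 and misread 2020.
model_values = {"2019": 12.5, "2020": 14.0, "2022": 19.1}
ocr_values = {"2019": 12.5, "2020": 14.8, "2021": 16.2, "2022": 19.1}

issues = verify_extraction(model_values, ocr_values)
trusted = not issues
for label, issue in issues:
    print(label, issue)
```

A check like this does not say which source is right, only that they disagree, so flagged charts would still need human review or a second extraction attempt.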

Core Entities

Models

  • MMCA
  • mPLUG-Owl
  • GPT-4V
  • LLaVA1.5
  • MiniGPT-v2
  • LRV-Instruction
  • Pix2Struct
  • Donut
  • BLIP-2
  • InstructBLIP
  • Shikra
  • Vicuna

Metrics

  • free-form correctness (GPT-4 scoring)
  • Accuracy

Datasets

  • MMC-Instruction
  • MMC-Benchmark
  • ChartQA
  • PlotQA
  • DVQA
  • FigureQA
  • SciGraphQA
  • Statista
  • VisText
  • ChartInfo
  • Unichart
  • DocVQA
  • TextVQA

Benchmarks

  • MMC-Benchmark
  • ChartQA
  • DocVQA
  • TextVQA

Context Entities

Models

  • GPT-4 (text-only for data generation)
  • gpt-4-32k-0314 (GPT-4 used for eval prompts)

Datasets

  • arXiv Scientific Chart-Caption corpus
  • Public chart datasets used for augmentation (Statista, PlotQA, VisText, ChartInfo, Unichart)