600k-chart instruction data + a human benchmark to improve multimodal chart QA

November 15, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The dataset and benchmark provide clear gains for chart reasoning and classification, but numeric extraction remains unreliable; expect to need OCR and verification for production.

Citations: 4

Evidence Strength: 0.78

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 60%

Authors

Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, Dong Yu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Automate chart reading and QA by fine-tuning multimodal LLMs with domain-specific chart instructions; expect better classification and reasoning but not perfect numeric table extraction.

Who Should Care

Summary TLDR

This paper releases MMC-Instruction, a 600k-instance instruction-tuning dataset for chart understanding, plus MMC-Benchmark, a 2k-item human-annotated benchmark covering nine chart tasks. The authors fine-tune a large multimodal model (LMM), MMCA, with a two-stage recipe (chart-text alignment, then LoRA-based instruction tuning) and show that MMCA outperforms prior open-source LMMs on chart QA and related tasks. Large gaps remain: GPT-4V still struggles with precise chart-to-datatable and chart-to-JSON extraction, and many models fail at OCR, layout reasoning, and instruction following.

Problem Statement

Current large multimodal models miss chart-specific skills (text layout, numeric extraction, chart reasoning). The paper aims to supply large, diverse training data and an evaluation benchmark to teach and measure chart understanding in LMMs.

Main Contribution

MMC-Instruction: a 600k-instance chart instruction-tuning corpus combining 210k chart-caption pairs, ~190k filtered public pairs, and 200k GPT-4-generated instruction examples.

MMC-Benchmark: a human-annotated benchmark (~2k questions/images) covering nine chart-focused tasks and two evaluation protocols (GPT-4 generation scoring and MQA multiple-choice).
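The MQA protocol reduces evaluation to checking which option a model picked. The sketch below shows one way to score such responses; it assumes options are lettered A to D and that answers appear either at the start of the response or after "answer is" / "Answer:". Both the option format and the helper names (`extract_choice`, `mqa_accuracy`) are illustrative assumptions, not the benchmark's official scorer.

```python
import re

# Assumed answer formats (not taken from the paper):
#   "B", "(B)", "B." or "B) ..." at the start of the response,
#   or "answer is B" / "Answer: (B)" anywhere in it.
_LEADING = re.compile(r"^\s*\(?([A-D])\)?(?:[.):]|\s*$)")
_PHRASE = re.compile(r"(?i:answer)\s*(?:(?i:is)\s*|:\s*)\(?([A-D])\)?(?![A-Za-z])")


def extract_choice(response):
    """Return the chosen option letter, or None if no clear choice is found."""
    for pattern in (_LEADING, _PHRASE):
        match = pattern.search(response)
        if match:
            return match.group(1)
    return None


def mqa_accuracy(responses, gold):
    """Fraction of responses whose extracted letter equals the gold letter.

    Responses with no parseable choice count as wrong.
    """
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


if __name__ == "__main__":
    responses = ["B", "(A) 2019", "The answer is C.", "A bar chart shows growth"]
    gold = ["B", "A", "D", "A"]
    print(mqa_accuracy(responses, gold))  # 0.5
```

The parser is deliberately strict: a response that merely starts with the word "A" is not treated as choosing option A, which avoids inflating accuracy for models that ignore the answer format.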

Key Findings

A large instruction corpus improves open-source LMMs on chart tasks.

Numbers: MMCA overall free-form 0.26 vs prior open-source best 0.24 (Table 4)

Practical Use: Fine-tuning LMMs with diverse chart instruction data yields measurable gains; try instruction-tuning on domain charts to raise accuracy.

Evidence Ref: Table 4

MMCA raises multiple-choice (MQA) accuracy over baselines.

Numbers: MMCA MQA overall 0.56 vs LLaVA1.5 0.51 (Table 5)

Practical Use: For classification or multiple-choice chart tasks, instruction-tuned LMMs give a meaningful lift; use MQA-style evaluation for quick benchmarking.

Evidence Ref: Table 5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
MMC-Benchmark overall (free-form, GPT-4 judged) | MMCA 0.26, GPT-4V 0.51 | open-source LMMs ~0.17-0.24 | MMCA +~0.02 over best open-source baseline | MMC-Benchmark (free-form) | Table 4: overall scores from free-form GPT-4 evaluation | Table 4
MMC-Benchmark overall (MQA multiple-choice) | MMCA 0.56, GPT-4V 0.76 | LLaVA1.5 0.51 | MMCA +0.05 over LLaVA1.5 | MMC-Benchmark (MQA) | Table 5: MQA accuracy | Table 5
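The deltas in the results can be re-derived from the reported scores. The short sketch below recomputes the absolute gain, adds the relative gain, and shows the remaining gap to GPT-4V. The scores are the ones reported above; the helper name `gain` and the choice to round to two decimals are illustrative.

```python
def gain(new, base):
    """Return (absolute gain, relative gain in percent) of `new` over `base`."""
    absolute = round(new - base, 2)
    relative_pct = round(100 * (new - base) / base, 1)
    return absolute, relative_pct


if __name__ == "__main__":
    # Free-form (Table 4): MMCA vs best prior open-source baseline
    print(gain(0.26, 0.24))       # (0.02, 8.3)
    # MQA (Table 5): MMCA vs LLaVA1.5
    print(gain(0.56, 0.51))       # (0.05, 9.8)
    # Remaining gap between GPT-4V and MMCA on each protocol
    print(round(0.51 - 0.26, 2))  # 0.25 (free-form)
    print(round(0.76 - 0.56, 2))  # 0.2 (MQA)
```

In relative terms both gains are under 10%, while the gap to GPT-4V (0.25 free-form, 0.20 MQA) is several times larger than either improvement.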

What To Try In 7 Days

Run MMCA (or fine-tune an LMM with MMC-Instruction) on a small set of your company charts to measure gains on classification and reasoning.

Add an OCR verification stage for numeric extraction before trusting model outputs in BI dashboards.

Use the MMC-Benchmark tasks and MQA protocol to baseline current tools on your chart types.
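The OCR verification step could look like the sketch below: numbers the model extracts from a chart are cross-checked against numbers read by a separate OCR pass, and only values that agree within a tolerance are trusted. Everything here is an assumption for illustration: the function names, the label-to-value dictionary format, and the 1% relative tolerance are not from the paper, and the OCR output is supplied as plain data rather than produced by a real OCR library.

```python
import math


def verify_extraction(model_values, ocr_values, rel_tol=0.01):
    """Cross-check model-extracted numbers against OCR-read numbers.

    Both arguments map a label (e.g. an axis category) to a float.
    Returns a dict mapping each model label to one of:
      "ok"         - OCR value agrees within rel_tol
      "mismatch"   - OCR value exists but disagrees
      "unverified" - OCR produced no value for this label
    """
    report = {}
    for label, value in model_values.items():
        if label not in ocr_values:
            report[label] = "unverified"
        elif math.isclose(value, ocr_values[label], rel_tol=rel_tol):
            report[label] = "ok"
        else:
            report[label] = "mismatch"
    return report


def is_trusted(report):
    """Trust an extraction only if every value was confirmed by OCR."""
    return all(status == "ok" for status in report.values())


if __name__ == "__main__":
    model_values = {"2021": 41.9, "2022": 45.0, "2023": 52.0}  # hypothetical model output
    ocr_values = {"2021": 42.0, "2022": 54.0}                  # hypothetical OCR output
    report = verify_extraction(model_values, ocr_values)
    print(report)
    print(is_trusted(report))
```

Treating "unverified" as untrusted is the conservative choice: values that OCR could not read, such as points with no printed data label, are routed to manual review instead of flowing into a dashboard.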

Optimization Features

Training Optimization
LoRA
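LoRA keeps a pretrained weight matrix frozen and trains only a low-rank update, so a layer with a d_out x d_in weight trains rank x (d_in + d_out) parameters instead of d_in x d_out. The sketch below works through that arithmetic. The layer size (4096 x 4096) and rank (8) are hypothetical values chosen for illustration; the summary does not state MMCA's actual LoRA configuration.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one linear layer.

    full: parameters updated when fine-tuning the whole d_out x d_in matrix.
    lora: parameters in the two low-rank factors, B (d_out x rank) and
          A (rank x d_in), which are the only trained weights under LoRA.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


if __name__ == "__main__":
    # Hypothetical layer size and rank, not MMCA's reported settings.
    full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=8)
    print(full)                         # 16777216
    print(lora)                         # 65536
    print(f"{100 * lora / full:.2f}%")  # 0.39%
```

Under these assumed sizes, LoRA trains well under 1% of the layer's parameters, which is why it is listed as a training optimization.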

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Instruction data is partly generated by GPT-4 and can contain errors or hallucinations (the authors report that ~85% of outputs are acceptable).

Chart-to-datatable and chart-to-json extraction remain low-accuracy tasks even for top models.

When Not To Use

When you need exact, lossless extraction of all numeric values from charts.

When legal or privacy rules forbid sharing chart images with third-party models.

Failure Modes

Vision perception error: the model misreads plot elements or values.

Language bias: the model answers from prior knowledge rather than chart evidence.

Core Entities

Models

MMCA, mPLUG-Owl, GPT-4V, LLaVA1.5, MiniGPT-v2, LRV-Instruction, Pix2Struct, Donut, BLIP-2, InstructBLIP, Shikra, Vicuna

Metrics

free-form correctness (GPT-4 scoring), Accuracy

Datasets

MMC-Instruction, MMC-Benchmark, ChartQA, PlotQA, DVQA, FigureQA, SciGraphQA, Statista, VisText, ChartInfo, Unichart, DocVQA, TextVQA

Benchmarks

MMC-Benchmark, ChartQA, DocVQA, TextVQA

Context Entities

Models

GPT-4 (text-only for data generation), gpt-4-32k-0314 (GPT-4 used for eval prompts)

Datasets

arXiv Scientific Chart-Caption corpus, Public chart datasets used for augmentation (Statista, PlotQA, VisText, ChartInfo, Unichart)