600k-chart instruction data + a human benchmark to improve multimodal chart QA

Overview

Decision Snapshot: Ready For Pilot

The dataset and benchmark provide clear gains for chart reasoning and classification, but numeric extraction remains unreliable; expect to need OCR and verification for production.

Citations: 4

Evidence Strength: 0.78

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 60%

Authors

Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, Dong Yu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Automate chart reading and QA by fine-tuning multimodal LLMs with domain-specific chart instructions; expect better classification and reasoning but not perfect numeric table extraction.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

This paper releases MMC-Instruction, a 600k-instance instruction-tuning dataset for chart understanding, plus a 2k-item human-annotated MMC-Benchmark covering nine chart tasks. The authors fine-tune a large multimodal model (LMM), MMCA, with a two-stage recipe (chart-text alignment, then LoRA-based instruction tuning) and show that it outperforms prior open-source LMMs on chart QA and related tasks. Large gaps remain: even GPT-4V struggles with precise chart-to-table and chart-to-JSON extraction, and many models fail at OCR, layout reasoning, and instruction following.

Problem Statement

Current large multimodal models miss chart-specific skills (text layout, numeric extraction, chart reasoning). The paper aims to supply large, diverse training data and an evaluation benchmark to teach and measure chart understanding in LMMs.

Main Contribution

MMC-Instruction: a 600k-instance chart instruction-tuning corpus combining 210k chart-caption pairs, ~190k filtered public pairs, and 200k GPT-4-generated instruction examples.

MMC-Benchmark: a human-annotated benchmark (~2k questions/images) covering nine chart-focused tasks and two evaluation protocols (GPT-4 generation scoring and MQA multiple-choice).
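The MQA protocol mentioned above comes down to exact-match accuracy over option letters. The following is a minimal sketch of that scoring step, not the authors' evaluation script: the function names `parse_choice` and `mqa_accuracy`, the four-letter option set, and the parsing rule are all assumptions made for illustration.

```python
import re

def parse_choice(reply, options="ABCD"):
    """Return the first standalone uppercase option letter in a reply, or None.

    Assumption: the model answers with a letter such as "B" or "(C)".
    A reply that starts with the article "A" would be misread as option A,
    so a production parser needs a stricter prompt or pattern.
    """
    match = re.search(r"\b([%s])\b" % options, reply.strip())
    return match.group(1) if match else None

def mqa_accuracy(replies, gold):
    """Fraction of replies whose parsed letter equals the gold letter."""
    if len(replies) != len(gold):
        raise ValueError("replies and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(parse_choice(r) == g for r, g in zip(replies, gold))
    return correct / len(gold)

# Hypothetical replies for four benchmark items.
replies = ["B", "The answer is (C).", "I think D", "no idea"]
gold = ["B", "C", "A", "A"]
print(mqa_accuracy(replies, gold))
```

Unparseable replies count as wrong here, which matches the usual treatment of instruction-following failures in multiple-choice scoring.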

Key Findings

Large instruction corpus improves open-source LMMs on chart tasks.

Numbers: MMCA overall free-form 0.26 vs prior open-source best 0.24 (Table 4)

Practical Use: Fine-tuning LMMs with diverse chart instruction data yields measurable gains; try instruction-tuning on domain charts to raise accuracy.

Evidence Ref: Table 4

MMCA raises multiple-choice (MQA) accuracy over baselines.

Numbers: MMCA MQA overall 0.56 vs LLaVA1.5 0.51 (Table 5)

Practical Use: For classification or multiple-choice chart tasks, instruction-tuned LMMs give a meaningful lift; use MQA-style evaluation for quick benchmarking.

Evidence Ref: Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
MMC-Benchmark overall (free-form, GPT-4 judged)	MMCA 0.26, GPT-4V 0.51	open-source LMMs ~0.17-0.24	MMCA +~0.02 over best open-source baseline	MMC-Benchmark (free-form)	Table 4: overall scores from free-form GPT-4 evaluation	Table 4
MMC-Benchmark overall (MQA multiple-choice)	MMCA 0.56, GPT-4V 0.76	LLaVA1.5 0.51	MMCA +0.05 over LLaVA1.5	MMC-Benchmark (MQA)	Table 5: MQA accuracy	Table 5
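The deltas in the table can be recomputed from the reported scores. The sketch below does that and also derives the remaining gap to GPT-4V, which the table implies but does not state. The dictionary layout and the names `scores` and `deltas` are illustrative assumptions; the numbers are the ones reported above.

```python
# Overall scores as reported in Table 4 (free-form) and Table 5 (MQA).
scores = {
    "free-form": {"MMCA": 0.26, "baseline": 0.24, "GPT-4V": 0.51},
    "MQA": {"MMCA": 0.56, "baseline": 0.51, "GPT-4V": 0.76},
}

def deltas(row):
    """Gain of MMCA over the open-source baseline, and its gap to GPT-4V."""
    return {
        "gain_over_baseline": round(row["MMCA"] - row["baseline"], 2),
        "gap_to_gpt4v": round(row["GPT-4V"] - row["MMCA"], 2),
    }

for protocol, row in scores.items():
    print(protocol, deltas(row))
```

On both protocols the gap to GPT-4V (0.25 and 0.20) is several times larger than the gain over the open-source baseline (0.02 and 0.05).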

What To Try In 7 Days

Run MMCA (or fine-tune an LMM with MMC-Instruction) on a small set of your company charts to measure gains on classification and reasoning.

Add an OCR-based verification stage for numeric extraction before trusting model outputs in BI dashboards.

Use the MMC-Benchmark tasks and MQA protocol to baseline current tools on your chart types.
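One way to approach the OCR verification step is to cross-check every number the model reports against the numbers an OCR pass actually found on the chart. The sketch below assumes the OCR output is already available as a list of text tokens; no specific OCR library is implied, and the names `numbers_in` and `unverified_values` and the 1% tolerance are hypothetical choices.

```python
import math
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def numbers_in(ocr_tokens):
    """Collect every numeric literal found in a list of OCR text tokens."""
    found = []
    for token in ocr_tokens:
        cleaned = token.replace(",", "")  # drop thousands separators
        found.extend(float(m) for m in NUMBER.findall(cleaned))
    return found

def unverified_values(model_values, ocr_tokens, rel_tol=0.01):
    """Return model-reported values with no OCR number within rel_tol."""
    ocr_numbers = numbers_in(ocr_tokens)
    return [
        v for v in model_values
        if not any(math.isclose(v, n, rel_tol=rel_tol) for n in ocr_numbers)
    ]

# Hypothetical example: the model reports three values, OCR confirms two.
ocr_tokens = ["Revenue", "1,250", "2021", "37.5%"]
print(unverified_values([1250.0, 37.5, 42.0], ocr_tokens))
```

This check only covers values printed as text on the chart (data labels, tick labels). Values the model estimates from bar heights or line positions have no OCR counterpart and would need a separate review path.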

Optimization Features

Training Optimization

LoRA
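For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update in plain Python: a frozen weight matrix W is adjusted by a scaled product of two small trainable matrices, B and A. It illustrates the general technique only and is not the authors' training code; the function names and the toy matrices are assumptions.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A, with r the LoRA rank.

    w: frozen weights, shape (d_out, d_in)
    a: trainable, shape (r, d_in)
    b: trainable, shape (d_out, r)
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

def trainable_params(d_out, d_in, r):
    """Parameter count of the LoRA factors versus the full matrix."""
    return r * (d_in + d_out), d_out * d_in

w = [[1.0, 2.0], [3.0, 4.0]]
a = [[1.0, 0.0]]        # rank r = 1
b_zero = [[0.0], [0.0]]  # B starts at zero, so the update is a no-op
print(lora_weight(w, a, b_zero, alpha=2.0))
print(trainable_params(4096, 4096, 8))
```

Only A and B are trained, which is why the method is listed under training optimization: for a 4096 x 4096 layer at rank 8, the factors hold 65,536 parameters against 16,777,216 in the full matrix.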

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/FuxiaoLiu/MMC

Data URLs

https://github.com/FuxiaoLiu/MMC

Risks & Boundaries

Limitations

The instruction data is partly generated by GPT-4 and can contain errors or hallucinations (the authors report that about 85% of outputs are acceptable).

Chart-to-datatable and chart-to-json extraction remain low-accuracy tasks even for top models.

When Not To Use

When you need exact, lossless extraction of all numeric values from charts.

When legal or privacy rules forbid sharing chart images with third-party models.

Failure Modes

Vision perception error: the model misreads plot elements or values.

Language bias: the model answers from prior knowledge rather than from chart evidence.

Core Entities

Models

MMCA, mPLUG-Owl, GPT-4V, LLaVA1.5, MiniGPT-v2, LRV-Instruction, Pix2Struct, Donut, BLIP-2, InstructBLIP, Shikra, Vicuna

Metrics

Free-form correctness (GPT-4 scoring), Accuracy

Datasets

MMC-Instruction, MMC-Benchmark, ChartQA, PlotQA, DVQA, FigureQA, SciGraphQA, Statista, VisText, ChartInfo, Unichart, DocVQA, TextVQA

Benchmarks

MMC-Benchmark, ChartQA, DocVQA, TextVQA

Context Entities

Models

GPT-4 (text-only, for data generation), gpt-4-32k-0314 (GPT-4 used for eval prompts)

Datasets

arXiv Scientific Chart-Caption corpus; public chart datasets used for augmentation (Statista, PlotQA, VisText, ChartInfo, Unichart)

