SimpleVQA — a 2,025-sample bilingual VQA benchmark that tests multimodal LLM factuality with atomic-fact probes

Overview

Decision Snapshot: Needs Validation

The benchmark is ready as a focused evaluation tool; results are supported by multi-model tables and human checks, but public code and dataset URLs are not provided in the paper.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, Yutao Zeng, Zhoufutu Wen, Ke Jin, Baorui Wang, Weixiao Zhou, Yunhong Lu, Tongliang Li, Wenhao Huang, Zhoujun Li

Why It Matters For Business

SimpleVQA surfaces concrete factual weaknesses in vision-capable LLMs, reducing risk if you rely on multimodal outputs for decisions; run it to find brittle visual facts before deployment.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

SimpleVQA is a bilingual (Chinese/English) visual question-answering benchmark of 2,025 short, fact-seeking image Q&As designed to probe factuality in multimodal LLMs. It focuses on time-stable, verifiable facts across 9 tasks and 9 domains. The dataset was curated with LLM-assisted generation, multi-stage human verification, and difficulty filtering. The authors evaluate 18 multimodal LLMs and 8 text-only LLMs, show broad factual weaknesses (many models score in the 30–56% F-score range), and introduce atomic-fact probing to separate failures due to poor visual recognition from missing internal knowledge.

Problem Statement

Current benchmarks measure vision-language skills but miss short-form factuality, which depends on both image understanding and stored knowledge. The field lacks a concise, verifiable VQA set that (a) targets time-stable facts, (b) is bilingual, and (c) separates visual-comprehension failures from internal-knowledge failures.

Main Contribution

SimpleVQA: a curated bilingual VQA benchmark with 2,025 short fact questions and static reference answers.

Nine task categories and nine domain topics to cover varied factual queries (e.g., object ID, time/event, text processing).

Key Findings

SimpleVQA contains 2,025 high-quality Q&A samples across 9 tasks and 9 domains.

Numbers: Dataset size = 2,025; 9 tasks; 9 domains (Table 2)

Practical Use: Use SimpleVQA to stress-test short, verifiable visual facts rather than broad multimodal reasoning.

Evidence Ref: Section 2.1, Table 2

Most evaluated MLLMs show modest factual accuracy on this benchmark.

Numbers: Reported F-scores cluster roughly between 30% and 56% across models; the top closed-source model, Gemini-2.0-flash, reaches F≈56.3 (Table 3)

Practical Use: Expect large room for improvement; do not assume state-of-the-art MLLMs are reliably factual on image-grounded short questions.

Evidence Ref: Section 3.4, Table 3
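The F-scores above are built from the per-item grades listed under Metrics (CO, NA, IN, CGA). A minimal sketch of that aggregation follows, assuming SimpleVQA uses the SimpleQA-style definitions: CO is the share of correct answers over all items, CGA is the share of correct answers among attempted items, and F-score is the harmonic mean of CO and CGA. The function name and the label strings are hypothetical, chosen only for this illustration.

```python
from collections import Counter

# Assumed label vocabulary for this sketch; the paper's judge may use other names.
VALID_LABELS = {"correct", "incorrect", "not_attempted"}


def simpleqa_style_metrics(grades):
    """Aggregate per-item judge labels into CO, NA, IN, CGA and F-score.

    Assumes SimpleQA-style definitions (an assumption, not confirmed here):
      CO  = correct / all items
      NA  = not attempted / all items
      IN  = incorrect / all items
      CGA = correct / (correct + incorrect)
      F   = harmonic mean of CO and CGA
    """
    if not grades:
        raise ValueError("grades must be non-empty")
    counts = Counter(grades)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")

    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]

    co = counts["correct"] / total
    na = counts["not_attempted"] / total
    inc = counts["incorrect"] / total
    cga = counts["correct"] / attempted if attempted else 0.0
    f_score = 2 * co * cga / (co + cga) if (co + cga) else 0.0

    return {"CO": co, "NA": na, "IN": inc, "CGA": cga, "F-score": f_score}


if __name__ == "__main__":
    # Toy grades, not taken from the paper.
    toy = ["correct"] * 4 + ["incorrect"] * 4 + ["not_attempted"] * 2
    for name, value in simpleqa_style_metrics(toy).items():
        print(f"{name}: {value:.3f}")
```

Under these definitions a model that declines to answer when unsure raises CGA without raising CO, so the harmonic mean rewards calibrated abstention only up to a point.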

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
SimpleVQA dataset size	2,025 samples	—	—	—	Dataset constructed and reported in Sec 2 and Table 2	Section 2.1, Table 2
Models evaluated	18 multimodal LLMs; 8 text-only LLMs	—	—	—	Evaluation setup described in Sec 3.2 and Appendix F	Section 3.2, Appendix F

What To Try In 7 Days

Run SimpleVQA on your model to get a quick factuality baseline.

Use atomic-fact probing: convert failing items into smaller visual-fact questions and retest to locate the cause of failure.

Add focused SFT (supervised fine-tuning) data for the visual labels SimpleVQA shows to be weak, then re-evaluate to measure the performance delta.
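The atomic-fact probing step can be organized as a small triage routine. The sketch below assumes three boolean outcomes per item: the original answer, the answer to a standalone atomic visual-fact question, and the answer to the original question once the atomic fact is supplied in the prompt. This three-flag scheme, the class, and the category names are assumptions made for illustration and are not the paper's exact protocol.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Hypothetical record of one item's probing outcomes."""
    item_id: str
    original_correct: bool       # full question answered correctly
    atomic_correct: bool         # atomic visual-fact question answered correctly
    given_atomic_correct: bool   # full question correct once the atomic fact is supplied


def diagnose(result):
    """Map one item's outcomes to a coarse failure category."""
    if result.original_correct:
        return "ok"
    if not result.atomic_correct and result.given_atomic_correct:
        # The model knows the answer but could not read the image.
        return "visual_recognition"
    if result.atomic_correct and not result.given_atomic_correct:
        # The model read the image but lacks the stored fact.
        return "missing_knowledge"
    if not result.atomic_correct and not result.given_atomic_correct:
        return "both"
    # Atomic fact recognized and sufficient when supplied, yet the original failed.
    return "unexplained"


def summarize(results):
    """Count items per failure category."""
    return Counter(diagnose(r) for r in results)


if __name__ == "__main__":
    # Toy outcomes, not taken from the paper.
    demo = [
        ProbeResult("q1", True, True, True),
        ProbeResult("q2", False, False, True),
        ProbeResult("q3", False, True, False),
    ]
    print(summarize(demo))
```

Items in the visual_recognition bucket are the natural candidates for the focused SFT data mentioned above; items in missing_knowledge are unlikely to improve from more visual labels alone.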

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Relatively small: 2,025 examples may not cover all visual facts and long-tail cases.

No public dataset URL or code in paper text; replicability may require contacting authors.
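To put the size limitation in numbers, the sketch below estimates sampling noise for an accuracy-style score using the normal approximation for a binomial proportion. It assumes items are independent and, for the per-task figure, that the 2,025 items split evenly across the 9 tasks (225 each); the actual per-task counts are not given here, and the F-score is not a simple proportion, so treat the result as a rough guide only.

```python
import math


def proportion_margin(p, n, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion.

    p: observed proportion in [0, 1]; n: number of items; z: 1.96 for ~95%.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Worst case p = 0.5 on the full benchmark: sqrt(0.25 / 2025) = 0.5 / 45.
    full_set = proportion_margin(0.5, 2025)
    # Hypothetical even split: 2025 / 9 = 225 items per task.
    per_task = proportion_margin(0.5, 225)
    print(f"full set:  +/- {100 * full_set:.1f} points")
    print(f"per task:  +/- {100 * per_task:.1f} points")
```

With these assumptions the full-set margin is roughly 2 percentage points and the per-task margin roughly 6.5, so small gaps between models on a single task should be read with caution.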

When Not To Use

Not ideal when you need broad multimodal reasoning tasks or long-form answers.

Not suitable for political, safety, or time-sensitive evaluations (excluded by design).

Failure Modes

Visual recognition errors (object/person ID, text-in-image), detectable via atomic probes.

Missing internalized knowledge even when the visual cue is provided (supplying the atomic fact yields only small gains).

Core Entities

Models

GPT-4o, GPT-4o-mini, Gemini-2.0-flash, Claude-3.5-Sonnet, Qwen2.5-VL-72B-Instruct, InternVL2.5-78B-MPO, Doubao-vision-pro-128k, ERNIE-VL

Metrics

CO, NA, IN, CGA, F-score

Datasets

SimpleVQA, SimpleQA, Chinese SimpleQA, MMbench, MMVet, MME, Dynamath, MMbench_CN, CCBench

Benchmarks

SimpleVQA, SimpleQA, Chinese SimpleQA, MMbench, Dynamath

Context Entities

Models

Qwen-Max, InternVL2-Llama3-76B, Janus-pro-7B, DeepSeek-R1

Metrics

Accuracy, LLM-as-a-Judge

Datasets

MM-Vet, MMMU, MMMU-Pro

Benchmarks

ChineseFactEval, AGI-Eval, C-Eval
