SimpleVQA — a 2,025-sample bilingual VQA benchmark that tests multimodal LLM factuality with atomic-fact probes

February 18, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark is ready as a focused evaluation tool; results are supported by multi-model tables and human checks, but public code and dataset URLs are not provided in the paper.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, Yutao Zeng, Zhoufutu Wen, Ke Jin, Baorui Wang, Weixiao Zhou, Yunhong Lu, Tongliang Li, Wenhao Huang, Zhoujun Li

Why It Matters For Business

SimpleVQA surfaces concrete factual weaknesses in vision-capable LLMs, reducing risk if you rely on multimodal outputs for decisions; run it to find brittle visual facts before deployment.

Who Should Care

Summary TLDR

SimpleVQA is a bilingual (Chinese/English) visual question-answering benchmark of 2,025 short, fact-seeking image Q&As designed to probe factuality in multimodal LLMs. It focuses on time-stable, verifiable facts across 9 tasks and 9 domains. The dataset was curated with LLM-assisted generation, multi-stage human verification, and difficulty filtering. The authors evaluate 18 multimodal LLMs and 8 text-only LLMs, show broad factual weaknesses (many models score in the 30–56% F-score range), and introduce atomic-fact probing to separate failures due to poor visual recognition from missing internal knowledge.

Problem Statement

Current benchmarks measure vision-language skills but miss short-form factuality that depends on both image understanding and stored knowledge. The field lacks a concise, verifiable VQA set that: (a) targets time-stable facts; (b) is bilingual; (c) isolates visual comprehension vs. internal knowledge failures.

Main Contribution

SimpleVQA: a curated bilingual VQA benchmark with 2,025 short fact questions and static reference answers.

Nine task categories and nine domain topics to cover varied factual queries (e.g., object ID, time/event, text processing).

Key Findings

SimpleVQA contains 2,025 high-quality Q&A samples across 9 tasks and 9 domains.

Numbers: Dataset size = 2,025; 9 tasks; 9 domains (Table 2)

Practical Use: Use SimpleVQA to stress-test short, verifiable visual facts rather than broad multimodal reasoning.

Evidence Ref: Section 2.1, Table 2

Most evaluated MLLMs show modest factual accuracy on this benchmark.

Numbers: Reported F-scores cluster at roughly 30–56% across models; top closed-source model F≈56.3 (Gemini-2.0-flash) (Table 3)

Practical Use: Expect large room for improvement; do not assume state-of-the-art MLLMs are reliably factual on image-grounded short questions.

Evidence Ref: Section 3.4, Table 3
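The F-scores in this finding are easier to read with the metric written out. The sketch below assumes the benchmark follows the SimpleQA-style convention, in which each answer is graded as correct (CO), incorrect (IN), or not attempted (NA), and the F-score is the harmonic mean of overall correctness and correctness given an attempt (CGA). That convention is an assumption here; the function name and label strings are illustrative, and the exact definition should be checked against the paper.

```python
from collections import Counter


def simpleqa_style_scores(grades):
    """Compute CO, IN, NA, CGA and F-score from per-item grades.

    `grades` is an iterable of the strings "CO", "IN" or "NA".
    Assumes the SimpleQA-style definition: F = harmonic mean of CO and CGA.
    """
    counts = Counter(grades)
    unknown = set(counts) - {"CO", "IN", "NA"}
    if unknown:
        raise ValueError(f"unexpected grade labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no grades supplied")

    co = counts["CO"] / total
    inc = counts["IN"] / total
    na = counts["NA"] / total

    # Correct-given-attempted ignores the items the model declined to answer.
    attempted = counts["CO"] + counts["IN"]
    cga = counts["CO"] / attempted if attempted else 0.0

    f_score = 2 * co * cga / (co + cga) if (co + cga) else 0.0
    return {"CO": co, "IN": inc, "NA": na, "CGA": cga, "F": f_score}


if __name__ == "__main__":
    example = ["CO"] * 5 + ["IN"] * 3 + ["NA"] * 2
    print(simpleqa_style_scores(example))
```

Under this definition a model that abstains more often can raise CGA while lowering CO, so the F-score rewards neither blind guessing nor blanket refusal.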

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
SimpleVQA dataset size | 2,025 samples | | | | Dataset constructed and reported in Sec 2 and Table 2 | Section 2.1, Table 2
Models evaluated | 18 multimodal LLMs; 8 text-only LLMs | | | | Evaluation setup described in Sec 3.2 and Appendix F | Section 3.2, Appendix F

What To Try In 7 Days

Run SimpleVQA on your model to get a quick factuality baseline.

Use atomic-fact probing: convert failing items into smaller visual facts and retest to locate failure cause.

Add focused SFT (supervised fine-tuning) data for visual labels found weak by SimpleVQA and re-evaluate performance delta.
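The baseline-then-re-evaluate loop in the steps above can be tracked with a per-task breakdown. This is a minimal sketch, not the authors' tooling: the record format (a task name paired with a pass/fail flag), the task names, and both function names are hypothetical, and the pass/fail flags are assumed to come from whatever grader you already use.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy} from (task, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for task, is_correct in records:
        totals[task] += 1
        correct[task] += int(bool(is_correct))
    return {task: correct[task] / totals[task] for task in totals}


def accuracy_delta(before, after):
    """Per-task change in accuracy for tasks present in both runs."""
    return {task: after[task] - before[task] for task in before if task in after}


if __name__ == "__main__":
    # Hypothetical graded results before and after adding SFT data.
    baseline = [("object_id", True), ("object_id", False),
                ("text_processing", False), ("text_processing", False)]
    after_sft = [("object_id", True), ("object_id", True),
                 ("text_processing", True), ("text_processing", False)]

    before = per_task_accuracy(baseline)
    after = per_task_accuracy(after_sft)
    for task, change in sorted(accuracy_delta(before, after).items()):
        print(f"{task}: {before[task]:.2f} -> {after[task]:.2f} ({change:+.2f})")
```

Reporting the change per task rather than as one aggregate number makes it easier to see whether new training data helped the weak categories or only shifted errors elsewhere.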

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Relatively small: 2,025 examples may not cover all visual facts and long-tail cases.

No public dataset URL or code in paper text; replicability may require contacting authors.

When Not To Use

Not ideal when you need to evaluate broad multimodal reasoning or long-form answers.

Not suitable for political, safety, or time-sensitive evaluations (excluded by design).

Failure Modes

Visual recognition errors (object/person ID, text-in-image)—detectable via atomic probes.

Missing internalized knowledge even when the visual cue is provided (supplying the atomic fact yields only small gains).
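The two failure modes above can be told apart with a simple triage over items the model got wrong. This is an illustrative sketch, not the paper's procedure: it assumes each failing item has been retested twice, once with an atomic question about the visual fact alone and once with that fact supplied in the prompt. The category labels, function names, and the four-way split are assumptions made for the example.

```python
from collections import Counter


def triage_failure(atomic_correct, given_atomic_correct):
    """Classify one failing item from two retests.

    atomic_correct: model answered the atomic visual question correctly.
    given_atomic_correct: model answered the original question correctly
        once the atomic fact was stated in the prompt.
    """
    if not atomic_correct and given_atomic_correct:
        return "visual_recognition"   # knowledge is there, perception failed
    if atomic_correct and not given_atomic_correct:
        return "missing_knowledge"    # perception is fine, the fact is absent
    if not atomic_correct and not given_atomic_correct:
        return "both"
    return "other"                    # both retests pass, original still failed


def summarize_failures(retests):
    """Count failure causes over (atomic_correct, given_atomic_correct) pairs."""
    return Counter(triage_failure(a, g) for a, g in retests)


if __name__ == "__main__":
    # Hypothetical retest outcomes for four failing items.
    retests = [(False, True), (True, False), (False, False), (True, False)]
    print(summarize_failures(retests))
```

A high share of "missing_knowledge" suggests that better vision encoders alone will not close the gap, while a high share of "visual_recognition" points toward perception or OCR improvements.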

Core Entities

Models

GPT-4o, GPT-4o-mini, Gemini-2.0-flash, Claude-3.5-Sonnet, Qwen2.5-VL-72B-Instruct, InternVL2.5-78B-MPO, Doubao-vision-pro-128k, ERNIE-VL

Metrics

CO, NA, IN, CGA, F-score

Datasets

SimpleVQA, SimpleQA, Chinese SimpleQA, MMbench, MMVet, MME, Dynamath, MMbench_CN, CCBench

Benchmarks

SimpleVQA, SimpleQA, Chinese SimpleQA, MMbench, Dynamath

Context Entities

Models

Qwen-Max, InternVL2-Llama3-76B, Janus-pro-7B, DeepSeek-R1

Metrics

Accuracy, LLM-as-a-Judge

Datasets

MM-Vet, MMMU, MMMU-Pro

Benchmarks

ChineseFactEval, AGI-Eval, C-Eval