SimpleVQA — a 2,025-sample bilingual VQA benchmark that tests multimodal LLM factuality with atomic-fact probes

February 18, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, Yutao Zeng, Zhoufutu Wen, Ke Jin, Baorui Wang, Weixiao Zhou, Yunhong Lu, Tongliang Li, Wenhao Huang, Zhoujun Li

Links

Abstract / PDF

Why It Matters For Business

SimpleVQA surfaces concrete factual weaknesses in vision-capable LLMs, reducing risk if you rely on multimodal outputs for decisions; run it to find brittle visual facts before deployment.

Summary TLDR

SimpleVQA is a bilingual (Chinese/English) visual question-answering benchmark of 2,025 short, fact-seeking image Q&As designed to probe factuality in multimodal LLMs. It focuses on time-stable, verifiable facts across 9 tasks and 9 domains. The dataset was curated with LLM-assisted generation, multi-stage human verification, and difficulty filtering. The authors evaluate 18 multimodal LLMs and 8 text-only LLMs, show broad factual weaknesses (many models score in the 30–56% F-score range), and introduce atomic-fact probing to separate failures due to poor visual recognition from missing internal knowledge.

Problem Statement

Current benchmarks measure vision-language skills but miss short-form factuality that depends on both image understanding and stored knowledge. The field lacks a concise, verifiable VQA set that: (a) targets time-stable facts; (b) is bilingual; (c) isolates visual comprehension vs. internal knowledge failures.

Main Contribution

SimpleVQA: a curated bilingual VQA benchmark with 2,025 short fact questions and static reference answers.

Nine task categories and nine domain topics to cover varied factual queries (e.g., object ID, time/event, text processing).

A multi-stage pipeline: seed collection, GPT-4o-assisted refinement, LLM screening, multi-annotator human checks, and difficulty filtering.

An atomic-fact probing protocol that turns failures into smaller "atomic" questions to distinguish visual recognition vs. knowledge gaps.

A large-scale evaluation: 18 MLLMs and 8 text-only LLMs scored with an LLM-as-a-judge setup and human verification.

Key Findings

SimpleVQA contains 2,025 high-quality Q&A samples across 9 tasks and 9 domains.

Numbers: Dataset size = 2,025; 9 tasks; 9 domains (Table 2)

Most evaluated MLLMs show modest factual accuracy on this benchmark.

Numbers: Reported F-scores cluster at roughly 30–56% across models; top closed-source model F ≈ 56.3 (Gemini-2.0-flash) (Table 3)

Atomic-fact hints often boost final answer accuracy, separating visual vs. knowledge failures.

Numbers: InternVL2.5-78B-MPO: Origin 55.36% → Atomic-Given 69.95% (+14.59 points); Qwen2.5-VL-72B-Instruct: 51.67% → 64.15% (+12.48 points)
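The gains above are absolute differences in percentage points, not relative improvements. A minimal sketch of that arithmetic, using only the two Origin and Atomic-Given figures quoted here (the variable and function names are illustrative, not from the paper):

```python
# Origin vs. Atomic-Given accuracy in percent, copied from the figures above.
reported = {
    "InternVL2.5-78B-MPO": (55.36, 69.95),
    "Qwen2.5-VL-72B-Instruct": (51.67, 64.15),
}


def gain_points(origin: float, atomic_given: float) -> float:
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(atomic_given - origin, 2)


gains = {name: gain_points(*pair) for name, pair in reported.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```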

The dataset was purposely filtered for difficulty using other MLLMs.

Numbers: Initial pool 8,360 Q&As → retained 2,025 (about 24% kept); removed 1,108 via multi-model testing (Section 2.4)
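One way to picture difficulty filtering is sketched below. The exact rule the authors applied is not stated in this summary, so the rule here (drop an item when every screening model already answers it correctly) is an assumption, as are the `Candidate` class, its `solved_by` field, and the model names. The retention rate at the end uses the reported counts.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    question: str
    # Hypothetical field: whether each screening model answered correctly.
    solved_by: dict[str, bool]


def difficulty_filter(candidates: list[Candidate]) -> list[Candidate]:
    """Keep items that at least one screening model gets wrong (assumed rule)."""
    return [c for c in candidates if not all(c.solved_by.values())]


pool = [
    Candidate("easy item", {"model_a": True, "model_b": True}),
    Candidate("hard item", {"model_a": False, "model_b": True}),
    Candidate("very hard item", {"model_a": False, "model_b": False}),
]
kept = difficulty_filter(pool)
print([c.question for c in kept])

# Retention rate from the reported counts (2,025 kept out of 8,360).
retention = 2025 / 8360
print(f"retained {retention:.1%} of the initial pool")
```

A stricter variant could require that a majority of screening models fail; the threshold is a design choice that trades benchmark difficulty against coverage of everyday questions.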

Annotation costs and verification were non-trivial but affordable.

Numbers: Annotation cost ≈ $5,202 for 2,025 Q&As (Appendix A)
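For budgeting a similar effort, the reported total can be turned into a per-item figure. This is plain arithmetic on the two numbers above; it assumes the full cost is attributed to the retained items only.

```python
annotation_cost_usd = 5202  # reported total (Appendix A)
retained_items = 2025       # final benchmark size

cost_per_item = annotation_cost_usd / retained_items
print(f"${cost_per_item:.2f} per retained Q&A")
```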

Results

SimpleVQA dataset size

Value: 2,025 samples

Models evaluated

Value: 18 multimodal LLMs; 8 text-only LLMs

Top reported F-score (approx.)

Value: ≈56.3 (Gemini-2.0-flash on Chinese split)

Baseline: many models in the 30–50% range

CFQ (atomic-given) improvement

Value: InternVL2.5-78B-MPO +14.59 points (55.36 → 69.95)

Baseline: Origin CO = 55.36%

Who Should Care

What To Try In 7 Days

Run SimpleVQA on your model to get a quick factuality baseline.

Use atomic-fact probing: convert failing items into smaller visual-fact questions and retest to locate the cause of each failure.

Add focused SFT (supervised fine-tuning) data for visual labels found weak by SimpleVQA and re-evaluate performance delta.
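The atomic-fact step can be sketched as a small triage routine. This is an illustrative reading of the protocol, not the authors' code: the `ProbeResult` fields and the four diagnosis labels are hypothetical, and the correctness flags are assumed to come from whatever grader you already use.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ProbeResult:
    item_id: str
    original_correct: bool      # full question answered correctly
    atomic_correct: bool        # atomic visual question (e.g. "who is shown?") answered correctly
    atomic_given_correct: bool  # full question answered correctly once the atomic fact is supplied


def diagnose(r: ProbeResult) -> str:
    """Assign a likely failure cause to one probed item (labels are illustrative)."""
    if r.original_correct:
        return "ok"
    if not r.atomic_correct and r.atomic_given_correct:
        return "visual_recognition"    # knows the fact once told what is in the image
    if r.atomic_correct and not r.atomic_given_correct:
        return "knowledge_gap"         # sees the entity but lacks the fact about it
    if not r.atomic_correct and not r.atomic_given_correct:
        return "visual_and_knowledge"  # fails on both counts
    return "inconsistent"              # has both pieces yet missed the original question


results = [
    ProbeResult("q1", True, True, True),
    ProbeResult("q2", False, False, True),
    ProbeResult("q3", False, True, False),
    ProbeResult("q4", False, False, False),
    ProbeResult("q5", False, True, True),
]
summary = Counter(diagnose(r) for r in results)
print(dict(summary))
```

Items tagged `visual_recognition` are candidates for the focused visual SFT data mentioned above; `knowledge_gap` items are unlikely to improve from better visual labels alone.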

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Relatively small: 2,025 examples may not cover all visual facts and long-tail cases.
  • No public dataset URL or code in paper text; replicability may require contacting authors.
  • Benchmark intentionally filters out easy items, so raw accuracy may understate general competence on everyday VQA.
  • LLM-as-judge can introduce evaluator bias; the authors used GPT-4o as the judge, and GPT-4o was also used during dataset construction, so the two may be aligned.

When Not To Use

  • Not ideal when you need broad multimodal reasoning tasks or long-form answers.
  • Not suitable for political, safety, or time-sensitive evaluations (excluded by design).
  • Not a replacement for large-scale domain-specific factual audits due to size limits.

Failure Modes

  • Visual recognition errors (object/person ID, text-in-image)—detectable via atomic probes.
  • Missing internalized knowledge even when visual cue is provided (small atomic-given gains).
  • Overconfidence and hallucination despite low factual accuracy (calibration issues shown).

Core Entities

Models

  • GPT-4o
  • GPT-4o-mini
  • Gemini-2.0-flash
  • Claude-3.5-Sonnet
  • Qwen2.5-VL-72B-Instruct
  • InternVL2.5-78B-MPO
  • Doubao-vision-pro-128k
  • ERNIE-VL

Metrics

  • CO
  • NA
  • IN
  • CGA
  • F-score
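
The metric abbreviations are not defined in this summary. The sketch below assumes they follow the SimpleQA convention: CO (correct), NA (not attempted), and IN (incorrect) are shares of all items, CGA is correct given attempted, and F-score is the harmonic mean of CO and CGA. The grade labels and the `summarize` function are hypothetical.

```python
from collections import Counter


def summarize(grades: list[str]) -> dict[str, float]:
    """Compute CO, NA, IN, CGA and F-score from per-item judge grades.

    Each grade is assumed to be "correct", "incorrect", or "not_attempted".
    """
    if not grades:
        raise ValueError("no grades to summarize")
    counts = Counter(grades)
    total = len(grades)
    co = counts["correct"] / total
    na = counts["not_attempted"] / total
    inc = counts["incorrect"] / total
    attempted = counts["correct"] + counts["incorrect"]
    cga = counts["correct"] / attempted if attempted else 0.0
    f_score = 2 * co * cga / (co + cga) if (co + cga) else 0.0
    return {"CO": co, "NA": na, "IN": inc, "CGA": cga, "F-score": f_score}


grades = ["correct"] * 5 + ["incorrect"] * 3 + ["not_attempted"] * 2
metrics = summarize(grades)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under this definition a model that declines to answer when unsure raises CGA without raising CO, which is why the F-score combines both.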

Datasets

  • SimpleVQA
  • SimpleQA
  • Chinese SimpleQA
  • MMbench
  • MM-Vet
  • MME
  • Dynamath
  • MMbench_CN
  • CCBench

Benchmarks

  • SimpleVQA
  • SimpleQA
  • Chinese SimpleQA
  • MMbench
  • Dynamath

Context Entities

Models

  • Qwen-Max
  • InternVL2-Llama3-76B
  • Janus-pro-7B
  • DeepSeek-R1

Metrics

  • Accuracy
  • LLM-as-a-Judge

Datasets

  • MM-Vet
  • MMMU
  • MMMU-Pro

Benchmarks

  • ChineseFactEval
  • AGI-Eval
  • C-Eval