Q-Bench: a focused benchmark that tests multimodal LLMs on low-level image perception, description, and human-aligned quality scoring

September 25, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The benchmark is practical and directly usable for evaluation. Results convincingly show current MLLM capabilities and limits, but practical deployment needs task-specific fine-tuning and prompt engineering.

Citations: 20

Evidence Strength: 0.80

Confidence: 0.78

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 50%

Novelty: 45%

Authors

Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, Weisi Lin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

MLLMs already detect many low-level image attributes and can produce human-correlated quality scores with a simple softmax trick; businesses can use them for scalable, early-stage quality triage and content moderation but should not replace specialist QA for fine-grained tasks.

Who Should Care

Summary TLDR

Q-Bench is a targeted benchmark that measures how multimodal large language models (MLLMs) handle low-level image tasks: (A1) perception (LLVisionQA, 2,990 images, multiple-choice), (A2) description (LLDescribe, 499 images, expert long descriptions), and (A3) quantitative image quality assessment (IQA) using softmax pooling over the logits of the top candidate tokens. Evaluations (15 open-source MLLM variants + GPT-4V) show that MLLMs already capture basic low-level cues (many beat random guessing and some exceed 60% accuracy on perception) but are unstable and imprecise on fine-grained judgments and detailed descriptions. Extracting scores from logits with a softmax improves numeric IQA correlations versus naive decoding. The paper

Problem Statement

Existing MLLM benchmarks focus on high-level vision tasks. There is no systematic, multimodal benchmark that tests whether MLLMs can perceive low-level image attributes (blur, noise, color, exposure), generate complete and precise low-level descriptions, and produce numeric quality scores aligned with human opinions. Q-Bench fills this gap with dedicated datasets and an evaluation pipeline.

Main Contribution

Q-Bench: a three-part benchmark to test MLLMs on low-level vision: perception, description, and quantitative assessment.

LLVisionQA: a 2,990-image perception dataset (multi-choice questions covering distortions, other attributes, global vs local, and three question types).

Key Findings

MLLMs show non-random perception ability but lag behind expert humans.

Numbers: InternLM-XComposer-VL overall accuracy 64.35%; GPT-4V 73.36%; Senior human 81.74% (LLVisionQA test)

Practical Use: MLLMs can already answer many low-level queries correctly. Expect workable but imperfect automated judgments; fine-tuning or task-specific prompts are still needed to reach expert human level.

Evidence Ref: Table 2

Low-level descriptive outputs are incomplete and imprecise on average.

Numbers: Best MLLM (InternLM) aggregate LLDescribe score 4.21/6; many models score ~3.0–3.9/6

Practical Use: Use MLLMs for rough textual summaries of low-level image traits, but do not rely on them for full, precise reports; consider hybrid pipelines with specialized detectors for critical attributes.

Evidence Ref: Table 3
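
To make the 6-point scale concrete, here is a minimal sketch of how an aggregate description score could be assembled from three judged dimensions (completeness, preciseness, relevance), each scored 0 to 2 over five judge rounds. The source only says that 5-round voting is used; averaging the rounds per dimension and summing the three averages is an assumption made here for illustration, and the round scores below are invented.

```python
from statistics import mean

DIMENSIONS = ("completeness", "preciseness", "relevance")

def aggregate_description_score(rounds):
    """Average each dimension over the judge rounds, then sum the averages.

    rounds: list of dicts, one per judge round, mapping dimension -> score in [0, 2].
    Returns (per-dimension averages, total on a 0-6 scale).
    Averaging across rounds is an assumption, not a confirmed detail of the benchmark.
    """
    for r in rounds:
        for d in DIMENSIONS:
            if not 0 <= r[d] <= 2:
                raise ValueError(f"{d} score {r[d]} is outside the 0-2 range")
    per_dim = {d: mean(r[d] for r in rounds) for d in DIMENSIONS}
    return per_dim, sum(per_dim.values())

# Invented scores for one description across five judge rounds.
example_rounds = [
    {"completeness": 1, "preciseness": 1, "relevance": 2},
    {"completeness": 1, "preciseness": 1, "relevance": 2},
    {"completeness": 2, "preciseness": 1, "relevance": 2},
    {"completeness": 1, "preciseness": 1, "relevance": 2},
    {"completeness": 1, "preciseness": 1, "relevance": 2},
]
per_dim, total = aggregate_description_score(example_rounds)
print(per_dim, round(total, 2))
```

Keeping the per-dimension averages alongside the total makes it easier to see whether a low score comes from missing content (completeness) or from wrong content (preciseness).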

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | InternLM-XComposer-VL 64.35%; many top open-source models ≈60–63%; GPT-4V 73.36%; Senior human 81.74% | random guess 37.94% | InternLM ≈ +26 points over random; GPT-4V ≈ +35 points | LLVisionQA test | Table 2 (Perception results) | Table 2 |
| Description aggregate score (completeness + precision + relevance) | Best MLLM 4.21 / 6 (InternLM); many models ~3.0–3.9 | no baseline; gold descriptions by experts | n/a | LLDescribe (499) | Table 3 (Description results) | Table 3 |
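
The accuracy deltas are differences in percentage points, not relative gains. The short sketch below recomputes them from the figures quoted for the LLVisionQA test split; the function name `gaps` is just an illustrative helper.

```python
# Accuracy figures (percent) quoted above for the LLVisionQA test split.
RANDOM_GUESS = 37.94
SENIOR_HUMAN = 81.74
accuracies = {"InternLM-XComposer-VL": 64.35, "GPT-4V": 73.36}

def gaps(acc, baseline=RANDOM_GUESS, ceiling=SENIOR_HUMAN):
    """Return (points above random guessing, points below the senior human)."""
    return round(acc - baseline, 2), round(ceiling - acc, 2)

for name, acc in accuracies.items():
    above, below = gaps(acc)
    print(f"{name}: +{above:.2f} points over random, {below:.2f} points below senior human")
```

Read this way, InternLM-XComposer-VL sits about 26.4 points above random and about 17.4 points below the senior human, while GPT-4V sits about 35.4 points above random and about 8.4 points below.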

What To Try In 7 Days

Run a small pilot: feed your image set to an open MLLM and extract IQA scores via the softmax-on-top-two-tokens trick to rank images quickly.

Use LLVisionQA-style multi-choice prompts to rapidly validate model behavior on your critical low-level attributes and spot yes/no bias.

Combine MLLM descriptions with a specialized detector (blur/noise extractor) and compare outputs on a 100-image sample to identify major failure modes.
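
The softmax-on-top-two-tokens step from the first item can be sketched as follows. This assumes you can already read the model's next-token logits for two opposing answer tokens (here taken to be "good" and "poor") after a quality prompt; how those logits are obtained depends on your model and serving stack and is not shown. The file names and logit values are hypothetical.

```python
import math

def softmax_quality_score(logit_good, logit_poor):
    """Softmax over two candidate-token logits; returns the probability mass on 'good' in [0, 1]."""
    m = max(logit_good, logit_poor)      # subtract the max for numerical stability
    e_good = math.exp(logit_good - m)
    e_poor = math.exp(logit_poor - m)
    return e_good / (e_good + e_poor)

# Hypothetical (logit_good, logit_poor) pairs for three images.
image_logits = {
    "a.jpg": (3.1, 0.4),
    "b.jpg": (0.2, 2.5),
    "c.jpg": (1.0, 1.0),
}
scores = {name: softmax_quality_score(g, p) for name, (g, p) in image_logits.items()}
ranking = sorted(scores, key=scores.get, reverse=True)   # best predicted quality first
print(ranking)
```

Because the score is continuous rather than a decoded word, it can be used directly to rank images or be correlated against human opinion scores.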

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

LLVisionQA contains 62% Yes questions, creating a yes-answer bias in evaluations; authors plan to add balanced reversed questions.

LLDescribe scoring relies on GPT as an automatic judge; GPT can hallucinate and introduce subjectivity despite 5-round voting.
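
One way to see how much a yes-heavy question set flatters a model is to report accuracy separately for gold "yes" and gold "no" questions, next to the score of a trivial always-yes answerer. The sketch below is illustrative only; the function name and the toy answers are invented, and the toy split (60% yes) merely resembles the 62% figure above.

```python
def yes_no_report(gold, pred):
    """Summarize yes/no performance. gold and pred are equal-length lists of 'yes'/'no'."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    n = len(gold)

    def accuracy_on(label):
        idx = [i for i, g in enumerate(gold) if g == label]
        return sum(pred[i] == label for i in idx) / len(idx) if idx else float("nan")

    return {
        "accuracy": sum(g == p for g, p in zip(gold, pred)) / n,
        "always_yes_baseline": gold.count("yes") / n,   # score of answering 'yes' every time
        "yes_rate_predicted": pred.count("yes") / n,    # how often the model says 'yes'
        "accuracy_on_yes": accuracy_on("yes"),
        "accuracy_on_no": accuracy_on("no"),
    }

# Toy data: 6 gold 'yes' and 4 gold 'no'; the model says 'yes' 9 times out of 10.
gold = ["yes"] * 6 + ["no"] * 4
pred = ["yes"] * 6 + ["yes", "yes", "yes", "no"]
report = yes_no_report(gold, pred)
print(report)
```

In this toy case the overall accuracy of 0.7 looks acceptable, but it is only 0.1 above the always-yes baseline, and accuracy on gold "no" questions is 0.25.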

When Not To Use

Do not use Q-Bench numbers as sole proof of human-level low-level vision; MLLMs are not yet expert-level for fine-grained IQA or professional image forensics.

Avoid relying on raw MLLM descriptions for critical decisions without verification from domain-specific detectors or human experts.

Failure Modes

Yes/no bias: models overpredict 'yes' on judgment queries, lowering reliability for negative detections.

Token-choice sensitivity: IQA numeric extraction depends on top tokens (good/poor vs high/low); wrong token pair reduces correlation.
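
Token-choice sensitivity can be checked empirically by scoring the same images with each candidate token pair and comparing rank correlation (SRCC) and linear correlation (PLCC) against human opinion scores. The sketch below implements both in plain Python; the opinion scores and model scores are invented, and the labels "pair A" and "pair B" are placeholders that say nothing about which real token pair works better.

```python
from statistics import mean

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def plcc(x, y):
    """Pearson linear correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return plcc(average_ranks(x), average_ranks(y))

# Invented data: human mean opinion scores and model scores from two token pairs.
mos = [1.2, 2.5, 3.1, 4.0, 4.6]
scores_pair_a = [0.10, 0.30, 0.45, 0.70, 0.90]
scores_pair_b = [0.40, 0.35, 0.50, 0.45, 0.60]

for label, scores in (("pair A", scores_pair_a), ("pair B", scores_pair_b)):
    print(f"{label}: SRCC={srcc(mos, scores):.3f}  PLCC={plcc(mos, scores):.3f}")
```

Running this comparison on a small labeled sample before a pilot gives a direct check on whether the chosen token pair tracks human ratings for your image domain.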

Core Entities

Models

InternLM-XComposer-VL, LLaVA-v1.5, Qwen-VL, LLaMA-Adapter-V2, mPLUG-Owl, InstructBLIP, MiniGPT-4, Shikra, Otter-v1, VisualGLM-6B, Kosmos-2, GPT-4V

Metrics

Accuracy, SRCC (Spearman rank), PLCC (Pearson), GPT-scored completeness/preciseness/relevance (0–2)

Datasets

LLVisionQA (2,990), LLDescribe (499), KONiQ-10k, SPAQ, LIVE-FB, LIVE-itw, CGIQA-6K, AGIQA-3K, KADID-10K

Benchmarks

Q-Bench, LLVisionQA, LLDescribe, IQA evaluation (softmax strategy)