Q-Bench: a focused benchmark that tests multimodal LLMs on low-level image perception, description, and human-aligned quality scoring

Overview

Decision Snapshot: Ready For Pilot

The benchmark is practical and directly usable for evaluation. Results convincingly show current MLLM capabilities and limits, but practical deployment needs task-specific fine-tuning and prompt engineering.

Citations: 20

Evidence Strength: 0.80

Confidence: 0.78

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 50%

Novelty: 45%

Authors

Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, Weisi Lin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

MLLMs already detect many low-level image attributes and, with a simple softmax trick, can produce quality scores that correlate with human ratings. Businesses can use them for scalable, early-stage quality triage and content moderation, but should not replace specialist QA for fine-grained tasks.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

Q-Bench is a targeted benchmark that measures how multimodal large language models (MLLMs) handle low-level image tasks: (A1) perception (LLVisionQA, 2,990 images, multiple-choice), (A2) description (LLDescribe, 499 images, expert long descriptions), and (A3) quantitative image quality assessment (IQA) using a softmax pooling trick on top token logits. Evaluations (15 open-source MLLM variants + GPT-4V) show that MLLMs already capture basic low-level cues (many > random and some >60% accuracy on perception) but are unstable and imprecise on fine-grained judgments and detailed descriptions. A softmax-based extraction of logits improves numeric IQA correlations versus naive decoding. The paper

Problem Statement

Existing MLLM benchmarks focus on high-level vision tasks. There is no systematic, multimodal benchmark that tests whether MLLMs can perceive low-level image attributes (blur, noise, color, exposure), generate complete and precise low-level descriptions, and produce numeric quality scores aligned with human opinions. Q-Bench fills this gap with dedicated datasets and an evaluation pipeline.

Main Contribution

Q-Bench: a three-part benchmark to test MLLMs on low-level vision: perception, description, and quantitative assessment.

LLVisionQA: a 2,990-image perception dataset (multi-choice questions covering distortions, other attributes, global vs local, and three question types).

Key Findings

MLLMs show non-random perception ability but lag behind expert humans.

Numbers: InternLM-XComposer-VL overall accuracy 64.35%; GPT-4V 73.36%; Senior human 81.74% (LLVisionQA test)

Practical Use: MLLMs can already answer many low-level queries correctly. Expect workable but imperfect automated judgments; fine-tuning or task-specific prompts are still needed to reach expert human level.

Evidence Ref: Table 2

Low-level descriptive outputs are incomplete and imprecise on average.

Numbers: Best MLLM (InternLM) aggregate LLDescribe score 4.21/6; many models score ~3.0–3.9/6

Practical Use: Use MLLMs for rough textual summaries of low-level image traits, but do not rely on them for full, precise reports; consider hybrid pipelines with specialized detectors for critical attributes.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	InternLM-XComposer-VL 64.35%; many top open-source models ≈60–63%; GPT-4V 73.36%; Senior human 81.74%	random guess 37.94%	InternLM ≈ +26 percentage points over random; GPT-4V ≈ +35 percentage points	LLVisionQA test	Table 2 (Perception results)	Table 2
Description aggregate score (completeness+precision+relevance)	Best MLLM 4.21 / 6 (InternLM); many models ~3.0–3.9	no baseline; gold descriptions by experts	—	LLDescribe (499)	Table 3 (Description results)	Table 3
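The deltas in the accuracy row are absolute gains over the random-guess baseline, so they are best read as percentage points rather than relative percentages. A minimal sketch of that arithmetic, using only the figures quoted in the table (the function and variable names are illustrative, not from the paper):

```python
# Accuracy figures quoted in the Results table (LLVisionQA test, in percent).
RANDOM_GUESS = 37.94
accuracies = {
    "InternLM-XComposer-VL": 64.35,
    "GPT-4V": 73.36,
    "Senior human": 81.74,
}


def gain_over_random(accuracy, baseline=RANDOM_GUESS):
    """Absolute gain over the random-guess baseline, in percentage points."""
    return round(accuracy - baseline, 2)


gains = {name: gain_over_random(acc) for name, acc in accuracies.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points over random guess")
```

Read this way, the gap between GPT-4V and the senior human (about 8 points) is much smaller than the gap between either of them and random guessing.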

What To Try In 7 Days

Run a small pilot: feed your image set to an open MLLM and extract IQA scores via the softmax-on-top-two-tokens trick to rank images quickly.

Use LLVisionQA-style multi-choice prompts to rapidly validate model behavior on your critical low-level attributes and spot yes/no bias.

Combine MLLM descriptions with a specialized detector (blur/noise extractor) and compare outputs on a 100-image sample to identify major failure modes.
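The first pilot step can be sketched as follows. This is a minimal illustration of a softmax over two candidate-token logits, not the authors' implementation: the logit values and file names are invented, and how you read the logits for a token pair such as "good"/"poor" out of your particular model is assumed to be handled elsewhere.

```python
import math


def softmax_quality_score(logit_good, logit_poor):
    """Two-way softmax over the logits of a positive and a negative token.

    Returns the probability mass on the positive token, a score in [0, 1].
    """
    m = max(logit_good, logit_poor)  # subtract the max for numerical stability
    e_good = math.exp(logit_good - m)
    e_poor = math.exp(logit_poor - m)
    return e_good / (e_good + e_poor)


# Hypothetical (logit_good, logit_poor) pairs; in a real pilot these would be
# read from the model's next-token logits after a quality-rating prompt.
example_logits = {
    "img_001.jpg": (2.0, 0.5),
    "img_002.jpg": (0.1, 1.9),
    "img_003.jpg": (1.0, 1.0),
}

scores = {name: softmax_quality_score(g, p) for name, (g, p) in example_logits.items()}

# Rank images from highest to lowest predicted quality.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because the score is continuous, it can be used to rank images directly instead of relying on whichever single word the model happens to decode.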

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://q-future.github.io/Q-Bench

Data URLs

https://q-future.github.io/Q-Bench

Risks & Boundaries

Limitations

LLVisionQA contains 62% Yes questions, creating a yes-answer bias in evaluations; authors plan to add balanced reversed questions.

LLDescribe scoring relies on GPT as an automatic judge; GPT can hallucinate and introduce subjectivity despite 5-round voting.
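To make the judge-based scoring concrete, here is a rough sketch of how five rounds of 0–2 scores on the three dimensions could be combined into an aggregate out of 6. Averaging each dimension across rounds and then summing is an assumption made for illustration; the summary above does not spell out the exact combination rule, and the round scores below are invented.

```python
from statistics import mean

DIMENSIONS = ("completeness", "preciseness", "relevance")


def aggregate_description_score(rounds):
    """Average each 0-2 dimension over the judge rounds, then sum (max 6).

    Assumption: rounds are combined by a plain mean per dimension.
    """
    for r in rounds:
        for d in DIMENSIONS:
            if not 0 <= r[d] <= 2:
                raise ValueError(f"{d} score {r[d]} is outside the 0-2 range")
    per_dimension = {d: mean(r[d] for r in rounds) for d in DIMENSIONS}
    return per_dimension, sum(per_dimension.values())


# Hypothetical judge outputs for one description, five rounds.
example_rounds = [
    {"completeness": 1, "preciseness": 1, "relevance": 2},
    {"completeness": 1, "preciseness": 1, "relevance": 2},
    {"completeness": 2, "preciseness": 1, "relevance": 2},
    {"completeness": 1, "preciseness": 1, "relevance": 2},
    {"completeness": 1, "preciseness": 1, "relevance": 1},
]

per_dimension, total = aggregate_description_score(example_rounds)
print(per_dimension, total)
```

Disagreement between rounds (here on completeness and relevance) is a useful signal in itself: descriptions where the judge is inconsistent are good candidates for human review.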

When Not To Use

Do not use Q-Bench numbers as sole proof of human-level low-level vision; MLLMs are not yet expert-level for fine-grained IQA or professional image forensics.

Avoid relying on raw MLLM descriptions for critical decisions without verification from domain-specific detectors or human experts.

Failure Modes

Yes/no bias: models overpredict 'yes' on judgment queries, lowering reliability for negative detections.

Token-choice sensitivity: IQA numeric extraction depends on which token pair is read from the logits (good/poor vs high/low); a poorly chosen pair reduces correlation with human scores.
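The yes/no bias can be checked on your own data with a simple breakdown. The sketch below compares how often a model predicts "yes" with how often "yes" is actually correct, and reports accuracy separately for each ground-truth class; the function name and the sample records are hypothetical.

```python
def yes_no_report(records):
    """records: list of (ground_truth, prediction) pairs, each 'yes' or 'no'."""
    total = len(records)
    yes_truth = [r for r in records if r[0] == "yes"]
    no_truth = [r for r in records if r[0] == "no"]

    def accuracy(subset):
        if not subset:
            return float("nan")
        return sum(gt == pred for gt, pred in subset) / len(subset)

    return {
        "predicted_yes_rate": sum(pred == "yes" for _, pred in records) / total,
        "true_yes_rate": len(yes_truth) / total,
        "accuracy_on_yes": accuracy(yes_truth),
        "accuracy_on_no": accuracy(no_truth),
    }


# Hypothetical results: 6 questions whose answer is 'yes', 4 whose answer is 'no'.
sample = (
    [("yes", "yes")] * 6
    + [("no", "yes")] * 2
    + [("no", "no")] * 2
)

report = yes_no_report(sample)
print(report)
```

A predicted "yes" rate well above the true "yes" rate, together with low accuracy on "no" questions, is the signature of the bias described above.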

Core Entities

Models

InternLM-XComposer-VL, LLaVA-v1.5, Qwen-VL, LLaMA-Adapter-V2, mPLUG-Owl, InstructBLIP, MiniGPT-4, Shikra, Otter-v1, VisualGLM-6B, Kosmos-2, GPT-4V

Metrics

Accuracy, SRCC (Spearman rank), PLCC (Pearson), GPT-scored completeness/preciseness/relevance (0–2)
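For readers less familiar with the two correlation metrics, the following is a small dependency-free sketch of their textbook definitions: PLCC measures linear agreement between model scores and human opinion scores, and SRCC is the same computation applied to ranks. The example scores are invented, and any extra fitting step a given IQA protocol applies before computing PLCC is not modelled here.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human opinion scores and model scores for five images.
human_scores = [1.2, 2.5, 3.1, 4.0, 4.8]
model_scores = [0.10, 0.35, 0.30, 0.70, 0.90]

srcc = spearman(human_scores, model_scores)
plcc = pearson(human_scores, model_scores)
print(f"SRCC={srcc:.3f}, PLCC={plcc:.3f}")
```

SRCC only cares about ordering, which makes it the more natural metric when model scores are used to rank images rather than to predict absolute quality values.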

Datasets

LLVisionQA (2,990), LLDescribe (499), KONiQ-10k, SPAQ, LIVE-FB, LIVE-itw, CGIQA-6K, AGIQA-3K, KADID-10K

Benchmarks

Q-Bench, LLVisionQA, LLDescribe, IQA evaluation (softmax strategy)
