Q-Bench: a focused benchmark that tests multimodal LLMs on low-level image perception, description, and human-aligned quality scoring

September 25, 2023 (8 min read)

Overview

Production Readiness

0.5

Novelty Score

0.45

Cost Impact Score

0.3

Citation Count

20

Authors

Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, Weisi Lin

Why It Matters For Business

MLLMs already detect many low-level image attributes and can produce human-correlated quality scores with a simple softmax trick; businesses can use them for scalable, early-stage quality triage and content moderation but should not replace specialist QA for fine-grained tasks.

Summary TLDR

Q-Bench is a targeted benchmark that measures how multimodal large language models (MLLMs) handle low-level image tasks: (A1) perception (LLVisionQA, 2,990 images, multiple-choice), (A2) description (LLDescribe, 499 images, expert long descriptions), and (A3) quantitative image quality assessment (IQA) using softmax pooling over the logits of the top two candidate tokens. Evaluations of 15 open-source MLLM variants plus GPT-4V show that MLLMs already capture basic low-level cues (many beat random guessing, and some exceed 60% accuracy on perception) but are unstable and imprecise on fine-grained judgments and detailed descriptions. Extracting scores with a softmax over logits improves numeric IQA correlations compared with naive argmax decoding. The paper

Problem Statement

Existing MLLM benchmarks focus on high-level vision tasks. There is no systematic, multimodal benchmark that tests whether MLLMs can perceive low-level image attributes (blur, noise, color, exposure), generate complete and precise low-level descriptions, and produce numeric quality scores aligned with human opinions. Q-Bench fills this gap with dedicated datasets and an evaluation pipeline.

Main Contribution

Q-Bench: a three-part benchmark to test MLLMs on low-level vision: perception, description, and quantitative assessment.

LLVisionQA: a 2,990-image perception dataset (multi-choice questions covering distortions, other attributes, global vs local, and three question types).

LLDescribe and IQA protocol: 499 expert-written low-level descriptions plus a GPT-assisted three-dimension scoring process and a softmax-logit strategy to extract numeric quality scores.

Key Findings

MLLMs show non-random perception ability but lag behind expert humans.

Numbers: InternLM-XComposer-VL overall accuracy 64.35%; GPT-4V 73.36%; Senior human 81.74% (LLVisionQA test)

Low-level descriptive outputs are incomplete and imprecise on average.

Numbers: Best MLLM (InternLM) aggregate LLDescribe score 4.21/6; many models score ~3.0–3.9/6

Softmax pooling of the two top token logits yields much stronger numeric IQA correlation than argmax decoding.

Numbers: LLaVA-v1 SRCC on KONiQ-10k: argmax 0.038 → softmax 0.462; similar gains across models/datasets
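A minimal sketch of the softmax step, assuming you can already read the model's logits for two opposing quality tokens at the answer position. The names `logit_good` and `logit_poor` and the toy values are illustrative assumptions; how the logits are obtained depends on the model and is not shown.

```python
import math

def softmax_quality_score(logit_good: float, logit_poor: float) -> float:
    """Probability mass on the positive token after a two-way softmax."""
    # Subtract the max for numerical stability before exponentiating.
    m = max(logit_good, logit_poor)
    e_good = math.exp(logit_good - m)
    e_poor = math.exp(logit_poor - m)
    return e_good / (e_good + e_poor)

def argmax_quality_score(logit_good: float, logit_poor: float) -> float:
    """Hard decision: 1.0 if the positive token wins, else 0.0."""
    return 1.0 if logit_good > logit_poor else 0.0

# Toy logits (assumed values): both images decode to the positive token,
# so argmax cannot separate them, while softmax still ranks them.
logits = {"img_a": (2.0, 1.9), "img_b": (5.0, 0.5)}
soft = {k: softmax_quality_score(g, p) for k, (g, p) in logits.items()}
hard = {k: argmax_quality_score(g, p) for k, (g, p) in logits.items()}
```

Because argmax collapses every image onto one of two values, it produces many ties and little ranking signal, which is one plausible reason the continuous softmax score correlates better with human opinion.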

MLLMs exhibit a systematic 'yes' bias on yes/no perception queries.

Numbers: for example, IDEFICS-Instruct reaches 88.65% accuracy on 'yes' questions vs 13.09% on 'no' questions
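One way to surface this bias on your own data is to split accuracy by the gold answer. The sketch below assumes a list of `(gold, predicted)` string pairs; the record format and the toy data are hypothetical.

```python
from collections import defaultdict

def accuracy_by_gold_answer(records):
    """records: iterable of (gold, predicted) strings, e.g. ("yes", "no")."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred in records:
        g = gold.strip().lower()
        total[g] += 1
        if pred.strip().lower() == g:
            correct[g] += 1
    return {g: correct[g] / total[g] for g in total}

# Toy data: a model that answers "yes" to almost everything.
toy = (
    [("yes", "yes")] * 9 + [("yes", "no")] * 1
    + [("no", "yes")] * 8 + [("no", "no")] * 2
)
per_class = accuracy_by_gold_answer(toy)
```

A large gap between the two per-class accuracies, as in the toy data, indicates that the overall accuracy is inflated by the share of 'yes' questions.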

Results

Accuracy

Value: InternLM-XComposer-VL 64.35%; many top open-source models ≈60–63%; GPT-4V 73.36%; Senior human 81.74%

Baseline: random guess 37.94%
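A random-guess baseline above 25% is what you would expect when a test mixes two-option and multi-option questions. Assuming that reading (the summary does not state how the figure was computed), the expected accuracy is the mean of 1/k over questions, where k is each question's option count. The mix below is hypothetical and is not the LLVisionQA distribution.

```python
def expected_random_accuracy(option_counts):
    """Mean of 1/k over questions, where k is each question's option count."""
    if not option_counts:
        raise ValueError("need at least one question")
    return sum(1.0 / k for k in option_counts) / len(option_counts)

# Hypothetical mix: 4 yes/no questions, 3 three-option, 3 four-option.
toy_counts = [2] * 4 + [3] * 3 + [4] * 3
baseline = expected_random_accuracy(toy_counts)
```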

Description aggregate score (completeness+precision+relevance)

Value: Best MLLM 4.21 / 6 (InternLM); many models ~3.0–3.9

Baseline: no baseline; gold descriptions by experts
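The aggregate out of 6 follows from three dimensions scored 0–2 each. A minimal sketch of that arithmetic, assuming the judge's rounds are averaged per dimension before summing (the exact pooling rule is an assumption, and the scores are made up):

```python
from statistics import mean

DIMENSIONS = ("completeness", "precision", "relevance")

def aggregate_description_score(rounds):
    """rounds: list of dicts mapping each dimension to a 0-2 judge score."""
    for r in rounds:
        for d in DIMENSIONS:
            if not 0 <= r[d] <= 2:
                raise ValueError(f"{d} score out of range: {r[d]}")
    # Average each dimension over the rounds, then sum to a 0-6 total.
    per_dim = {d: mean(r[d] for r in rounds) for d in DIMENSIONS}
    return per_dim, sum(per_dim.values())

# Five hypothetical judge rounds for one model description.
rounds = [
    {"completeness": 1, "precision": 1, "relevance": 2},
    {"completeness": 1, "precision": 2, "relevance": 2},
    {"completeness": 2, "precision": 1, "relevance": 2},
    {"completeness": 1, "precision": 1, "relevance": 2},
    {"completeness": 1, "precision": 1, "relevance": 1},
]
per_dim, total = aggregate_description_score(rounds)
```

Keeping the per-dimension means alongside the total makes it easier to see whether a low score comes from missing content or from inaccurate content.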

IQA correlation (SRCC/PLCC average)

Value: InternLM-XComposer-VL avg SRCC/PLCC 0.541 / 0.581

Baseline: CLIP-ViT-Large-14 avg 0.354 / 0.368; NIQE avg 0.387 / 0.398
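For readers who want to compute these correlations on their own scores, here is a dependency-free sketch of PLCC (Pearson) and SRCC (Spearman, computed as Pearson on average ranks). The predicted scores and mean opinion scores below are made-up values, and any fitting step a given benchmark applies before PLCC is not modeled.

```python
import math

def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson applied to average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

predicted = [0.91, 0.35, 0.62, 0.10, 0.78]  # e.g. model quality scores
mos = [4.5, 2.1, 3.9, 1.2, 3.6]             # hypothetical human opinion scores
srcc = spearman(predicted, mos)
plcc = pearson(predicted, mos)
```

SRCC only looks at ordering, so it is the more relevant number when the goal is ranking images for triage rather than predicting absolute scores.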

Softmax vs Argmax for IQA

Value: Example: LLaVA-v1 SRCC on KONiQ-10k argmax 0.038 → softmax 0.462

Baseline: argmax token decoding

Who Should Care

What To Try In 7 Days

Run a small pilot: feed your image set to an open MLLM and extract IQA scores via the softmax-on-top-two-tokens trick to rank images quickly.

Use LLVisionQA-style multi-choice prompts to rapidly validate model behavior on your critical low-level attributes and spot yes/no bias.

Combine MLLM descriptions with a specialized detector (blur/noise extractor) and compare outputs on a 100-image sample to identify major failure modes.
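For the multi-choice check, a small helper can format questions with lettered options and map replies back to an option. This is a hypothetical sketch: the prompt wording is an assumption and is not the exact LLVisionQA template, and `parse_choice` is an illustrative helper, not part of any benchmark tooling.

```python
import string

def build_multichoice_prompt(question, options):
    """Format a question and its options as lettered choices (A, B, C, ...)."""
    if not 2 <= len(options) <= len(string.ascii_uppercase):
        raise ValueError("need between 2 and 26 options")
    lines = [question, "Choose one of the following options:"]
    lines += [f"{letter}. {opt}"
              for letter, opt in zip(string.ascii_uppercase, options)]
    lines.append("Answer with the option letter only.")
    return "\n".join(lines)

def parse_choice(reply, options):
    """Return the index of the chosen option, or None if none was chosen."""
    text = reply.strip()
    letters = string.ascii_uppercase[:len(options)]
    # Case 1: reply starts with a valid letter that stands alone ("B", "B.").
    if text and text[0].upper() in letters:
        if len(text) == 1 or not text[1].isalnum():
            return letters.index(text[0].upper())
    # Case 2: reply repeats the option text itself.
    lowered = text.lower().rstrip(".")
    for i, opt in enumerate(options):
        if opt.lower() == lowered:
            return i
    # Anything else (including invented options) counts as no answer.
    return None

options = ["Sharp", "Blurry"]
prompt = build_multichoice_prompt("Is this image sharp or blurry?", options)
```

Returning `None` for replies outside the offered set lets you count how often a model fails to pick an option, separately from how often it picks the wrong one.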

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • LLVisionQA contains 62% Yes questions, creating a yes-answer bias in evaluations; authors plan to add balanced reversed questions.
  • LLDescribe scoring relies on GPT as an automatic judge; GPT can hallucinate and introduce subjectivity despite 5-round voting.
  • Some evaluation slices (the test subset) are kept private to prevent data contamination, which currently limits full external reproducibility.

When Not To Use

  • Do not use Q-Bench numbers as sole proof of human-level low-level vision; MLLMs are not yet expert-level for fine-grained IQA or professional image forensics.
  • Avoid relying on raw MLLM descriptions for critical decisions without verification from domain-specific detectors or human experts.

Failure Modes

  • Yes/no bias: models overpredict 'yes' on judgment queries, lowering reliability for negative detections.
  • Token-choice sensitivity: IQA numeric extraction depends on which token pair is compared (good/poor vs high/low); a poorly chosen pair reduces correlation.
  • Prompt failure on some models (e.g. Kosmos-2), which append new options instead of selecting one; this requires prompt engineering or closed-set perplexity ranking.
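Closed-set ranking sidesteps free-form generation by scoring each offered option and picking the one the model finds most likely. A minimal sketch, assuming you can obtain per-token natural-log probabilities for each option's text given the prompt; the option strings and values here are made up.

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean log-probability of an option's tokens."""
    if not token_logprobs:
        raise ValueError("option has no tokens")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_closed_set_answer(option_logprobs):
    """option_logprobs: dict mapping option text -> list of token log-probs."""
    return min(option_logprobs,
               key=lambda opt: perplexity(option_logprobs[opt]))

# Hypothetical per-token log-probs for three offered options.
scores = {
    "A. Blurry": [-0.2, -0.4],
    "B. Sharp": [-1.5, -0.9, -1.1],
    "C. Noisy": [-2.3, -0.7],
}
choice = pick_closed_set_answer(scores)
```

Using the mean rather than the sum of log-probabilities keeps longer options from being penalized simply for having more tokens.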

Core Entities

Models

  • InternLM-XComposer-VL
  • LLaVA-v1.5
  • Qwen-VL
  • LLaMA-Adapter-V2
  • mPLUG-Owl
  • InstructBLIP
  • MiniGPT-4
  • Shikra
  • Otter-v1
  • VisualGLM-6B
  • Kosmos-2
  • GPT-4V

Metrics

  • Accuracy
  • SRCC (Spearman rank correlation coefficient)
  • PLCC (Pearson linear correlation coefficient)
  • GPT-scored completeness/preciseness/relevance (0–2 each)

Datasets

  • LLVisionQA (2,990)
  • LLDescribe (499)
  • KONiQ-10k
  • SPAQ
  • LIVE-FB
  • LIVE-itw
  • CGIQA-6K
  • AGIQA-3K
  • KADID-10K

Benchmarks

  • Q-Bench
  • LLVisionQA
  • LLDescribe
  • IQA evaluation (softmax strategy)