Overview
Production Readiness
0.5
Novelty Score
0.45
Cost Impact Score
0.3
Citation Count
20
Why It Matters For Business
MLLMs already detect many low-level image attributes and can produce human-correlated quality scores by applying a simple softmax to token logits. Businesses can use them for scalable, early-stage quality triage and content moderation, but should not use them to replace specialist QA on fine-grained tasks.
Summary TLDR
Q-Bench is a targeted benchmark that measures how multimodal large language models (MLLMs) handle low-level image tasks: (A1) perception (LLVisionQA, 2,990 images, multiple-choice), (A2) description (LLDescribe, 499 images, expert long descriptions), and (A3) quantitative image quality assessment (IQA) using a softmax pooling trick on top token logits. Evaluations (15 open-source MLLM variants + GPT-4V) show that MLLMs already capture basic low-level cues (many > random and some >60% accuracy on perception) but are unstable and imprecise on fine-grained judgments and detailed descriptions. A softmax-based extraction of logits improves numeric IQA correlations versus naive decoding. The paper
Problem Statement
Existing MLLM benchmarks focus on high-level vision tasks. There is no systematic, multimodal benchmark that tests whether MLLMs can perceive low-level image attributes (blur, noise, color, exposure), generate complete and precise low-level descriptions, and produce numeric quality scores aligned with human opinions. Q-Bench fills this gap with dedicated datasets and an evaluation pipeline.
Main Contribution
Q-Bench: a three-part benchmark to test MLLMs on low-level vision: perception, description, and quantitative assessment.
LLVisionQA: a 2,990-image perception dataset (multi-choice questions covering distortions, other attributes, global vs local, and three question types).
LLDescribe and IQA protocol: 499 expert-written low-level descriptions plus a GPT-assisted three-dimension scoring process and a softmax-logit strategy to extract numeric quality scores.
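The three-dimension description scoring can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each judging round yields an integer score from 0 to 2 for completeness, preciseness, and relevance, and it assumes rounds are combined by a plain mean per dimension followed by a sum. The function name and the input layout are hypothetical.

```python
from statistics import mean

DIMENSIONS = ("completeness", "preciseness", "relevance")

def aggregate_description_score(rounds):
    """Combine several judging rounds into per-dimension means and a total.

    rounds: list of dicts, each mapping every dimension to a score in {0, 1, 2}.
    The averaging rule is an assumption made for illustration.
    """
    if not rounds:
        raise ValueError("need at least one judging round")
    result = {}
    for dim in DIMENSIONS:
        scores = [r[dim] for r in rounds]
        if any(s not in (0, 1, 2) for s in scores):
            raise ValueError(f"{dim} scores must be 0, 1, or 2")
        result[dim] = mean(scores)
    # Total ranges from 0 to 6 under this aggregation rule.
    result["total"] = sum(result[d] for d in DIMENSIONS)
    return result

# Hypothetical judge outputs for one description across five rounds.
example_rounds = [
    {"completeness": 1, "preciseness": 2, "relevance": 2},
    {"completeness": 1, "preciseness": 2, "relevance": 2},
    {"completeness": 2, "preciseness": 1, "relevance": 2},
    {"completeness": 1, "preciseness": 2, "relevance": 2},
    {"completeness": 1, "preciseness": 2, "relevance": 2},
]
example_result = aggregate_description_score(example_rounds)
```

Averaging over repeated rounds dampens single-round judge noise, but it does not remove systematic judge errors.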
Key Findings
MLLMs show non-random perception ability but lag behind expert humans.
Low-level descriptive outputs are incomplete and imprecise on average.
Softmax pooling over the logits of the top two tokens yields a much stronger numeric IQA correlation than argmax decoding.
MLLMs exhibit a systematic 'yes' bias on yes/no perception queries.
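The softmax finding above can be illustrated with a minimal sketch. It assumes you can read the model's next-token logits for two contrasting quality words (for example "good" and "poor") at the answer position. The function names and the logit values are hypothetical and not tied to any particular model API.

```python
import math

def softmax_quality_score(logit_good: float, logit_poor: float) -> float:
    """Softmax probability of the 'good' token over the two candidate tokens.

    Returns a continuous score strictly between 0 and 1.
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(logit_good, logit_poor)
    e_good = math.exp(logit_good - m)
    e_poor = math.exp(logit_poor - m)
    return e_good / (e_good + e_poor)

def argmax_quality_score(logit_good: float, logit_poor: float) -> float:
    """Baseline: keep only which token wins, discarding the margin."""
    return 1.0 if logit_good > logit_poor else 0.0

# Hypothetical (good, poor) logits for three images.
example_logits = [(3.1, 0.4), (1.2, 1.0), (-0.5, 2.0)]
soft_scores = [softmax_quality_score(g, p) for g, p in example_logits]
hard_scores = [argmax_quality_score(g, p) for g, p in example_logits]
```

In this toy input the first two images get the same argmax score, while the softmax scores still separate them, which is why a continuous score can correlate better with graded human opinions.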
Results
Accuracy
Description aggregate score (completeness + preciseness + relevance)
IQA correlation (SRCC/PLCC average)
Softmax vs Argmax for IQA
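For reference, the two correlation metrics can be computed as in the sketch below. It is a dependency-free illustration with hypothetical function names. It computes PLCC directly on the raw scores, with no fitting or remapping step beforehand, and handles ties in SRCC with average ranks.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical model scores and human mean opinion scores for five images.
model_scores = [0.91, 0.40, 0.75, 0.10, 0.55]
human_mos = [4.5, 2.1, 3.9, 1.2, 3.0]
srcc = spearman(model_scores, human_mos)
plcc = pearson(model_scores, human_mos)
```

SRCC only checks whether the ordering of images agrees with human opinion, while PLCC also rewards a linear relationship between the two score scales.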
Who Should Care
What To Try In 7 Days
Run a small pilot: feed your image set to an open MLLM and extract IQA scores via the softmax-on-top-two-tokens trick to rank images quickly.
Use LLVisionQA-style multi-choice prompts to rapidly validate model behavior on your critical low-level attributes and spot yes/no bias.
Combine MLLM descriptions with a specialized detector (blur/noise extractor) and compare outputs on a 100-image sample to identify major failure modes.
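One way to spot the yes/no bias during such a pilot is to compare how often the model answers "yes" with how often "yes" is actually correct, and to split accuracy by the true answer. The sketch below is a hypothetical helper. It assumes answers have already been parsed into "yes" or "no" strings.

```python
def yes_no_bias_report(predictions, labels):
    """Summarize yes-bias for parsed yes/no answers.

    predictions, labels: equal-length lists of 'yes' / 'no' strings.
    Returns rates as floats, or None where a subset is empty.
    """
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need two non-empty lists of equal length")
    preds = [p.strip().lower() for p in predictions]
    truth = [t.strip().lower() for t in labels]

    def accuracy_where(answer):
        # Accuracy restricted to questions whose true answer is `answer`.
        pairs = [(p, t) for p, t in zip(preds, truth) if t == answer]
        if not pairs:
            return None
        return sum(p == t for p, t in pairs) / len(pairs)

    n = len(truth)
    return {
        "yes_rate_pred": sum(p == "yes" for p in preds) / n,
        "yes_rate_true": sum(t == "yes" for t in truth) / n,
        "acc_on_yes": accuracy_where("yes"),
        "acc_on_no": accuracy_where("no"),
    }

# Hypothetical pilot results on four yes/no questions.
report = yes_no_bias_report(
    ["yes", "yes", "Yes", "no"],
    ["yes", "no", "yes", "no"],
)
```

A predicted yes-rate well above the true yes-rate, together with much lower accuracy on "no" questions, suggests the model is leaning on "yes" rather than on the image.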
Reproducibility
Code Urls
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- 62% of LLVisionQA's yes/no questions have "Yes" as the correct answer, so evaluations favor models that lean toward "yes"; the authors plan to add balanced reversed questions.
- LLDescribe scoring relies on GPT as an automatic judge; GPT can hallucinate and introduce subjectivity despite 5-round voting.
- Some evaluation slices are private (test subset) to prevent data contamination; that limits full external reproducibility now.
When Not To Use
- Do not use Q-Bench numbers as sole proof of human-level low-level vision; MLLMs are not yet expert-level for fine-grained IQA or professional image forensics.
- Avoid relying on raw MLLM descriptions for critical decisions without verification from domain-specific detectors or human experts.
Failure Modes
- Yes/no bias: models overpredict 'yes' on judgment queries, lowering reliability for negative detections.
- Token-choice sensitivity: numeric IQA extraction depends on which token pair is compared (e.g., good/poor vs. high/low); a poorly chosen pair reduces correlation.
- Prompt failure on some models (e.g., Kosmos-2), which append new options instead of selecting one; this requires prompt engineering or closed-set perplexity ranking.
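The perplexity-ranking fallback mentioned in the last item can be sketched as follows. The summary does not spell out the exact procedure, so this is an assumption: each candidate option is scored by the model as a continuation of the prompt, and the option with the lowest perplexity wins. The input format (per-token natural-log probabilities for each option) and the function names are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one option: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("an option needs at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_option(option_logprobs):
    """Return the option text with the lowest perplexity.

    option_logprobs: dict mapping option text to its list of token log-probs,
    obtained by scoring that option as a continuation of the prompt.
    """
    if not option_logprobs:
        raise ValueError("need at least one option")
    return min(option_logprobs, key=lambda o: perplexity(option_logprobs[o]))

# Hypothetical log-probs for a three-way multiple-choice question.
example_options = {
    "Blurry": [-0.2, -0.1],
    "Sharp": [-1.5, -2.0],
    "Overexposed": [-2.2, -1.9, -2.4],
}
chosen = pick_option(example_options)
```

Because the model never generates free text in this setup, it cannot append new options. Using the mean rather than the summed log-probability avoids penalizing longer option texts.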
Core Entities
Models
- InternLM-XComposer-VL
- LLaVA-v1.5
- Qwen-VL
- LLaMA-Adapter-V2
- mPLUG-Owl
- InstructBLIP
- MiniGPT-4
- Shikra
- Otter-v1
- VisualGLM-6B
- Kosmos-2
- GPT-4V
Metrics
- Accuracy
- SRCC (Spearman rank-order correlation coefficient)
- PLCC (Pearson linear correlation coefficient)
- GPT-scored completeness/preciseness/relevance (0–2)
Datasets
- LLVisionQA (2,990)
- LLDescribe (499)
- KonIQ-10k
- SPAQ
- LIVE-FB
- LIVE-itw
- CGIQA-6K
- AGIQA-3K
- KADID-10k
Benchmarks
- Q-Bench
- LLVisionQA
- LLDescribe
- IQA evaluation (softmax strategy)

