Overview
The benchmark is practical and directly usable for evaluation. Results convincingly show current MLLM capabilities and limits, but practical deployment needs task-specific fine-tuning and prompt engineering.
Citations: 20
Evidence Strength: 0.80
Confidence: 0.78
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 30%
Production readiness: 50%
Novelty: 45%
Why It Matters For Business
MLLMs already detect many low-level image attributes and can produce quality scores that correlate with human ratings using a simple softmax over answer-token logits. Businesses can use them for scalable, early-stage quality triage and content moderation, but should not replace specialist QA for fine-grained tasks.
Summary TLDR
Q-Bench is a targeted benchmark that measures how multimodal large language models (MLLMs) handle low-level image tasks: (A1) perception (LLVisionQA, 2,990 images, multiple-choice), (A2) description (LLDescribe, 499 images, expert long descriptions), and (A3) quantitative image quality assessment (IQA), with scores extracted by applying a softmax to the logits of the top candidate tokens. Evaluations of 15 open-source MLLM variants plus GPT-4V show that MLLMs already capture basic low-level cues (many beat random guessing, and some exceed 60% accuracy on perception) but are unstable and imprecise on fine-grained judgments and detailed descriptions. Softmax-based score extraction from logits yields higher correlation with human quality ratings than naive decoding. The paper
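The softmax extraction mentioned above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes you can already read the model's logits for two candidate answer tokens (here a "good"/"poor" pair) at the answer position, and the function name and example logit values are hypothetical.

```python
import math

def softmax_quality_score(logit_good: float, logit_poor: float) -> float:
    """Two-way softmax over the 'good' and 'poor' logits; returns P(good) in (0, 1)."""
    m = max(logit_good, logit_poor)  # subtract the max for numerical stability
    e_good = math.exp(logit_good - m)
    e_poor = math.exp(logit_poor - m)
    return e_good / (e_good + e_poor)

# Hypothetical logits per image as (logit_good, logit_poor); made-up values for illustration.
example_logits = {"img_a": (3.1, 0.4), "img_b": (0.2, 1.9)}

scores = {name: softmax_quality_score(g, p) for name, (g, p) in example_logits.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # highest predicted quality first
```

The score is continuous, so images can be ranked even when the decoded text answer would be the same word for all of them.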
Problem Statement
Existing MLLM benchmarks focus on high-level vision tasks. There is no systematic, multimodal benchmark that tests whether MLLMs can perceive low-level image attributes (blur, noise, color, exposure), generate complete and precise low-level descriptions, and produce numeric quality scores aligned with human opinions. Q-Bench fills this gap with dedicated datasets and an evaluation pipeline.
Main Contribution
Q-Bench: a three-part benchmark to test MLLMs on low-level vision: perception, description, and quantitative assessment.
LLVisionQA: a 2,990-image perception dataset (multi-choice questions covering distortions, other attributes, global vs local, and three question types).
Key Findings
MLLMs show non-random perception ability but lag behind expert humans.
Low-level descriptive outputs are incomplete and imprecise on average.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | InternLM-XComposer-VL 64.35%; many top open-source models ≈60–63%; GPT-4V 73.36%; Senior human 81.74% | random guess 37.94% | InternLM-XComposer-VL +26.41 pp over random; GPT-4V +35.42 pp over random | LLVisionQA test | Table 2 (Perception results) | Table 2 |
| Description aggregate score (completeness+precision+relevance) | Best MLLM 4.21 / 6 (InternLM); many models ~3.0–3.9 | no baseline; gold descriptions by experts | — | LLDescribe (499) | Table 3 (Description results) | Table 3 |
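
The gap to the random-guess baseline in the accuracy row is a difference of two percentages, so it is best read in percentage points (pp). A small worked check using only the figures from the table (the helper name `delta_pp` is illustrative):

```python
def delta_pp(accuracy_pct: float, baseline_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(accuracy_pct - baseline_pct, 2)

RANDOM_GUESS = 37.94  # random-guess accuracy on the LLVisionQA test split (from the table)

accuracies = {
    "InternLM-XComposer-VL": 64.35,
    "GPT-4V": 73.36,
    "Senior human": 81.74,
}

deltas = {name: delta_pp(acc, RANDOM_GUESS) for name, acc in accuracies.items()}
# Gap between the best listed model (GPT-4V) and the senior human, also in pp.
human_gap = delta_pp(accuracies["Senior human"], accuracies["GPT-4V"])
```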
What To Try In 7 Days
Run a small pilot: feed your image set to an open MLLM and extract IQA scores via the softmax-on-top-two-tokens trick to rank images quickly.
Use LLVisionQA-style multi-choice prompts to rapidly validate model behavior on your critical low-level attributes and spot yes/no bias.
Combine MLLM descriptions with a specialized detector (blur/noise extractor) and compare outputs on a 100-image sample to identify major failure modes.
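For the multi-choice validation step, a pilot harness might look like the sketch below. The prompt layout, the instruction sentence, and both function names are assumptions made for illustration, not the benchmark's exact format; the model call itself is left out because it depends on the MLLM you pick.

```python
from collections import Counter

def build_prompt(question: str, options: list[str]) -> str:
    """Format a question and its options as a lettered multiple-choice prompt."""
    letters = "ABCDEFGH"
    lines = [question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the correct option.")  # assumed wording
    return "\n".join(lines)

def yes_rate(answers: list[str]) -> float:
    """Share of 'yes' among yes/no answers (case and whitespace insensitive)."""
    counts = Counter(a.strip().lower() for a in answers)
    total = counts["yes"] + counts["no"]
    return counts["yes"] / total if total else 0.0

prompt = build_prompt("Is this image blurry?", ["Yes", "No"])
# Hypothetical model answers on yes/no questions, for illustration only.
model_yes_rate = yes_rate(["Yes", "no", "YES", "Yes "])
```

Comparing the model's yes-rate with the share of questions whose correct answer is "yes" in your own sample gives a quick signal of yes/no bias: a model rate well above the ground-truth share suggests over-prediction of "yes".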
Risks & Boundaries
Limitations
LLVisionQA contains 62% Yes questions, creating a yes-answer bias in evaluations; authors plan to add balanced reversed questions.
LLDescribe scoring relies on GPT as an automatic judge; GPT can hallucinate and introduce subjectivity despite 5-round voting.
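One way to make judge instability visible is to keep the per-round scores and report their spread alongside the mean. The sketch below assumes each of the three dimensions is scored on a 0 to 2 scale per round (consistent with the aggregate being out of 6, but an assumption here), and the round data is invented for illustration.

```python
from statistics import mean

DIMENSIONS = ("completeness", "precision", "relevance")

def aggregate_rounds(rounds: list[dict]) -> tuple[dict, float]:
    """Average each dimension over judge rounds and report max-min spread per dimension."""
    summary = {}
    for dim in DIMENSIONS:
        values = [r[dim] for r in rounds]
        summary[dim] = {"mean": mean(values), "spread": max(values) - min(values)}
    total = sum(summary[dim]["mean"] for dim in DIMENSIONS)  # aggregate out of 6 under the 0-2 assumption
    return summary, total

# Five hypothetical judge rounds for one description (made-up scores).
example_rounds = [{"completeness": 1, "precision": 2, "relevance": 2}] * 4 + [
    {"completeness": 2, "precision": 1, "relevance": 2}
]
summary, total = aggregate_rounds(example_rounds)
```

Descriptions whose spread is non-zero on a dimension are natural candidates for a human spot check.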
When Not To Use
Do not use Q-Bench numbers as sole proof of human-level low-level vision; MLLMs are not yet expert-level for fine-grained IQA or professional image forensics.
Avoid relying on raw MLLM descriptions for critical decisions without verification from domain-specific detectors or human experts.
Failure Modes
Yes/no bias: models overpredict 'yes' on judgment queries, lowering reliability for negative detections.
Token-choice sensitivity: IQA numeric extraction depends on which token pair is read out (e.g., good/poor vs high/low); a poorly matched pair lowers correlation with human scores.
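Token-pair sensitivity can be checked on a small labeled sample before committing to one pair. The sketch below scores each image with two candidate pairs and compares Spearman rank correlation against human ratings. All logits and ratings are toy values invented for illustration, and the rank helper assumes no ties.

```python
import math

def pair_score(logits: dict, pos: str, neg: str) -> float:
    """Two-way softmax probability of the positive token over a (pos, neg) pair."""
    m = max(logits[pos], logits[neg])
    e_pos, e_neg = math.exp(logits[pos] - m), math.exp(logits[neg] - m)
    return e_pos / (e_pos + e_neg)

def ranks(values: list[float]) -> list[int]:
    """1-based ranks in ascending order; assumes all values are distinct."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy per-image logits and human opinion scores (illustrative only).
toy_logits = [
    {"good": 2.0, "poor": 0.0, "high": 0.5, "low": 1.0},
    {"good": 1.0, "poor": 1.0, "high": 1.5, "low": 0.2},
    {"good": 0.0, "poor": 2.0, "high": 1.0, "low": 1.0},
]
human_scores = [4.5, 3.0, 1.5]

srcc = {
    f"{pos}/{neg}": spearman([pair_score(l, pos, neg) for l in toy_logits], human_scores)
    for pos, neg in [("good", "poor"), ("high", "low")]
}
best_pair = max(srcc, key=srcc.get)
```

On real data the sample should be large enough for the correlation to be stable, and a tie-aware implementation would be preferable.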

