Overview
The benchmark and LLM-driven metric are ready for adoption in evaluation pipelines and are backed by strong correlation with human judgment, but they require running an LLM judge and careful prompt design.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
VALOR-EVAL helps detect and quantify factual errors in vision-language outputs while measuring how much of an image models actually describe, so product teams can choose between precise vs. comprehensive models and set verification policies accordingly.
Summary TLDR
The authors release VALOR-BENCH, a human-annotated dataset that tests hallucinations in large vision-language models (LVLMs) across three dimensions: object existence, attributes (color/count), and relations (positional/comparative). They introduce VALOR-EVAL, a two-stage evaluation that uses an LLM (GPT-4) to extract features from model captions and semantically match them to the ground truth, then computes faithfulness (precision) and coverage (recall). On 10 LVLMs, VALOR-EVAL correlates strongly with human judgment and reveals trade-offs: some models (e.g., Emu2) are very faithful but sparse, while others (e.g., GPT-4V) cover more but hallucinate more. The code and dataset are available on GitHub.
Problem Statement
Current benchmarks focus mostly on object-existence hallucinations and rely on fixed vocabularies. They miss attribute and relation errors and penalize informative captions by rewarding only precision. We need an open‑vocabulary, multi‑dimensional evaluation that measures both hallucination (faithfulness) and how much of the image a model describes (coverage).
Main Contribution
VALOR-BENCH: a human-annotated image benchmark covering object existence, attributes (color and count), and relations (positional and comparative), with hard cases selected using co-occurrence biases.
VALOR-EVAL: a two‑stage, LLM‑based evaluation that extracts features from free‑form captions and semantically matches them to ground truth, producing faithfulness and coverage scores in open‑vocabulary settings.
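The faithfulness and coverage scores described above reduce to precision and recall over matched features. The following is a minimal sketch of that final scoring step only, assuming features have already been extracted as short strings. The names `score_caption` and `exact_match` are hypothetical, and the exact-match function is a stand-in for the LLM-based semantic matcher that VALOR-EVAL uses; it is not the authors' implementation.

```python
from typing import Callable, Dict, Iterable


def exact_match(predicted: str, reference: str) -> bool:
    """Placeholder matcher (assumption): case-insensitive string equality.

    VALOR-EVAL performs this step with an LLM so that synonyms and
    paraphrases can match; swap in such a matcher here.
    """
    return predicted.strip().lower() == reference.strip().lower()


def score_caption(
    predicted: Iterable[str],
    ground_truth: Iterable[str],
    matcher: Callable[[str, str], bool] = exact_match,
) -> Dict[str, float]:
    """Return faithfulness (precision) and coverage (recall) for one caption."""
    predicted = list(predicted)
    ground_truth = list(ground_truth)

    # A predicted feature is faithful if it matches any annotated feature.
    faithful = sum(
        any(matcher(p, g) for g in ground_truth) for p in predicted
    )
    # An annotated feature is covered if any predicted feature matches it.
    covered = sum(
        any(matcher(p, g) for p in predicted) for g in ground_truth
    )

    # Assumption: an empty side scores 0.0 rather than raising an error.
    faithfulness = faithful / len(predicted) if predicted else 0.0
    coverage = covered / len(ground_truth) if ground_truth else 0.0
    return {"faithfulness": faithfulness, "coverage": coverage}
```

Keeping the matcher as a parameter means the same scoring code works with a fixed synonym list, an embedding similarity threshold, or an LLM call, which makes it easier to compare how much the choice of matcher changes the scores.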
Key Findings
VALOR-EVAL correlates strongly with human judgment on objects and attributes.
Some models, such as Emu2, prioritize precision (faithfulness) and omit many image details.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average faithfulness (model) | Emu2 74.98% (highest) | — | — | VALOR-BENCH | Table 3, Sec. 5.1 | Table 3 |
| Average coverage (model) | GPT-4V 28.0% (highest) | — | — | VALOR-BENCH | Table 3, Sec. 5.1 | Table 3 |
What To Try In 7 Days
Run VALOR-EVAL on your top LVLM to measure faithfulness vs coverage on a small set of images.
Add cases selected by co-occurrence statistics (images where an expected co-occurring object is absent) to your test suite to expose associative hallucinations.
Use LLM-based semantic matching (VALOR-EVAL) instead of fixed synonym lists to evaluate open‑vocabulary outputs.
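One way to build the co-occurrence cases mentioned in the list above is sketched here. It assumes you have image-level object annotations as sets of labels. The function names and the `threshold` parameter are hypothetical, and this is not the selection procedure from the paper, whose details are not given in this summary.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, List, Set, Tuple


def cooccurrence_probs(
    annotations: Dict[str, Set[str]],
) -> Dict[Tuple[str, str], float]:
    """Estimate P(b present | a present) from image-level object sets."""
    single: Counter = Counter()
    pair: Counter = Counter()
    for objects in annotations.values():
        single.update(objects)
        # Ordered pairs, so (a, b) and (b, a) are counted separately.
        pair.update(permutations(objects, 2))
    return {(a, b): n / single[a] for (a, b), n in pair.items()}


def select_hard_cases(
    annotations: Dict[str, Set[str]],
    threshold: float = 0.6,  # assumption: tune on your own data
) -> List[Tuple[str, str, str]]:
    """Find images that contain `a` but lack a frequently co-occurring `b`.

    Returns sorted (image_id, present_object, missing_object) triples.
    """
    probs = cooccurrence_probs(annotations)
    hard = []
    for image_id, objects in annotations.items():
        for (a, b), p in probs.items():
            if p >= threshold and a in objects and b not in objects:
                hard.append((image_id, a, b))
    return sorted(hard)
```

Images returned by `select_hard_cases` are candidates where a model that relies on association may describe the missing object anyway, so they are worth checking by hand before adding them to a test suite.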
Risks & Boundaries
Limitations
VALOR-BENCH focuses on color, count, positional and comparative relations and does not cover every attribute or relation type.
Evaluation uses a single prompt per subset; some models may need different prompt styles for best performance.
When Not To Use
When you need exhaustive attribute coverage beyond color/count, or relation types richer than the positional and comparative ones covered here.
When you cannot afford the compute/cost to run an LLM as the automatic judge.
Failure Modes
LLM judge bias can misalign matches, especially for ambiguous or culturally specific terms.
Positional relation evaluations have lower human correlation than objects/attributes.
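Before relying on the metric for a given dimension, you can check its agreement with your own human ratings. The sketch below computes a Pearson correlation per dimension. This summary does not state which correlation coefficient the authors report, so Pearson is an assumption, and `pearson` and `correlation_by_dimension` are hypothetical helper names.

```python
import math
from typing import Dict, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def correlation_by_dimension(
    scores: Dict[str, Tuple[Sequence[float], Sequence[float]]],
) -> Dict[str, float]:
    """Map each dimension to the correlation of (metric, human) scores."""
    return {
        dimension: pearson(metric, human)
        for dimension, (metric, human) in scores.items()
    }
```

A noticeably lower value for one dimension, as reported here for positional relations, suggests adding human spot checks for that dimension rather than trusting the automatic score alone.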

