Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
VALOR-EVAL detects and quantifies factual errors in vision-language outputs and also measures how much of an image a model actually describes, so product teams can choose between precise and comprehensive models and set verification policies accordingly.
Summary TLDR
The authors release VALOR-BENCH, a human-annotated dataset that tests hallucinations in vision-language models across three dimensions: object existence, attributes (color/count), and relations (positional/comparative). They introduce VALOR-EVAL, a two-stage evaluation that uses an LLM (GPT-4) to extract and semantically match features from model captions and then compute faithfulness (precision) and coverage (recall). On 10 LVLMs, VALOR-EVAL correlates strongly with human judgment and reveals trade-offs: some models (e.g., Emu2) are very faithful but sparse, while others (e.g., GPT-4V) cover more but hallucinate more. The code and dataset are available on GitHub.
Problem Statement
Current benchmarks focus mostly on object-existence hallucinations and rely on fixed vocabularies. They miss attribute and relation errors and penalize informative captions by rewarding only precision. We need an open‑vocabulary, multi‑dimensional evaluation that measures both hallucination (faithfulness) and how much of the image a model describes (coverage).
Main Contribution
VALOR-BENCH: a human‑annotated image benchmark covering object existence, attributes (color and count), and relations (positional and comparative) with hard cases selected by co‑occurrence biases.
VALOR-EVAL: a two‑stage, LLM‑based evaluation that extracts features from free‑form captions and semantically matches them to ground truth, producing faithfulness and coverage scores in open‑vocabulary settings.
Large-scale evaluation: measured 10 mainstream LVLMs and showed strong correlation between VALOR-EVAL and human judgments, exposing faithfulness/coverage trade-offs across models.
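The faithfulness and coverage scores described above can be read as precision and recall over matched features. The following is a minimal sketch under that reading, not the authors' implementation: the function names are hypothetical, and the `exact_match` placeholder stands in for the LLM-based semantic matcher, which this summary does not specify in detail.

```python
from typing import Callable, Iterable, Optional, Tuple


def exact_match(predicted: str, reference: str) -> bool:
    """Placeholder matcher (assumption): case-insensitive string equality.

    VALOR-EVAL uses an LLM for semantic matching; swap in your own
    matcher with the same (predicted, reference) -> bool signature.
    """
    return predicted.strip().lower() == reference.strip().lower()


def faithfulness_and_coverage(
    predicted: Iterable[str],
    ground_truth: Iterable[str],
    matches: Callable[[str, str], bool] = exact_match,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (faithfulness, coverage) as precision and recall.

    A score is None when its denominator is zero, e.g. when the model
    produced no features at all.
    """
    predicted = list(predicted)
    ground_truth = list(ground_truth)

    # Predicted features supported by at least one ground-truth feature.
    supported = [p for p in predicted if any(matches(p, g) for g in ground_truth)]
    # Ground-truth features mentioned by at least one predicted feature.
    covered = [g for g in ground_truth if any(matches(p, g) for p in predicted)]

    faithfulness = len(supported) / len(predicted) if predicted else None
    coverage = len(covered) / len(ground_truth) if ground_truth else None
    return faithfulness, coverage


# Illustrative (made-up) features for a single image.
faith, cov = faithfulness_and_coverage(
    predicted=["dog", "red ball", "frisbee"],
    ground_truth=["Dog", "red ball", "tree", "bench"],
)
```

In this made-up example, two of the three predicted features are supported and two of the four ground-truth features are covered, which illustrates how a caption can be fairly faithful while still describing only part of the image.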
Key Findings
VALOR-EVAL scores correlate strongly with human judgment on object existence and attributes; correlation is lower for positional relations.
Some models (e.g., Emu2) prioritize precision and omit many image details.
Some models (e.g., GPT-4V) aim for breadth and include many details but hallucinate more.
Co‑occurrence based image selection creates a more challenging benchmark than random selection.
LLM‑augmented matching outperforms fixed‑vocabulary CHAIR for hallucination detection.
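To check whether an automatic score tracks human judgment on your own data, one option is to correlate per-sample metric scores with human ratings. The sketch below assumes a Pearson correlation; this summary does not state which coefficient the authors used, and the function name and sample values are illustrative only.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Sums of centered products and centered squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-caption scores: automatic metric vs. human rating.
metric_scores = [0.90, 0.40, 0.75, 0.20]
human_scores = [0.85, 0.50, 0.70, 0.10]
agreement = pearson(metric_scores, human_scores)
```

A value near 1 would suggest the metric ranks captions much as human raters do; the sample numbers here are not results from the paper.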
Results
Average faithfulness (model)
Average coverage (model)
Object existence faithfulness correlation with humans
Attribute (object) faithfulness correlation with humans
Co-occurrence selection vs random (LLaVA-1.5 faithfulness)
CHAIR LLM vs CHAIR (Acc(F))
Who Should Care
What To Try In 7 Days
Run VALOR-EVAL on your top LVLM to measure faithfulness vs coverage on a small set of images.
Add co‑occurrence-selected cases (images where an object that usually accompanies another is absent) to your test suite to expose associative hallucinations.
Use LLM-based semantic matching (VALOR-EVAL) instead of fixed synonym lists to evaluate open‑vocabulary outputs.
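For the co-occurrence step above, one possible way to find candidate images is to count object pairs in existing annotations and flag images that contain one object of a frequent pair but not the other. This is a rough sketch of that idea, not the authors' selection procedure; the function names, the `min_count` threshold, and the sample annotations are all assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Set, Tuple


def cooccurrence_counts(annotations: Dict[str, Set[str]]) -> Counter:
    """Count how many images contain each unordered object pair."""
    counts: Counter = Counter()
    for objects in annotations.values():
        for pair in combinations(sorted(objects), 2):
            counts[pair] += 1
    return counts


def select_hard_cases(
    annotations: Dict[str, Set[str]], min_count: int = 2
) -> Dict[str, Set[str]]:
    """Map image id -> objects that usually co-occur but are absent here."""
    counts = cooccurrence_counts(annotations)
    frequent: Tuple[Tuple[str, str], ...] = tuple(
        pair for pair, n in counts.items() if n >= min_count
    )

    hard: Dict[str, Set[str]] = {}
    for image_id, objects in annotations.items():
        missing = set()
        for a, b in frequent:
            # Exactly one member of a frequent pair is present.
            if a in objects and b not in objects:
                missing.add(b)
            elif b in objects and a not in objects:
                missing.add(a)
        if missing:
            hard[image_id] = missing
    return hard


# Made-up annotations: image id -> annotated object labels.
annotations = {
    "img1": {"fork", "knife"},
    "img2": {"fork", "knife", "plate"},
    "img3": {"fork"},
    "img4": {"plate"},
}
hard_cases = select_hard_cases(annotations, min_count=2)
```

With these sample annotations, only `img3` is flagged: it shows a fork without the knife that accompanies it elsewhere, which is the kind of image where a model may describe the absent object anyway.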
Agent Features
Tool Use
- LLM-based evaluation (GPT-4 used as a judge)
Frameworks
- VALOR-EVAL
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- VALOR-BENCH focuses on color, count, positional and comparative relations and does not cover every attribute or relation type.
- Evaluation uses a single prompt per subset; some models may need different prompt styles for best performance.
- VALOR-EVAL relies on GPT-4 as the judge, so any GPT-4 biases can affect scores.
When Not To Use
- When you need exhaustive attributes beyond color/count or richer relation types not covered here.
- When you cannot afford the compute/cost to run an LLM as the automatic judge.
- When you require model evaluation under many alternate prompt styles without per-prompt calibration.
Failure Modes
- LLM judge bias can misalign matches, especially for ambiguous or culturally specific terms.
- Positional relation evaluations have lower human correlation than objects/attributes.
- Models that avoid describing people (e.g., GPT-4V in some settings) can yield missing scores or skewed coverage.
Core Entities
Models
- InstructBLIP
- LLaVA-1.5
- MiniGPT-4 v2
- mPLUG-Owl2
- BLIVA
- CogVLM
- InternLM-XComposer2
- Qwen-VL-Chat
- Emu2
- GPT-4V
Metrics
- faithfulness
- coverage
- VALOR-EVAL (an LLM-based generalization of CHAIR)
- CHAIR
Datasets
- VALOR-BENCH (this paper)
- GQA
- MSCOCO
- Pixel/Pexels
Benchmarks
- VALOR-BENCH
- CHAIR
- POPE
- HaELM
- HallusionBench
- Halle-Switch
- NOPE
- Bingo
- FaithScore
- AMBER
- MERLIM

