VALOR-EVAL: an LLM-driven open‑vocabulary benchmark that measures both hallucination and coverage across objects, attributes, and relations

April 22, 2024 · 7 min read

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Haoyi Qiu, Wenbo Hu, Zi-Yi Dou, Nanyun Peng

Links

Abstract / PDF

Why It Matters For Business

VALOR-EVAL detects and quantifies factual errors in vision-language outputs and also measures how much of an image a model actually describes. Product teams can use both scores to choose between precise and comprehensive models and to set verification policies accordingly.

Summary TLDR

The authors release VALOR-BENCH, a human-annotated dataset that tests hallucinations in vision-language models across three dimensions: object existence, attributes (color/count), and relations (positional/comparative). They introduce VALOR-EVAL, a two-stage evaluation that uses an LLM (GPT-4) to extract and semantically match features from model captions and then compute faithfulness (precision) and coverage (recall). On 10 LVLMs, VALOR-EVAL correlates strongly with human judgment and reveals trade-offs: some models (e.g., Emu2) are very faithful but sparse, while others (e.g., GPT-4V) cover more but hallucinate more. The code and dataset are available on GitHub.
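As a rough illustration of the scoring step, the sketch below computes faithfulness as the fraction of caption features that match some ground-truth feature, and coverage as the fraction of ground-truth features that the caption matches. The function names and the `exact_match` placeholder are assumptions made for this example. In VALOR-EVAL the extraction and matching are performed by GPT-4, and the paper's exact formulas may differ.

```python
from typing import Callable, List, Optional, Tuple


def faithfulness_and_coverage(
    generated: List[str],
    ground_truth: List[str],
    matches: Callable[[str, str], bool],
) -> Tuple[Optional[float], Optional[float]]:
    """Precision-style faithfulness and recall-style coverage.

    `matches` stands in for the semantic matcher (an LLM in VALOR-EVAL).
    Either score is None when its denominator is empty.
    """
    # A generated feature is faithful if it matches any ground-truth feature.
    faithful = sum(any(matches(g, t) for t in ground_truth) for g in generated)
    # A ground-truth feature is covered if any generated feature matches it.
    covered = sum(any(matches(g, t) for g in generated) for t in ground_truth)
    faithfulness = faithful / len(generated) if generated else None
    coverage = covered / len(ground_truth) if ground_truth else None
    return faithfulness, coverage


def exact_match(a: str, b: str) -> bool:
    # Hypothetical placeholder matcher: case-insensitive string equality.
    return a.strip().lower() == b.strip().lower()


faith, cov = faithfulness_and_coverage(
    ["dog", "frisbee", "cat"],            # features extracted from a caption
    ["dog", "frisbee", "grass", "tree"],  # annotated ground truth
    exact_match,
)
print(faith, cov)
```

In this toy case two of the three generated features are supported and two of the four annotated features are mentioned, so a caption can score well on one axis and poorly on the other.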

Problem Statement

Current benchmarks focus mostly on object-existence hallucinations and rely on fixed vocabularies. They miss attribute and relation errors and penalize informative captions by rewarding only precision. We need an open‑vocabulary, multi‑dimensional evaluation that measures both hallucination (faithfulness) and how much of the image a model describes (coverage).

Main Contribution

VALOR-BENCH: a human‑annotated image benchmark covering object existence, attributes (color and count), and relations (positional and comparative) with hard cases selected by co‑occurrence biases.

VALOR-EVAL: a two‑stage, LLM‑based evaluation that extracts features from free‑form captions and semantically matches them to ground truth, producing faithfulness and coverage scores in open‑vocabulary settings.

Large-scale evaluation: measured 10 mainstream LVLMs and showed strong correlation between VALOR-EVAL and human judgments, exposing faithfulness/coverage trade-offs across models.

Key Findings

VALOR-EVAL scores correlate strongly with human judgments for objects and attributes.

Numbers: Pearson ρ of 0.91 for object faithfulness and 0.89 for object coverage; up to 0.99 for attribute faithfulness

Some models prioritize precision and omit many image details.

Numbers: Emu2 average faithfulness 74.98% vs. coverage 8.1%

Some models aim for breadth and include many details but hallucinate more.

Numbers: GPT-4V average coverage 28.0% and faithfulness 61.6%

Co‑occurrence based image selection creates a more challenging benchmark than random selection.

Numbers: LLaVA-1.5 faithfulness drops 12.4 points vs. random selection

LLM‑augmented matching outperforms fixed‑vocabulary CHAIR for hallucination detection.

Numbers: Acc(F) improves by +60 to +77.8 points across models (Table 6 of the paper)
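To see why a fixed vocabulary limits hallucination detection, consider the simplified check below. It is not the original CHAIR implementation: the synonym table, function names, and example words are all hypothetical, and the point is only that a mention outside the vocabulary can be neither confirmed nor flagged.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical fixed vocabulary: canonical object name -> accepted surface forms.
SYNONYMS: Dict[str, Set[str]] = {
    "dog": {"dog", "puppy"},
    "couch": {"couch", "sofa"},
}


def canonicalize(word: str) -> Optional[str]:
    """Map a word to its canonical object, or None if out of vocabulary."""
    lowered = word.lower()
    for canonical, variants in SYNONYMS.items():
        if lowered in variants:
            return canonical
    return None


def fixed_vocab_check(
    mentioned: List[str], present: List[str]
) -> Tuple[List[str], List[str]]:
    """Return (flagged hallucinations, mentions the vocabulary cannot judge)."""
    present_canonical = {canonicalize(p) for p in present} - {None}
    flagged, ignored = [], []
    for mention in mentioned:
        canonical = canonicalize(mention)
        if canonical is None:
            ignored.append(mention)   # invisible to a fixed-vocabulary metric
        elif canonical not in present_canonical:
            flagged.append(mention)   # known object that is absent from the image
    return flagged, ignored


flagged, ignored = fixed_vocab_check(
    mentioned=["puppy", "sofa", "skateboard"],  # objects named in a caption
    present=["dog"],                            # objects annotated in the image
)
print(flagged, ignored)
```

Here "sofa" is caught, but "skateboard" is silently skipped because it is not in the table. Replacing the table lookup with an LLM-based semantic matcher is the kind of change the finding above refers to.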

Results

Average faithfulness (model)

Value: Emu2 74.98% (highest)

Average coverage (model)

Value: GPT-4V 28.0% (highest)

Object existence faithfulness correlation with humans

Value: ρ = 0.91

Attribute (object) faithfulness correlation with humans

Value: ρ = 0.99

Co-occurrence selection vs random (LLaVA-1.5 faithfulness)

Value: 72.1% (co-occurrence) vs. 84.5% (random)

Baseline: random selection

CHAIR LLM vs CHAIR (Acc(F))

Value: improvements of +60 to +77.78 points

Baseline: original CHAIR
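The human-correlation rows above are Pearson coefficients between metric scores and human scores. A minimal sketch of that computation follows. The function name is an assumption, and the two score lists are placeholder values for illustration only, not data from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations, then the two deviation norms.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


# Placeholder per-caption scores (NOT from the paper).
metric_scores = [0.62, 0.75, 0.58, 0.81]
human_scores = [0.60, 0.78, 0.55, 0.80]
print(round(pearson(metric_scores, human_scores), 3))
```

A value near 1 means the automatic metric ranks and spaces captions much as human annotators do. It does not by itself show that the absolute scores agree.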

What To Try In 7 Days

Run VALOR-EVAL on your top LVLM to measure faithfulness vs coverage on a small set of images.

Add co-occurrence-selected cases to your test suite: images in which an object that usually accompanies another is missing. These expose associative hallucinations.

Use LLM-based semantic matching (VALOR-EVAL) instead of fixed synonym lists to evaluate open‑vocabulary outputs.
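One way to build the co-occurrence cases suggested above is sketched here. It assumes you have per-image object annotations, estimates how often object b appears given object a, and returns images that contain a but lack a frequently co-occurring b. The function name, the `min_prob` threshold, and the toy annotations are assumptions for this example, and the paper's actual selection procedure may differ.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, List, Set, Tuple


def cooccurrence_hard_cases(
    annotations: Dict[str, Set[str]], min_prob: float = 0.6
) -> List[Tuple[str, str, str]]:
    """Return (image_id, present_object, missing_partner) triples."""
    object_count: Counter = Counter()
    pair_count: Counter = Counter()
    for objects in annotations.values():
        object_count.update(objects)
        # Ordered pairs (a, b): b appears in an image that contains a.
        pair_count.update(permutations(sorted(objects), 2))

    hard_cases = []
    for image_id, objects in annotations.items():
        for (a, b), together in pair_count.items():
            # Estimated P(b | a) from the annotation counts.
            prob_b_given_a = together / object_count[a]
            if a in objects and b not in objects and prob_b_given_a >= min_prob:
                hard_cases.append((image_id, a, b))
    return sorted(hard_cases)


# Toy annotations (hypothetical image ids and objects).
annotations = {
    "img1": {"fork", "plate", "table"},
    "img2": {"fork", "plate"},
    "img3": {"fork", "plate", "cup"},
    "img4": {"fork", "table"},
}
hard = cooccurrence_hard_cases(annotations)
print(hard)
```

In the toy data a plate accompanies a fork in three of four images, so the image with a fork and no plate is selected as a case where a model may hallucinate the plate.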

Agent Features

Tool Use

  • LLM-based evaluation (GPT-4 used as a judge)

Frameworks

  • VALOR-EVAL

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • VALOR-BENCH focuses on color, count, positional and comparative relations and does not cover every attribute or relation type.
  • Evaluation uses a single prompt per subset; some models may need different prompt styles for best performance.
  • VALOR-EVAL relies on GPT-4 as the judge, so any GPT-4 biases can affect scores.

When Not To Use

  • When you need exhaustive attributes beyond color/count or richer relation types not covered here.
  • When you cannot afford the compute/cost to run an LLM as the automatic judge.
  • When you require model evaluation under many alternate prompt styles without per-prompt calibration.

Failure Modes

  • LLM judge bias can misalign matches, especially for ambiguous or culturally specific terms.
  • Positional relation evaluations have lower human correlation than objects/attributes.
  • Models that avoid describing people (e.g., GPT-4V in some settings) can yield missing scores or skewed coverage.
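The last failure mode matters when averaging: if a model declines to describe some images, its score on that subset is undefined rather than zero. A small sketch of one way to handle this is shown below, reporting the average together with the number of subsets that contributed. The function name, subset labels, and values are hypothetical, and this is not necessarily how the authors aggregate scores.

```python
from typing import Dict, Optional, Tuple


def mean_defined(scores: Dict[str, Optional[float]]) -> Tuple[Optional[float], int]:
    """Average only the subsets that produced a score.

    Returns (mean, number_of_contributing_subsets); the mean is None
    when no subset produced a score.
    """
    defined = [value for value in scores.values() if value is not None]
    if not defined:
        return None, 0
    return sum(defined) / len(defined), len(defined)


# Hypothetical per-subset faithfulness; None marks a subset with no usable output.
subset_scores = {"object": 0.5, "attribute_people": None, "relation": 0.75}
average, contributing = mean_defined(subset_scores)
print(average, contributing)
```

Reporting the count alongside the mean makes it visible when two models' averages are computed over different numbers of subsets.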

Core Entities

Models

  • InstructBLIP
  • LLaVA-1.5
  • MiniGPT-4 v2
  • mPLUG-Owl2
  • BLIVA
  • CogVLM
  • InternLM-XComposer2
  • Qwen-VL-Chat
  • Emu2
  • GPT-4V

Metrics

  • faithfulness
  • coverage
  • VALOR-EVAL (LLM-based CHAIR generalization)
  • CHAIR

Datasets

  • VALOR-BENCH (this paper)
  • GQA
  • MSCOCO
  • Pixel/Pexels

Benchmarks

  • VALOR-BENCH
  • CHAIR
  • POPE
  • HaELM
  • HallusionBench
  • Halle-Switch
  • NOPE
  • Bingo
  • FaithScore
  • AMBER
  • MERLIM