VALOR-EVAL: an LLM-driven open‑vocabulary benchmark that measures both hallucination and coverage across objects, attributes, and relations

April 22, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The benchmark and LLM-driven metric are ready for adoption in evaluation pipelines, backed by strong correlation with human judgment, but they require running an LLM judge and careful prompt design.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Haoyi Qiu, Wenbo Hu, Zi-Yi Dou, Nanyun Peng

Links

Abstract / PDF / Code / Data

Why It Matters For Business

VALOR-EVAL detects and quantifies factual errors in vision-language outputs while also measuring how much of an image a model actually describes, so product teams can choose between precise and comprehensive models and set verification policies accordingly.

Who Should Care

Summary TLDR

The authors release VALOR-BENCH, a human-annotated dataset that tests hallucinations in vision-language models across three dimensions: object existence, attributes (color/count), and relations (positional/comparative). They introduce VALOR-EVAL, a two-stage evaluation that uses an LLM (GPT-4) to extract and semantically match features from model captions and then compute faithfulness (precision) and coverage (recall). On 10 LVLMs, VALOR-EVAL correlates strongly with human judgment and reveals trade-offs: some models (e.g., Emu2) are very faithful but sparse, while others (e.g., GPT-4V) cover more but hallucinate more. The code and dataset are available on GitHub.

Problem Statement

Current benchmarks focus mostly on object-existence hallucinations and rely on fixed vocabularies. They miss attribute and relation errors and penalize informative captions by rewarding only precision. We need an open‑vocabulary, multi‑dimensional evaluation that measures both hallucination (faithfulness) and how much of the image a model describes (coverage).

Main Contribution

VALOR-BENCH: a human‑annotated image benchmark covering object existence, attributes (color and count), and relations (positional and comparative) with hard cases selected by co‑occurrence biases.

VALOR-EVAL: a two‑stage, LLM‑based evaluation that extracts features from free‑form captions and semantically matches them to ground truth, producing faithfulness and coverage scores in open‑vocabulary settings.
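The two scores can be illustrated with a minimal sketch, assuming features have already been extracted as plain strings. Faithfulness is computed as precision over the caption's features and coverage as recall over the annotated features. The function names and the `exact_match` matcher are hypothetical stand-ins: VALOR-EVAL performs the matching step with an LLM, which this sketch does not reproduce.

```python
from typing import Callable, Iterable


def normalize(term: str) -> str:
    """Lowercase and trim a feature string for the stand-in matcher."""
    return term.strip().lower()


def exact_match(predicted: str, reference: str) -> bool:
    """Stand-in matcher (assumption): VALOR-EVAL uses an LLM judge here."""
    return normalize(predicted) == normalize(reference)


def faithfulness_and_coverage(
    predicted: Iterable[str],
    reference: Iterable[str],
    matches: Callable[[str, str], bool] = exact_match,
) -> tuple[float, float]:
    """Return (faithfulness, coverage) as precision and recall over features."""
    predicted = list(predicted)
    reference = list(reference)
    # A predicted feature is faithful if it matches any ground-truth feature.
    faithful = sum(any(matches(p, r) for r in reference) for p in predicted)
    # A ground-truth feature is covered if any predicted feature matches it.
    covered = sum(any(matches(p, r) for p in predicted) for r in reference)
    # Returning 0.0 for empty inputs is a convention chosen for this sketch.
    faithfulness = faithful / len(predicted) if predicted else 0.0
    coverage = covered / len(reference) if reference else 0.0
    return faithfulness, coverage


# Toy example (illustrative data, not taken from VALOR-BENCH).
caption_objects = ["dog", "frisbee", "bench"]
annotated_objects = ["dog", "frisbee", "tree", "grass", "person"]
f, c = faithfulness_and_coverage(caption_objects, annotated_objects)
print(f"faithfulness={f:.2f} coverage={c:.2f}")  # faithfulness=0.67 coverage=0.40
```

Passing a different `matches` callable, for example one that wraps an LLM call, is what would make the comparison open-vocabulary; the exact-match default only recognizes identical strings.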

Key Findings

VALOR-EVAL strongly matches human judgment on attributes and objects.

Numbers: Pearson ρ: object faithfulness 0.91, object coverage 0.89; attribute faithfulness up to 0.99

Practical Use: Use VALOR-EVAL to automate human-like evaluation for object and attribute checks; expect high agreement with human raters on these categories.

Evidence Ref: Table 4, Sec. 5.2
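To check agreement on your own data, one option is to compute the Pearson correlation between automatic scores and human ratings for the same set of captions. The sketch below is a plain-Python implementation; the two score lists are made-up placeholders, not numbers from the paper.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores (assumption): per-caption metric output vs. human rating.
metric_scores = [0.62, 0.75, 0.48, 0.90, 0.55]
human_scores = [0.60, 0.80, 0.50, 0.85, 0.58]
print(round(pearson(metric_scores, human_scores), 3))
```

A value close to 1 on a held-out sample would suggest the automatic metric ranks captions similarly to your raters; a low value on a category is a signal to keep human review for that category.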

Some models prioritize precision and omit many image details.

Numbers: Emu2 avg faithfulness 74.98% vs coverage 8.1%

Practical Use: If you need accurate but sparse captions (few false claims), prefer models like Emu2; for broader scene coverage accept higher risk of hallucination.

Evidence Ref: Table 3, Sec. 5.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Average faithfulness (model) | Emu2 74.98% (highest) | - | - | VALOR-BENCH | Table 3, Sec. 5.1 | Table 3
Average coverage (model) | GPT-4V 28.0% (highest) | - | - | VALOR-BENCH | Table 3, Sec. 5.1 | Table 3

What To Try In 7 Days

Run VALOR-EVAL on your top LVLM to measure faithfulness vs coverage on a small set of images.

Add co-occurrence-selected cases (images missing an object that usually co-occurs with one that is present) to your test suite to expose associative hallucinations.

Use LLM-based semantic matching (VALOR-EVAL) instead of fixed synonym lists to evaluate open‑vocabulary outputs.
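For the co-occurrence item above, one plausible way to find candidate images is to count how often object pairs appear together in your annotations and flag images where an object appears without its usual partner. This is a sketch under that assumption, not the selection procedure used to build VALOR-BENCH; the function name, the `min_rate` threshold, and the toy annotations are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_hard_cases(
    annotations: dict[str, set[str]], min_rate: float = 0.6
) -> list[tuple[str, str, str]]:
    """Return (image_id, present_object, missing_partner) triples.

    A case is flagged when present_object appears without a partner that
    accompanies it in at least min_rate of the images containing it.
    """
    object_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for objects in annotations.values():
        object_counts.update(objects)
        # Count each unordered pair in both directions for easy lookup.
        for a, b in combinations(sorted(objects), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1

    cases = []
    for image_id, objects in annotations.items():
        for obj in sorted(objects):
            for (a, partner), n in pair_counts.items():
                rate = n / object_counts[obj]
                if a == obj and partner not in objects and rate >= min_rate:
                    cases.append((image_id, obj, partner))
    return cases


# Toy annotations (illustrative only): image id -> objects present.
annotations = {
    "img1": {"table", "chair"},
    "img2": {"table", "chair", "cup"},
    "img3": {"table", "chair"},
    "img4": {"table"},
}
print(cooccurrence_hard_cases(annotations))  # [('img4', 'table', 'chair')]
```

Images flagged this way are the ones where a model relying on association rather than the pixels is most likely to mention the missing partner.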

Agent Features

Tool Use: LLM-based evaluation (GPT-4 used as a judge)
Frameworks: VALOR-EVAL

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

VALOR-BENCH focuses on color, count, positional and comparative relations and does not cover every attribute or relation type.

Evaluation uses a single prompt per subset; some models may need different prompt styles for best performance.

When Not To Use

When you need exhaustive attributes beyond color/count or richer relation types not covered here.

When you cannot afford the compute/cost to run an LLM as the automatic judge.

Failure Modes

LLM judge bias can misalign matches, especially for ambiguous or culturally specific terms.

Positional relation evaluations have lower human correlation than objects/attributes.

Core Entities

Models

InstructBLIP, LLaVA-1.5, MiniGPT-4 v2, mPLUG-Owl2, BLIVA, CogVLM, InternLM-XComposer2, Qwen-VL-Chat, Emu2, GPT-4V

Metrics

faithfulness, coverage, VALOR-EVAL (LLM-based CHAIR generalization), CHAIR

Datasets

VALOR-BENCH (this paper), GQA, MSCOCO, Pixel/Pexels

Benchmarks

VALOR-BENCH, CHAIR, POPE, HaELM, HallusionBench, Halle-Switch, NOPE, Bingo, FaithScore, AMBER, MERLIM