VALOR-EVAL: an LLM-driven open‑vocabulary benchmark that measures both hallucination and coverage across objects, attributes, and relations

Overview

Decision Snapshot: Ready For Pilot

The benchmark and its LLM-driven metric are ready to pilot in evaluation pipelines: the scores correlate strongly with human judgment, but the method requires running an LLM judge and careful prompt design.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Haoyi Qiu, Wenbo Hu, Zi-Yi Dou, Nanyun Peng

Why It Matters For Business

VALOR-EVAL helps detect and quantify factual errors in vision-language outputs while measuring how much of an image models actually describe, so product teams can choose between precise vs. comprehensive models and set verification policies accordingly.

Who Should Care

ML Engineer, Product Manager, CTO, Founder

Summary TLDR

The authors release VALOR-BENCH, a human-annotated dataset that tests hallucinations in vision-language models across three dimensions: object existence, attributes (color/count), and relations (positional/comparative). They introduce VALOR-EVAL, a two-stage evaluation that uses an LLM (GPT-4) to extract and semantically match features from model captions and then compute faithfulness (precision) and coverage (recall). On 10 LVLMs, VALOR-EVAL correlates strongly with human judgment and reveals trade-offs: some models (e.g., Emu2) are very faithful but sparse, while others (e.g., GPT-4V) cover more but hallucinate more. The code and dataset are available on GitHub.

Problem Statement

Current benchmarks focus mostly on object-existence hallucinations and rely on fixed vocabularies. They miss attribute and relation errors and penalize informative captions by rewarding only precision. We need an open‑vocabulary, multi‑dimensional evaluation that measures both hallucination (faithfulness) and how much of the image a model describes (coverage).

Main Contribution

VALOR-BENCH: a human‑annotated image benchmark covering object existence, attributes (color and count), and relations (positional and comparative) with hard cases selected by co‑occurrence biases.

VALOR-EVAL: a two‑stage, LLM‑based evaluation that extracts features from free‑form captions and semantically matches them to ground truth, producing faithfulness and coverage scores in open‑vocabulary settings.
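As a rough illustration of the precision/recall reading of faithfulness and coverage, the sketch below scores one caption once features have been extracted. It is not the authors' implementation: the function names are hypothetical, `exact_match` is a deliberately naive stand-in for the LLM-based semantic matcher, and scoring empty inputs as 0.0 is an assumed convention.

```python
from typing import Callable, Iterable, Tuple


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.lower().split())


def exact_match(generated: str, reference: str) -> bool:
    """Naive stand-in for the LLM judge (assumption): normalized string equality."""
    return normalize(generated) == normalize(reference)


def faithfulness_and_coverage(
    generated: Iterable[str],
    ground_truth: Iterable[str],
    is_match: Callable[[str, str], bool] = exact_match,
) -> Tuple[float, float]:
    """Return (faithfulness, coverage) for one caption's extracted features."""
    generated = list(generated)
    ground_truth = list(ground_truth)
    # A generated feature counts as faithful if it matches any ground-truth feature.
    faithful = sum(any(is_match(g, t) for t in ground_truth) for g in generated)
    # A ground-truth feature counts as covered if any generated feature matches it.
    covered = sum(any(is_match(g, t) for g in generated) for t in ground_truth)
    # Assumed convention: an empty list on either side scores 0.0.
    faithfulness = faithful / len(generated) if generated else 0.0
    coverage = covered / len(ground_truth) if ground_truth else 0.0
    return faithfulness, coverage


scores = faithfulness_and_coverage(
    ["Dog", "frisbee", "cat"],
    ["dog", "frisbee", "tree", "grass"],
)
print(scores)
```

In this toy input, two of the three generated features are supported and two of the four annotated features are mentioned, so the caption is fairly faithful but covers only half of the annotations. Swapping `is_match` for an LLM-backed matcher is what would make the comparison open-vocabulary.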

Key Findings

VALOR-EVAL scores correlate strongly with human judgment on objects and attributes.

Numbers: Pearson ρ of 0.91 for object faithfulness and 0.89 for object coverage; attribute faithfulness up to 0.99

Practical Use: Use VALOR-EVAL to automate human-like evaluation for object and attribute checks; expect high agreement with human raters on these categories.

Evidence Ref: Table 4, Sec. 5.2

Some models prioritize precision and omit many image details.

Numbers: Emu2 average faithfulness 74.98% vs. coverage 8.1%

Practical Use: If you need accurate but sparse captions (few false claims), prefer models like Emu2; for broader scene coverage, accept a higher risk of hallucination.

Evidence Ref: Table 3, Sec. 5.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average faithfulness (model)	Emu2 74.98% (highest)	—	—	VALOR-BENCH	Table 3, Sec. 5.1	Table 3
Average coverage (model)	GPT-4V 28.0% (highest)	—	—	VALOR-BENCH	Table 3, Sec. 5.1	Table 3

What To Try In 7 Days

Run VALOR-EVAL on your top LVLM to measure faithfulness vs coverage on a small set of images.

Add test cases selected by co‑occurrence bias (images where an object that usually appears alongside another is absent) to your test suite to expose associative hallucinations.

Use LLM-based semantic matching (VALOR-EVAL) instead of fixed synonym lists to evaluate open‑vocabulary outputs.
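One possible way to build the co-occurrence cases mentioned above is sketched below. It is an interpretation, not the procedure used to construct VALOR-BENCH: the function `select_cooccurrence_cases`, the conditional-probability rule, and the `min_prob` and `min_support` thresholds are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def select_cooccurrence_cases(annotations, min_prob=0.7, min_support=2):
    """Find images where object `a` is present but its usual companion `b` is absent.

    `annotations` maps an image id to the set of annotated object names.
    A pair (a, b) is "expected" when P(b present | a present) >= min_prob
    and `a` appears in at least `min_support` images.
    """
    object_counts = Counter()
    pair_counts = Counter()
    for objects in annotations.values():
        object_counts.update(objects)
        # Ordered pairs (a, b) with a != b, counted once per image.
        pair_counts.update(permutations(sorted(objects), 2))

    cases = []
    for image_id, objects in annotations.items():
        for (a, b), together in pair_counts.items():
            if object_counts[a] < min_support:
                continue
            expected = together / object_counts[a] >= min_prob
            if expected and a in objects and b not in objects:
                cases.append((image_id, a, b))
    return sorted(cases)


annotations = {
    "img1": {"fork", "knife", "plate"},
    "img2": {"fork", "knife"},
    "img3": {"fork", "knife", "cup"},
    "img4": {"fork", "cup"},
}
print(select_cooccurrence_cases(annotations))
```

In this toy data a knife accompanies a fork in three of four images, so `img4` (a fork without a knife) is flagged as a case where a model might hallucinate the knife.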

Agent Features

Tool Use

LLM-based evaluation (GPT-4 used as a judge)
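A minimal sketch of how an LLM judge could be wired in for semantic matching follows. The prompt wording, the JSON answer format, and the names `build_match_prompt` and `match_features` are assumptions for illustration and are not taken from the paper; the `judge` argument is any callable that sends a prompt to a model and returns its text reply, so no specific API client is implied.

```python
import json
from typing import Callable, Dict, List, Optional


def build_match_prompt(generated: List[str], ground_truth: List[str]) -> str:
    """Assemble a matching prompt (hypothetical wording, not the paper's)."""
    return (
        "Match features from an image caption to reference annotations.\n"
        "For each generated feature, give the reference feature with the same "
        "meaning, or null if there is none.\n"
        "Answer with a JSON object mapping each generated feature to its match.\n"
        f"Generated features: {json.dumps(generated)}\n"
        f"Reference features: {json.dumps(ground_truth)}\n"
    )


def match_features(
    generated: List[str],
    ground_truth: List[str],
    judge: Callable[[str], str],
) -> Dict[str, Optional[str]]:
    """Ask the judge for matches and keep only answers that name a real reference."""
    raw = judge(build_match_prompt(generated, ground_truth))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}

    allowed = set(ground_truth)
    result: Dict[str, Optional[str]] = {}
    for feature in generated:
        candidate = parsed.get(feature)
        # Reject anything outside the reference list so the judge cannot invent matches.
        is_valid = isinstance(candidate, str) and candidate in allowed
        result[feature] = candidate if is_valid else None
    return result


# Stub judge for demonstration; a real one would call an LLM.
def stub_judge(prompt: str) -> str:
    return '{"puppy": "dog", "frisbee": "frisbee", "cat": "lion"}'


print(match_features(["puppy", "frisbee", "cat"], ["dog", "frisbee"], stub_judge))
```

Validating the reply against the reference list is a defensive choice: a malformed or over-eager judge response then degrades to "no match" rather than silently inflating faithfulness.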

Frameworks

VALOR-EVAL

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/haoyiq114/VALOR

Data URLs

https://github.com/haoyiq114/VALOR

Risks & Boundaries

Limitations

VALOR-BENCH focuses on color and count attributes and on positional and comparative relations; it does not cover every attribute or relation type.

Evaluation uses a single prompt per subset; some models may need different prompt styles for best performance.

When Not To Use

When you need exhaustive attributes beyond color/count or richer relation types not covered here.

When you cannot afford the compute/cost to run an LLM as the automatic judge.

Failure Modes

LLM judge bias can misalign matches, especially for ambiguous or culturally specific terms.

Positional relation evaluations have lower human correlation than objects/attributes.

Core Entities

Models

InstructBLIP, LLaVA-1.5, MiniGPT-4 v2, mPLUG-Owl2, BLIVA, CogVLM, InternLM-XComposer2, Qwen-VL-Chat, Emu2, GPT-4V

Metrics

faithfulness, coverage, VALOR-EVAL (LLM-based CHAIR generalization), CHAIR

Datasets

VALOR-BENCH (this paper), GQA, MSCOCO, Pixel/Pexels

Benchmarks

VALOR-BENCH, CHAIR, POPE, HaELM, HallusionBench, Halle-Switch, NOPE, Bingo, FaithScore, AMBER, MERLIM
