HOPE: searching for image-specific, highly misleading distractors to better expose object hallucinations in LVLMs

August 3, 2025 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.55

Cost Impact Score

0.3

Citation Count

0

Authors

Ming-Kun Xie, Jia-Hao Xiao, Gang Niu, Lei Feng, Zhiqiang Kou, Min-Ling Zhang, Masashi Sugiyama

Links

Abstract / PDF

Why It Matters For Business

Existing benchmarks can understate hallucination. Use HOPE-style, image-aware distractors to better find model failures before deployment and reduce risky product behavior.

Summary TLDR

The paper introduces HOPE, a benchmark that builds harder distractors to reveal object hallucination in large vision-language models (LVLMs). It searches for distractors with three strategies: category-oriented (co-occurrence and visual similarity), content-aware (uses CLIP to pick image-specific negatives), and description-based (true object + false description). Across multiple LVLMs and datasets, HOPE lowers precision by roughly 9–24 points relative to POPE, which indicates that existing benchmarks underestimate hallucination. Code and a toolkit are provided for reproducible QA construction.

Problem Statement

Current object-hallucination tests (like POPE) sample generic negative categories and ignore image-specific ambiguity. As LVLMs improve, these simple distractors stop stressing models, so benchmarks under-report hallucination. We need a way to generate the most misleading, instance-dependent distractors to test real robustness.

Main Contribution

HOPE benchmark: formalizes hallucination evaluation as a search for distractors most likely to trigger hallucination.

Three search strategies: category-oriented (co-occurrence + visual similarity), content-aware (CLIP-based image grounding), and description-based (true object + false attribute/state).

Toolkit and data pipeline to build QA pairs, including human verification and flexible sampling.

Empirical study across Objects365, MS-COCO, Visual Genome (VG), and OpenImages showing that HOPE reveals more hallucination than POPE.
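The co-occurrence half of the category-oriented strategy can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the scoring rule (summed pair counts), the function names, and the toy annotations are all assumptions, and the visual-similarity half is left out.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(image_labels):
    """Count how often each unordered pair of categories shares an image."""
    counts = Counter()
    for labels in image_labels:
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts


def rank_category_distractors(present, vocabulary, counts, k=3):
    """Rank absent categories by total co-occurrence with the present ones.

    The summed-count score is an assumption made for this sketch.
    """
    present = set(present)
    scores = {}
    for cand in vocabulary:
        if cand in present:
            continue  # a distractor must be absent from the image
        scores[cand] = sum(counts[tuple(sorted((cand, p)))] for p in present)
    # Highest score first; alphabetical order breaks ties deterministically.
    return sorted(scores, key=lambda c: (-scores[c], c))[:k]


# Hypothetical annotations, one label set per image.
toy_labels = [
    {"fork", "knife", "plate"},
    {"fork", "plate", "cup"},
    {"knife", "plate"},
    {"dog", "frisbee"},
]
counts = cooccurrence_counts(toy_labels)
vocab = ["cup", "dog", "fork", "frisbee", "knife", "plate"]
top = rank_category_distractors({"plate", "cup"}, vocab, counts, k=2)
print(top)
```

For an image containing a plate and a cup, the absent categories that most often appear alongside them in the toy data rank first, which is the intuition behind co-occurrence-based negatives.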

Key Findings

HOPE's description-based distractors lower model precision much more than POPE.

Numbers: Δ precision −9.2% to −23.7%, average −16.3% (Table 2)

Content-aware searching (image-specific via CLIP) is the single most effective strategy.

Combining the three searching strategies produces largely complementary distractors and a stronger attack.
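A content-aware search can be pictured as ranking absent categories by how close their text embedding is to the image embedding. The sketch below assumes the embeddings have already been computed by a CLIP-like encoder; the three-dimensional vectors, the function names, and the top-k rule are hypothetical stand-ins, not the paper's procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def content_aware_distractors(image_emb, text_embs, present, k=2):
    """Return the k absent categories whose text embedding best matches the image."""
    scored = [
        (cosine(image_emb, emb), name)
        for name, emb in text_embs.items()
        if name not in present  # only objects not in the image can be distractors
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [name for _, name in scored[:k]]


# Hypothetical 3-d embeddings standing in for CLIP image/text features.
image_emb = [0.9, 0.1, 0.0]
text_embs = {
    "dog": [1.0, 0.0, 0.0],       # ground-truth object, excluded below
    "wolf": [0.8, 0.2, 0.0],
    "cat": [0.5, 0.5, 0.0],
    "airplane": [0.0, 0.0, 1.0],
}
hard = content_aware_distractors(image_emb, text_embs, present={"dog"}, k=2)
print(hard)
```

Because the ranking depends on the individual image rather than on dataset-level statistics, two images with the same ground-truth objects can receive different distractors.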

Results

Precision (description-based HOPE vs POPE baseline)

  • LLaVA-Next: 66.51% (HOPE) vs. 90.04% (POPE)
  • Qwen2.5-VL: 81.35% (HOPE) vs. 93.98% (POPE)
  • InternVL2.5: 80.72% (HOPE) vs. 89.76% (POPE)
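As a quick worked check, the absolute precision gap for the three models listed here can be computed directly from the reported values. The dictionary layout is only for illustration; these three rows are a subset of the paper's results, so their gaps need not match the range quoted from Table 2 exactly.

```python
# Precision values (%) copied from the results listed in this section.
results = {
    "LLaVA-Next": {"hope": 66.51, "pope": 90.04},
    "Qwen2.5-VL": {"hope": 81.35, "pope": 93.98},
    "InternVL2.5": {"hope": 80.72, "pope": 89.76},
}

# Absolute drop in percentage points when moving from POPE to HOPE.
drops = {model: round(r["pope"] - r["hope"], 2) for model, r in results.items()}

for model, drop in drops.items():
    print(f"{model}: -{drop:.2f} points")
```

The gaps come out at about 23.5, 12.6, and 9.0 points respectively, with LLaVA-Next affected most.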

Who Should Care

What To Try In 7 Days

Run CLIP-based content-aware negatives on a sample of your images to find ambiguous failure modes.

Add description-based (true object + false attribute) QA to your test suite to catch relational/attribute hallucinations.

Combine category-level and image-specific distractors and measure precision drop versus current tests.
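A description-based test item pairs an object that really is in the image with an attribute it does not have, so the correct answer is always "no". The sketch below is one assumed way to generate such items; the question template, field names, and example attributes are hypothetical, and generated items would still need the manual verification noted under Limitations.

```python
def build_description_questions(obj, true_attr, false_attrs):
    """Build yes/no questions that attach a false attribute to a present object.

    The question template is an assumption made for this sketch.
    """
    questions = []
    for attr in false_attrs:
        if attr == true_attr:
            continue  # skip the attribute the object actually has
        questions.append(
            {
                "question": f"Is there a {attr} {obj} in the image?",
                "answer": "no",  # the described object does not exist as stated
            }
        )
    return questions


# Hypothetical example: the image contains a red car.
qs = build_description_questions("car", "red", ["blue", "red", "green"])
for q in qs:
    print(q["question"], "->", q["answer"])
```

A model that answers "yes" to any of these has recognized the object but hallucinated its description.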

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on CLIP as surrogate; surrogate mismatch can misrank distractors for specific LVLMs.
  • Description-based distractors need manual verification to avoid illogical pairs.
  • Focuses on object-level and attribute/state distractors; does not cover full caption-level or multimodal reasoning failures.

When Not To Use

  • When you can query the target LVLM directly and want live adversarial probing.
  • When your evaluation target is free-form caption generation rather than QA-style presence tests.
  • If you cannot afford manual verification for description-based items.

Failure Modes

  • CLIP may prefer visually similar but semantically irrelevant negatives, causing false positives.
  • Description pairs may be grammatically odd or nonsensical before human filtering.
  • Small distractor spaces miss hard negatives; too-large spaces increase compute cost.

Core Entities

Models

  • LLaVA-Next 8B
  • LLaVA-OV 7B
  • Qwen2-VL 7B
  • Qwen2.5-VL 7B
  • InternVL2.5 8B
  • InternVL3 8B
  • CLIP (used as surrogate)

Metrics

  • Precision
  • Recall
  • F1 Score
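For yes/no presence questions these metrics follow the standard binary definitions, with "yes" as the positive class. In the minimal sketch below, a "yes" on a distractor question counts as a false positive and so lowers precision; the function name and the toy answers are assumptions.

```python
def binary_metrics(predictions, labels, positive="yes"):
    """Return (precision, recall, f1) for yes/no answers."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)  # hallucinated "yes"
    fn = sum(p != positive and y == positive for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical ground truth and model answers for six questions.
labels = ["yes", "yes", "no", "no", "no", "yes"]
preds = ["yes", "yes", "yes", "no", "yes", "no"]
precision, recall, f1 = binary_metrics(preds, labels)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Harder distractors mainly add false positives, which is why the reported gaps between benchmarks show up in precision.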

Datasets

  • Objects365
  • MS-COCO
  • Visual Genome
  • OpenImages
  • Large-Scale Attribute (LSA)

Benchmarks

  • HOPE
  • POPE
  • H-POPE
  • R-Bench
  • CHAIR
  • NOPE
  • ROPE