Overview
Production Readiness
0.6
Novelty Score
0.55
Cost Impact Score
0.3
Citation Count
0
Why It Matters For Business
Existing benchmarks can understate hallucination. Use HOPE-style, image-aware distractors to better find model failures before deployment and reduce risky product behavior.
Summary TLDR
The paper introduces HOPE, a benchmark that builds harder distractors to reveal object hallucination in large vision-language models (LVLMs). It searches for distractors with three strategies: category-oriented (co-occurrence and visual similarity), content-aware (uses CLIP to pick image-specific negatives), and description-based (true object + false description). Across multiple LVLMs and datasets, HOPE lowers precision by 9–24% relative to POPE, which indicates that existing benchmarks underestimate hallucination. Code and a toolkit are provided for reproducible QA construction.
Problem Statement
Current object-hallucination tests (like POPE) sample generic negative categories and ignore image-specific ambiguity. As LVLMs improve, these simple distractors stop stressing models, so benchmarks under-report hallucination. We need a way to generate the most misleading, instance-dependent distractors to test real robustness.
Main Contribution
HOPE benchmark: formalizes hallucination evaluation as a search for distractors most likely to trigger hallucination.
Three search strategies: category-oriented (co-occurrence + visual similarity), content-aware (CLIP-based image grounding), and description-based (true object + false attribute/state).
Toolkit and data pipeline to build QA pairs, including human verification and flexible sampling.
Empirical study across Objects365, MS-COCO, Visual Genome (VG), and OpenImages showing that HOPE reveals more hallucination than POPE.
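The content-aware strategy can be pictured as a ranking problem: among object labels that are absent from an image, pick those whose embeddings sit closest to the image embedding. The sketch below is only an illustration under stated assumptions, not the paper's implementation. It assumes image and label embeddings have already been computed (for example by CLIP) and uses toy 3-dimensional vectors; the function names `cosine` and `rank_distractors` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_distractors(image_emb, label_embs, present, k=3):
    """Return the k absent labels whose embeddings are closest to the image."""
    scored = [
        (cosine(image_emb, emb), label)
        for label, emb in label_embs.items()
        if label not in present  # only absent objects can serve as negatives
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:k]]

# Toy vectors standing in for real image/text embeddings (assumption).
image_emb = [0.9, 0.1, 0.0]
label_embs = {
    "dog": [1.0, 0.0, 0.0],       # present in the image, so excluded
    "wolf": [0.8, 0.2, 0.0],
    "cat": [0.5, 0.5, 0.0],
    "airplane": [0.0, 0.0, 1.0],
}
hard_negatives = rank_distractors(image_emb, label_embs, present={"dog"}, k=2)
print(hard_negatives)
```

In this toy setup the visually close labels ("wolf", "cat") rank above the unrelated one ("airplane"), which is the behavior that makes image-specific negatives harder than randomly sampled categories.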
Key Findings
HOPE's description-based distractors lower model precision much more than POPE.
Content-aware searching (image-specific via CLIP) is the single most effective strategy.
Combining the three searching strategies produces largely complementary distractors and a stronger attack.
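To make the description-based idea concrete, the sketch below pairs an object that is truly in the image with attributes it does not have, so every generated question has the ground-truth answer "no". This is a minimal illustration only. The question template, the `needs_review` flag, and the function names are assumptions rather than details from the paper.

```python
def article(word):
    """Pick 'an' before a vowel letter, else 'a' (a rough heuristic)."""
    return "an" if word[0].lower() in "aeiou" else "a"

def build_description_questions(obj, true_attrs, attribute_pool):
    """Pair a present object with attributes it lacks; each answer is 'no'."""
    questions = []
    for attr in attribute_pool:
        if attr in true_attrs:
            continue  # a true description would not be a distractor
        questions.append({
            "question": f"Is there {article(attr)} {attr} {obj} in the image?",
            "answer": "no",
            "needs_review": True,  # hypothetical flag for human verification
        })
    return questions

# A red car is in the image; "blue" and "orange" are false descriptions.
questions = build_description_questions("car", {"red"}, ["red", "blue", "orange"])
for q in questions:
    print(q["question"], "->", q["answer"])
```

Automatically paired attributes can produce illogical combinations, so items built this way would still need the manual verification step listed under the limitations.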
Results
Precision (description-based HOPE vs POPE)
Who Should Care
What To Try In 7 Days
Run CLIP-based content-aware negatives on a sample of your images to find ambiguous failure modes.
Add description-based (true object + false attribute) QA to your test suite to catch relational/attribute hallucinations.
Combine category-level and image-specific distractors and measure precision drop versus current tests.
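Measuring the precision drop suggested above only needs the model's yes/no answers on two question sets for the same images. The sketch below assumes answers are already normalized to the strings "yes" and "no"; the function name and the toy predictions are hypothetical and are not results from the paper.

```python
def precision_recall_f1(predictions, labels):
    """Compute precision, recall, and F1 with 'yes' as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and y == "yes" for p, y in pairs)
    fp = sum(p == "yes" and y == "no" for p, y in pairs)   # hallucinated objects
    fn = sum(p == "no" and y == "yes" for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Same ground truth, two distractor sets (toy data, not measured results).
labels = ["yes", "yes", "no", "no"]
baseline_preds = ["yes", "yes", "no", "no"]   # easy negatives: no false "yes"
hard_preds = ["yes", "yes", "yes", "no"]      # hard negatives: one false "yes"

base_p, _, _ = precision_recall_f1(baseline_preds, labels)
hard_p, hard_r, hard_f1 = precision_recall_f1(hard_preds, labels)
precision_drop = base_p - hard_p
print(f"precision drop: {precision_drop:.3f}")
```

Because a false "yes" on an absent object is a false positive, hallucination shows up directly as lower precision, which is why the comparison with an existing test suite focuses on that metric.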
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on CLIP as a surrogate model; a mismatch between CLIP and the target LVLM can misrank distractors for that LVLM.
- Description-based distractors need manual verification to avoid illogical pairs.
- Focuses on object-level and attribute/state distractors; does not cover full caption-level or multimodal reasoning failures.
When Not To Use
- When you can query the target LVLM directly and want live adversarial probing.
- When your evaluation target is free-form caption generation rather than QA-style presence tests.
- If you cannot afford manual verification for description-based items.
Failure Modes
- CLIP may prefer visually similar but semantically irrelevant negatives, causing false positives.
- Description pairs may be grammatically odd or nonsensical before human filtering.
- Small distractor spaces miss hard negatives; overly large spaces increase compute cost.
Core Entities
Models
- LLaVA-Next 8B
- LLaVA-OV 7B
- Qwen2-VL 7B
- Qwen2.5-VL 7B
- InternVL2.5 8B
- InternVL3 8B
- CLIP (used as surrogate)
Metrics
- Precision
- Recall
- F1 Score
Datasets
- Objects365
- MS-COCO
- Visual Genome
- OpenImages
- Large-Scale Attribute (LSA)
Benchmarks
- HOPE
- POPE
- H-POPE
- R-Bench
- CHAIR
- NOPE
- ROPE

