Reduce LVLM hallucinations by retrieving targeted image-text pairs only when the model is uncertain

August 1, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

1

Authors

Xiaoye Qu, Qiyuan Chen, Wei Wei, Jishuo Sun, Jianfeng Dong

Why It Matters For Business

ARA reduces factually incorrect image answers without costly retraining, so products that must avoid visual misinformation (e.g., medical imaging assistants, robotics, visual search) can improve trust with modest engineering work.

Summary TLDR

The paper proposes ARA, an Active Retrieval-Augmented framework for large vision-language models (LVLMs). ARA triggers retrieval only when the model shows low certainty, retrieves both full-image and object-level (coarse-to-fine) image-text pairs, reranks results by semantic caption similarity, and fuses retrieved knowledge at the probability level. Applied to LLaVA-1.5, Qwen-VL and mPLUG-Owl2, ARA improves scores on four hallucination benchmarks (POPE, MME, MMStar, MMbench) as well as image-captioning metrics, while keeping retrieval frequency moderate. The approach trades modest latency and system complexity for consistent reductions in object- and attribute-level hallucinations.

Problem Statement

Large vision-language models often produce plausible but factually wrong text about images (hallucination). Existing fixes either require heavy retraining or are training-free but limited in effect. Directly adapting text-based retrieval augmentation to LVLMs can even worsen hallucination. The paper asks how retrieval should be designed for LVLMs so that it helps rather than hurts.

Main Contribution

A practical ARA framework that (1) decomposes retrieval targets by image hierarchy (coarse-to-fine), (2) selects and filters retrieval results with reranking, and (3) triggers retrieval only when the model is uncertain using mutual-information-based metrics.

A concrete implementation and ablation across three LVLMs (LLaVA-1.5, Qwen-VL, mPLUG-Owl2) and four hallucination-focused benchmarks (POPE, MME, MMStar, MMbench).

Empirical evidence that careful retrieval design and active triggering improve object- and attribute-level hallucination scores and text-generation metrics versus vanilla LVLMs and a state-of-the-art decoding baseline (VCD).

Key Findings

Active retrieval (ARA) improves object-presence detection on POPE for LLaVA-1.5.

Numbers: Accuracy 86.50% → 89.43% (Random setting, Table 1)

Coarse-to-fine retrieval raises attribute and overall hallucination scores on the MME subset.

Numbers: LLaVA total 605.00 → 648.33 (MME hallucination subset, Table 2)

ARA gives broad accuracy gains on a hard multimodal benchmark (MMStar).

Numbers: LLaVA average 0.321 → 0.409 (+0.088, about 8.8 percentage points; Table 3)

Reranking and active triggering reduce noisy retrieval and unnecessary calls.

Numbers: Reranking increases accuracy 86.93% → 87.17% (POPE, Table 7); the query-aware trigger was chosen for stability (Figure 5).

Text-generation quality (image captioning) improves under retrieval augmentation.

Numbers: Qwen-VL average score 43.4 → 58.3 (+14.9, Table 9); LLaVA average +5.0, mPLUG average +3.2

Results

POPE accuracy (LLaVA-1.5, random)

Value: 89.43%

Baseline: 86.50%

POPE F1 (mPLUG-Owl2, random)

Value: 89.01

Baseline: 84.38

MME total score (LLaVA-1.5)

Value: 648.33

Baseline: 605.00

MMStar average (LLaVA-1.5)

Value: 0.409

Baseline: 0.321

MMbench object localization (LLaVA-1.5)

Value: 0.654

Baseline: 0.6032

Caption quality average (Qwen-VL)

Value: 58.3

Baseline: 43.4
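
To put the rows above on a common footing, the short sketch below computes the absolute and relative gain for each reported value/baseline pair. The numbers are copied from the results above; the dictionary keys and the `gains` helper are names chosen here for illustration only.

```python
# (baseline, value) pairs copied from the reported results.
RESULTS = {
    "POPE accuracy % (LLaVA-1.5, random)": (86.50, 89.43),
    "POPE F1 (mPLUG-Owl2, random)": (84.38, 89.01),
    "MME total score (LLaVA-1.5)": (605.00, 648.33),
    "MMStar average (LLaVA-1.5)": (0.321, 0.409),
    "MMbench object localization (LLaVA-1.5)": (0.6032, 0.654),
    "Caption quality average (Qwen-VL)": (43.4, 58.3),
}


def gains(baseline, value):
    """Return (absolute gain, relative gain as a percentage of the baseline)."""
    absolute = value - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


if __name__ == "__main__":
    for name, (baseline, value) in RESULTS.items():
        absolute, relative = gains(baseline, value)
        print(f"{name}: +{absolute:.4g} absolute, +{relative:.1f}% relative")
```

Relative gains are useful here because the benchmarks use different scales (percentages, a 0 to 1 average, and an additive MME score), so absolute deltas are not directly comparable across rows.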

Who Should Care

What To Try In 7 Days

Add a query-aware trigger (mutual-information threshold) to call retrieval only when model uncertainty is high.

Implement coarse-to-fine retrieval: image-level CLIP search plus object crops using a grounding detector (Grounding DINO).

Rerank retrieved candidates by caption-caption similarity and fuse retrieved text at the probability level.
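
The first step above, the uncertainty-gated trigger, might be prototyped as follows. This summary does not give the paper's exact mutual-information metric, so the sketch swaps in a simple KL-divergence proxy: it compares the answer distribution with the image to the distribution without it, and retrieves when the image barely changes the answer. The function names, the choice of proxy, and the `threshold` value are all assumptions for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def should_retrieve(p_with_image, p_without_image, threshold=0.1):
    """Hypothetical trigger (assumption, not the paper's metric).

    A small divergence means the image adds little information to the
    answer, which is treated here as a sign of weak visual grounding,
    so retrieval is triggered.
    """
    return kl_divergence(p_with_image, p_without_image) < threshold


if __name__ == "__main__":
    # Toy yes/no answer distributions.
    print(should_retrieve([0.55, 0.45], [0.5, 0.5]))  # image barely matters
    print(should_retrieve([0.95, 0.05], [0.5, 0.5]))  # image is decisive
```

In practice the threshold would need tuning on held-out data, since too many triggers add latency and possibly conflicting context, while too few forgo the benefit of retrieval.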

Agent Features

Tool Use

  • CLIP embedding search
  • Grounding DINO for object cropping
  • LLaMA2-7B for entity extraction
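
How these tools could fit together for coarse-to-fine retrieval is sketched below. The sketch assumes embeddings have already been computed elsewhere (for example by a CLIP image encoder for the full image and for each detector crop) and only shows the nearest-neighbour search over them. The function names and the in-memory list used as a database are hypothetical; a real system would likely use a vector index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, database, k=3):
    """Return the k best (score, caption) pairs, highest score first.

    database is a list of (embedding, caption) tuples.
    """
    scored = [(cosine(query_vec, emb), caption) for emb, caption in database]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def coarse_to_fine(image_vec, crop_vecs, database, k=3):
    """Retrieve once for the whole image and once per object crop."""
    coarse = top_k(image_vec, database, k)
    fine = [top_k(vec, database, k) for vec in crop_vecs]
    return coarse, fine


if __name__ == "__main__":
    db = [
        ([1.0, 0.0], "a dog on grass"),
        ([0.0, 1.0], "a red car"),
        ([1.0, 1.0], "a dog next to a car"),
    ]
    coarse, fine = coarse_to_fine([1.0, 1.0], [[1.0, 0.1]], db, k=1)
    print(coarse, fine)
```

The default `k=3` sits at the low end of the 3 to 5 retrieved pairs mentioned in this summary; the best value is reported to depend on the model.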

Frameworks

  • ARA (Active Retrieval-Augmented LVLM)

Architectures

  • LVLM + retrieval augmentation (ARA)

Optimization Features

Token Efficiency

  • Limit retrieval to 3–5 pairs; tuned per model (Fig. 3, Sec. 5.1)

System Optimization

  • Reranking to reduce noisy retrieved items and improve precision
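
A minimal sketch of caption-based reranking follows. It assumes the reference is a draft caption for the query image and uses token-level Jaccard overlap as a stand-in for the semantic similarity model, which this summary does not specify. The function names and the `keep` and `min_sim` parameters are hypothetical.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two captions (a stand-in
    for an embedding-based semantic similarity)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def rerank(query_caption, retrieved_captions, keep=3, min_sim=0.0):
    """Keep the captions most similar to the query caption, best first.

    Captions scoring below min_sim are dropped as likely noise.
    """
    scored = [(jaccard(query_caption, c), c) for c in retrieved_captions]
    scored = [item for item in scored if item[0] >= min_sim]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [caption for _, caption in scored[:keep]]


if __name__ == "__main__":
    candidates = [
        "a red car on a road",
        "a dog running on grass",
        "a brown dog lying on grass",
    ]
    print(rerank("a brown dog on grass", candidates, keep=2))
```

Word overlap misses paraphrases, so a sentence-embedding similarity would be the more realistic choice; the control flow stays the same.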

Training Optimization

  • SFT

Inference Optimization

  • Active triggering to reduce unnecessary retrievals
  • Probability-level fusion to combine coarse and fine retrieval efficiently
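
One way to read fusion at the probability level is as a weighted mixture of next-token distributions: one from the model without retrieval, one conditioned on coarse (image-level) results, and one conditioned on fine (object-level) results. The sketch below shows that mixture. The linear form and the weights are assumptions for illustration; the paper's exact fusion rule is not reproduced in this summary.

```python
def fuse(distributions, weights):
    """Weighted average of equal-length probability distributions.

    Weights are normalised to sum to 1, so the result is again a
    valid distribution whenever each input is.
    """
    if len(distributions) != len(weights):
        raise ValueError("need exactly one weight per distribution")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    size = len(distributions[0])
    if any(len(d) != size for d in distributions):
        raise ValueError("all distributions must have the same length")
    return [
        sum(w * d[i] for d, w in zip(distributions, weights)) / total
        for i in range(size)
    ]


if __name__ == "__main__":
    base = [0.6, 0.4]    # no retrieval: P(yes), P(no)
    coarse = [0.3, 0.7]  # with image-level retrieval
    fine = [0.2, 0.8]    # with object-level retrieval
    print(fuse([base, coarse, fine], [0.5, 0.25, 0.25]))
```

Mixing output distributions, rather than concatenating every retrieved caption into one prompt, keeps each forward pass short and lets the coarse and fine branches be weighted independently.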

Reproducibility

Data Urls

  • MSCOCO (public)
  • VisualGenome (public)
  • POPE, MME, MMStar, MMbench (public/benchmark sources)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Requires a relevant image-text retrieval database; poor coverage will limit benefits.
  • Reranking and grounding steps add latency and system complexity.
  • Performance sensitive to retrieval configuration (embeddings, number of pairs, fusion weights).
  • Reranking cannot fully eliminate noisy captions or mismatched semantics from the retrieval pool.

When Not To Use

  • When you lack a suitably large, domain-relevant retrieval database.
  • When tight low-latency constraints prohibit extra retrieval and reranking.
  • When the LVLM already has high confidence and strong factual grounding on your domain.

Failure Modes

  • Grounding fails to locate the object → fine-grained retrieval cannot help.
  • Retrieved captions are visually similar but semantically wrong → reranking may not filter all noise.
  • Excessive retrieval triggers degrade performance by adding redundant or conflicting information.
  • Model over-reliance on retrieved but incorrect external captions.

Core Entities

Models

  • LLaVA-1.5
  • Qwen-VL
  • mPLUG-Owl2
  • LLaMA2-7B (used for entity extraction)

Metrics

  • Accuracy
  • Precision
  • Recall
  • F1 score
  • MMStar average
  • MME total score
  • BLEU-4
  • CIDEr
  • METEOR
  • ROUGE-L
  • SPICE
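
For the classification-style metrics in this list, a minimal reference computation is sketched below, assuming yes/no answers as in object-presence questions. The function name and the `positive` label are chosen here for illustration; captioning metrics such as BLEU-4 or CIDEr need dedicated tooling and are not covered.

```python
def binary_metrics(predictions, labels, positive="yes"):
    """Accuracy, precision, recall and F1 for yes/no answers."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p == positive and l == positive)
    fp = sum(1 for p, l in pairs if p == positive and l != positive)
    fn = sum(1 for p, l in pairs if p != positive and l == positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    preds = ["yes", "yes", "no", "no", "yes"]
    gold = ["yes", "no", "no", "yes", "yes"]
    print(binary_metrics(preds, gold))
```

A model that hallucinates objects tends to answer "yes" too often, which shows up as lower precision at high recall, so reporting all four numbers is more informative than accuracy alone.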

Datasets

  • MSCOCO
  • VisualGenome
  • POPE
  • MME
  • MMStar
  • MMbench
  • A-OKVQA
  • GQA

Benchmarks

  • POPE
  • MME
  • MMStar
  • MMbench