Reduce LVLM hallucinations by retrieving targeted image-text pairs only when the model is uncertain

August 1, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The method uses standard components (CLIP, grounding detector, captioning) and public datasets; experiments on three LVLMs across four benchmarks show consistent gains, but gains depend on retrieval configuration and dataset coverage.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Xiaoye Qu, Qiyuan Chen, Wei Wei, Jishuo Sun, Jianfeng Dong

Links

Abstract / PDF / Data

Why It Matters For Business

ARA reduces factually incorrect image answers without costly retraining, so products that must avoid visual misinformation (e.g., medical imaging assistants, robotics, visual search) can improve trust with modest engineering work.

Who Should Care

Summary TLDR

The paper proposes ARA, an Active Retrieval-Augmented framework for large vision-language models (LVLMs). ARA triggers retrieval only when the model shows low certainty, retrieves both full-image and object-level (coarse-to-fine) image-text pairs, reranks results by semantic caption similarity, and fuses retrieved knowledge at the probability level. Applied to LLaVA-1.5, Qwen-VL and mPLUG-Owl2 on four hallucination benchmarks, ARA raises detection and caption metrics (notably POPE, MME, MMStar, MMbench) while keeping retrieval frequency moderate. The approach trades modest latency and system complexity for robust reductions in object- and attribute-level hallucinations.

Problem Statement

Large vision-language models often output plausible but factually wrong text about images (hallucination). Existing fixes either require heavy retraining or are training-free but limited in effect. Directly adapting text-based retrieval to LVLMs can sometimes worsen hallucination. The paper asks: how should retrieval be designed for LVLMs so that it helps rather than hurts?

Main Contribution

A practical ARA framework that (1) decomposes retrieval targets by image hierarchy (coarse-to-fine), (2) selects and filters retrieval results with reranking, and (3) triggers retrieval only when the model is uncertain using mutual-information-based metrics.

A concrete implementation and ablation across three LVLMs (LLaVA-1.5, Qwen-VL, mPLUG-Owl2) and four hallucination-focused benchmarks (POPE, MME, MMStar, MMbench).
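The uncertainty trigger in point (3) can be illustrated with a small sketch. This summary does not give the paper's exact mutual-information formula, so the version below is an assumption: it measures how much the image shifts the model's answer distribution (KL divergence between the distribution with and without the image) and calls retrieval when that shift is small. The function names, the toy distributions, and the 0.1 threshold are all hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same answers."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def should_retrieve(p_with_image, p_text_only, threshold=0.1):
    """Return True when the image adds little information to the answer.

    Assumption (not from the paper): a small shift between the two
    distributions means the model leans on its language prior, so
    retrieval is worth the extra cost.
    """
    return kl_divergence(p_with_image, p_text_only) < threshold

# Toy yes/no answer distributions (made-up numbers).
confident = should_retrieve([0.95, 0.05], [0.50, 0.50])  # image changes the answer a lot
uncertain = should_retrieve([0.55, 0.45], [0.50, 0.50])  # image barely matters
print(confident, uncertain)
```

In practice the threshold would need tuning per model and benchmark, since it controls the trade-off between retrieval frequency and accuracy.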

Key Findings

Active retrieval (ARA) improves object-presence detection on POPE for LLaVA-1.5.

Numbers: Accuracy 86.50% → 89.43% (Random setting, Table 1)

Practical Use: Add ARA to LLaVA-like models to increase object-detection accuracy by ~3 points on POPE-style tests; expect fewer missed object detections.

Evidence Ref: Table 1 (POPE, LLaVA-1.5, Random)

Coarse-to-fine retrieval raises attribute and overall hallucination scores on the MME subset.

Numbers: LLaVA total 605.00 → 648.33 (MME hallucination subset, Table 2)

Practical Use: Use object-level (fine) retrieval plus coarse retrieval to improve attribute judgments (color/position) and overall factual consistency on image reasoning benchmarks.

Evidence Ref: Table 2 (MME hallucination subset)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 89.43% | 86.50% | +2.93 pp | POPE (random) | Table 1 POPE results | Table 1
POPE F1 (mPLUG-Owl2, random) | 89.01 | 84.38 | +4.63 | POPE (random) | Table 1 POPE results | Table 1
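As a quick sanity check, the reported deltas can be recomputed from the baseline and ARA values. The snippet below uses only numbers quoted in this summary (the two rows above plus the MME total from Key Findings); the row labels are shorthand chosen for illustration.

```python
# (label, baseline, with ARA), all values copied from this summary.
results = [
    ("POPE accuracy, LLaVA-1.5 (random)", 86.50, 89.43),
    ("POPE F1, mPLUG-Owl2 (random)", 84.38, 89.01),
    ("MME hallucination total, LLaVA", 605.00, 648.33),
]

# Absolute gain per row, rounded to two decimals to hide float noise.
deltas = {label: round(new - base, 2) for label, base, new in results}

for label, delta in deltas.items():
    print(f"{label}: +{delta:.2f}")
```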

What To Try In 7 Days

Add a query-aware trigger (mutual-information threshold) to call retrieval only when model uncertainty is high.

Implement coarse-to-fine retrieval: image-level CLIP search plus object crops using a grounding detector (Grounding DINO).

Rerank retrieved candidates by caption-caption similarity and fuse retrieved text at the probability level.
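The retrieval and reranking steps above can be prototyped before wiring in real models. The sketch below is a toy under stated assumptions: short hand-written vectors stand in for CLIP embeddings, the list of crop vectors stands in for Grounding DINO output, and word-overlap (Jaccard) similarity stands in for the semantic caption similarity used for reranking. All function names and data are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, database, k):
    """Return the k (embedding, caption) pairs closest to query_vec."""
    ranked = sorted(database, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return ranked[:k]

def jaccard(a, b):
    """Word-overlap similarity; a stand-in for a semantic caption similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def coarse_to_fine(image_vec, crop_vecs, query_caption, database, k=3):
    """Coarse (whole image) plus fine (per object crop) retrieval, then rerank."""
    candidates = top_k(image_vec, database, k)
    for crop_vec in crop_vecs:  # empty when grounding finds no object
        candidates += top_k(crop_vec, database, k)
    # Drop duplicate captions while keeping first-seen order.
    unique = list(dict.fromkeys(caption for _, caption in candidates))
    # Rerank by similarity between the query image's caption and each retrieved caption.
    return sorted(unique, key=lambda c: jaccard(query_caption, c), reverse=True)[:k]

database = [
    ([1.0, 0.0], "a red car parked on a street"),
    ([0.9, 0.1], "a red bus on a street"),
    ([0.0, 1.0], "a dog sitting on grass"),
    ([0.1, 0.9], "a brown dog on a lawn"),
]
query_caption = "a red car and a dog on a street"
result = coarse_to_fine([1.0, 0.05], [[0.05, 1.0]], query_caption, database, k=2)
coarse_only = coarse_to_fine([1.0, 0.05], [], query_caption, database, k=2)
print(result)
```

Passing an empty crop list exercises the fallback path where grounding finds nothing and only coarse retrieval contributes.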

Agent Features

Tool Use
CLIP embedding search; Grounding DINO for object cropping; LLaMA2-7B for entity extraction
Frameworks
ARA (Active Retrieval-Augmented LVLM)
Architectures
LVLM + retrieval augmentation (ARA)

Optimization Features

Token Efficiency
Limit retrieval to 3–5 pairs; tuned per model (Fig.3, Sec.5.1)
System Optimization
Reranking to reduce noisy retrieved items and improve precision
Training Optimization
SFT
Inference Optimization
Active triggering to reduce unnecessary retrievals; Probabilistic-level fusion to combine coarse and fine retrieval efficiently
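Probability-level fusion can be pictured as combining next-token distributions rather than concatenating retrieved text into one prompt. This summary does not state the paper's fusion formula, so the sketch below assumes a simple weighted mixture of three distributions: the base model output, the output conditioned on coarse retrievals, and the output conditioned on fine retrievals. The weights and probabilities are made-up illustration values.

```python
def fuse(distributions, weights):
    """Weighted mixture of token distributions given as {token: probability} dicts.

    Assumption: a linear mixture; the paper's actual fusion rule may differ.
    """
    total_weight = sum(weights)
    tokens = set().union(*distributions)
    return {
        token: sum(w * d.get(token, 0.0) for d, w in zip(distributions, weights)) / total_weight
        for token in tokens
    }

# Toy answer distributions for a yes/no object-presence question.
base = {"yes": 0.55, "no": 0.45}    # model alone, weakly says "yes"
coarse = {"yes": 0.30, "no": 0.70}  # with image-level retrievals in context
fine = {"yes": 0.20, "no": 0.80}    # with object-level retrievals in context

fused = fuse([base, coarse, fine], weights=[1.0, 0.5, 0.5])
answer = max(fused, key=fused.get)
print(fused, answer)
```

Here the retrieved evidence outweighs the weak base preference, so the fused answer flips from "yes" to "no".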

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

MSCOCO (public); VisualGenome (public); POPE, MME, MMStar, MMbench (public/benchmark sources)

Risks & Boundaries

Limitations

Requires a relevant image-text retrieval database; poor coverage will limit benefits.

Reranking and grounding steps add latency and system complexity.

When Not To Use

When you lack a suitably large, domain-relevant retrieval database.

When tight low-latency constraints prohibit extra retrieval and reranking.

Failure Modes

Grounding fails to locate the object → fine-grained retrieval cannot help.

Retrieved captions are visually similar but semantically wrong → reranking may not filter all noise.

Core Entities

Models

LLaVA-1.5, Qwen-VL, mPLUG-Owl2, LLaMA2-7B (used for entity extraction)

Metrics

Accuracy, Precision, Recall, F1 score, MMStar average, MME total score, BLEU-4, CIDEr, METEOR, ROUGE-L, SPICE

Datasets

MSCOCO, VisualGenome, POPE, MME, MMStar, MMbench, A-OKVQA, GQA

Benchmarks

POPE, MME, MMStar, MMbench