Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
ARA reduces factually incorrect image answers without costly retraining, so products that must avoid visual misinformation (e.g., medical imaging assistants, robotics, visual search) can improve trust with modest engineering work.
Summary TLDR
The paper proposes ARA, an Active Retrieval-Augmented framework for large vision-language models (LVLMs). ARA triggers retrieval only when the model shows low certainty, retrieves both full-image and object-level (coarse-to-fine) image-text pairs, reranks results by semantic caption similarity, and fuses retrieved knowledge at the probability level. Applied to LLaVA-1.5, Qwen-VL and mPLUG-Owl2 on four hallucination benchmarks (POPE, MME, MMStar, MMbench), ARA improves detection and captioning metrics while keeping retrieval frequency moderate. The approach trades modest added latency and system complexity for robust reductions in object- and attribute-level hallucinations.
Problem Statement
Large vision-language models often output plausible but factually wrong text about images (hallucination). Existing fixes either require heavy retraining or are training-free but limited. Directly adapting text-based retrieval to LVLMs can sometimes worsen hallucination. The paper asks how retrieval for LVLMs should be designed so that it helps rather than hurts.
Main Contribution
A practical ARA framework that (1) decomposes retrieval targets by image hierarchy (coarse-to-fine), (2) selects and filters retrieval results with reranking, and (3) uses mutual-information-based metrics to trigger retrieval only when the model is uncertain.
A concrete implementation and ablation across three LVLMs (LLaVA-1.5, Qwen-VL, mPLUG-Owl2) and four hallucination-focused benchmarks (POPE, MME, MMStar, MMbench).
Empirical evidence that careful retrieval design and active triggering improve object- and attribute-level hallucination scores and text-generation metrics versus vanilla LVLMs and a state-of-the-art decoding baseline (VCD).
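The active trigger in item (3) can be illustrated with a small sketch. The paper's exact mutual-information metric is not reproduced here. This version assumes, purely for illustration, that a small divergence between the image-conditioned and text-only answer distributions means the image is contributing little information, so retrieval is worth calling. The function names and the threshold value are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same answers."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def should_retrieve(p_with_image, p_without_image, threshold=0.1):
    """Hypothetical trigger: retrieve when the image barely shifts the prediction.

    p_with_image:    answer distribution given image and question
    p_without_image: answer distribution given the question only
    threshold:       assumed cutoff, to be tuned per model and task
    """
    information_gain = kl_divergence(p_with_image, p_without_image)
    return information_gain < threshold


# The image strongly changes the answer, so no retrieval is requested.
confident = should_retrieve([0.9, 0.1], [0.5, 0.5])
# The image barely changes the answer, so retrieval is requested.
uncertain = should_retrieve([0.52, 0.48], [0.5, 0.5])
```

In practice the threshold would be chosen on a validation set, since a value that is too high triggers retrieval on nearly every query.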
Key Findings
Active retrieval (ARA) improves object-presence detection on POPE for LLaVA-1.5.
Coarse-to-fine retrieval improves attribute-level and overall scores on the MME hallucination subset.
ARA gives broad accuracy gains on a hard multimodal benchmark (MMStar).
Reranking and active triggering reduce noisy retrieval and unnecessary calls.
Text-generation quality (image captioning) improves under retrieval augmentation.
Results
Accuracy
POPE F1 (mPLUG-Owl2, random)
MME total score (LLaVA-1.5)
MMStar average (LLaVA-1.5)
MMbench object localization (LLaVA-1.5)
Caption quality average (Qwen-VL)
Who Should Care
What To Try In 7 Days
Add a query-aware trigger (mutual-information threshold) to call retrieval only when model uncertainty is high.
Implement coarse-to-fine retrieval: image-level CLIP search plus object crops using a grounding detector (Grounding DINO).
Rerank retrieved candidates by caption-caption similarity and fuse retrieved text at the probability level.
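The retrieval and reranking steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: embeddings are assumed to be precomputed (for example by CLIP for the full image and for crops produced by a grounding detector), the reference caption is assumed to come from the LVLM itself, and token-overlap (Jaccard) similarity stands in for whatever semantic caption similarity the authors use. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_emb, database, k):
    """Return the captions of the k database entries closest to query_emb.

    database is a list of (embedding, caption) pairs.
    """
    ranked = sorted(database, key=lambda item: cosine(query_emb, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]


def jaccard(a, b):
    """Token-overlap similarity; a stand-in for a semantic caption similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def coarse_to_fine(image_emb, crop_embs, database, reference_caption, k=5, keep=3):
    """Image-level search plus per-crop search, then caption-based reranking."""
    coarse = top_k(image_emb, database, k)
    fine = [c for emb in crop_embs for c in top_k(emb, database, k)]
    # Deduplicate while preserving first-seen order.
    candidates = list(dict.fromkeys(coarse + fine))
    reranked = sorted(candidates, key=lambda c: jaccard(reference_caption, c), reverse=True)
    return reranked[:keep]


database = [
    ([1.0, 0.0], "a dog on grass"),
    ([0.0, 1.0], "a red car"),
    ([0.9, 0.1], "a dog running on grass"),
]
result = coarse_to_fine(
    image_emb=[1.0, 0.0],
    crop_embs=[[0.0, 1.0]],
    database=database,
    reference_caption="a dog on the grass",
    k=2,
    keep=2,
)
```

Here the crop search surfaces "a red car", but reranking against the reference caption drops it, which is the noise-filtering role the reranker is meant to play.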
Agent Features
Tool Use
- CLIP embedding search
- Grounding DINO for object cropping
- LLaMA2-7B for entity extraction
Frameworks
- ARA (Active Retrieval-Augmented LVLM)
Architectures
- LVLM + retrieval augmentation (ARA)
Optimization Features
Token Efficiency
- Limit retrieval to 3–5 pairs; tuned per model (Fig.3, Sec.5.1)
System Optimization
- Reranking to reduce noisy retrieved items and improve precision
Training Optimization
- SFT
Inference Optimization
- Active triggering to reduce unnecessary retrievals
- Probability-level fusion to combine coarse and fine retrieval efficiently
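Fusing at the probability level can be pictured as mixing answer distributions rather than concatenating retrieved text into one prompt. The sketch below assumes a simple weighted average of three distributions (no retrieval, coarse retrieval, fine retrieval). The paper's actual fusion rule and weights are not given in this summary, so `fuse` and its default weights are hypothetical.

```python
def fuse(p_base, p_coarse, p_fine, w_coarse=0.25, w_fine=0.25):
    """Weighted mixture of three answer distributions over the same vocabulary.

    p_base:   distribution from the LVLM without retrieval
    p_coarse: distribution conditioned on image-level retrieved pairs
    p_fine:   distribution conditioned on object-level retrieved pairs
    The weights are assumed values and would need tuning per model.
    """
    w_base = 1.0 - w_coarse - w_fine
    if w_base < 0:
        raise ValueError("w_coarse + w_fine must not exceed 1")
    mixed = [
        w_base * b + w_coarse * c + w_fine * f
        for b, c, f in zip(p_base, p_coarse, p_fine)
    ]
    total = sum(mixed)
    return [m / total for m in mixed]  # renormalize against rounding drift


# Binary "yes"/"no" example: the base model leans "yes",
# both retrieval-conditioned passes lean "no".
fused = fuse([0.6, 0.4], [0.2, 0.8], [0.1, 0.9])
```

With these inputs the fused distribution favors "no", showing how retrieved evidence can overturn a weakly held base prediction without retraining.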
Reproducibility
Data Urls
- MSCOCO (public)
- VisualGenome (public)
- POPE, MME, MMStar, MMbench (public/benchmark sources)
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Requires a relevant image-text retrieval database; poor coverage will limit benefits.
- Reranking and grounding steps add latency and system complexity.
- Performance sensitive to retrieval configuration (embeddings, number of pairs, fusion weights).
- Reranking cannot fully eliminate noisy captions or mismatched semantics from the retrieval pool.
When Not To Use
- When you lack a suitably large, domain-relevant retrieval database.
- When tight low-latency constraints prohibit extra retrieval and reranking.
- When the LVLM already has high confidence and strong factual grounding on your domain.
Failure Modes
- Grounding fails to locate the object → fine-grained retrieval cannot help.
- Retrieved captions are visually similar but semantically wrong → reranking may not filter all noise.
- Excessive retrieval triggers degrade performance by adding redundant or conflicting information.
- Model over-reliance on retrieved but incorrect external captions.
Core Entities
Models
- LLaVA-1.5
- Qwen-VL
- mPLUG-Owl2
- LLaMA2-7B (used for entity extraction)
Metrics
- Accuracy
- Precision
- Recall
- F1 score
- MMStar average
- MME total score
- BLEU-4
- CIDEr
- METEOR
- ROUGE-L
- SPICE
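For yes/no object-presence questions of the kind POPE uses, the first four metrics above follow from a confusion matrix. A minimal sketch, assuming answers have already been normalized to the strings "yes" and "no" (the function name is hypothetical):

```python
def binary_metrics(predictions, labels, positive="yes"):
    """Accuracy, precision, recall and F1 for binary yes/no answers."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p == positive and l == positive)
    fp = sum(1 for p, l in pairs if p == positive and l != positive)
    fn = sum(1 for p, l in pairs if p != positive and l == positive)
    tn = sum(1 for p, l in pairs if p != positive and l != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


scores = binary_metrics(
    predictions=["yes", "yes", "no", "no"],
    labels=["yes", "no", "yes", "no"],
)
```

A model that hallucinates objects tends to answer "yes" too often, which shows up as low precision with "yes" as the positive class.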
Datasets
- MSCOCO
- VisualGenome
- POPE
- MME
- MMStar
- MMbench
- A-OKVQA
- GQA
Benchmarks
- POPE
- MME
- MMStar
- MMbench

