Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
2
Why It Matters For Business
POVID reduces image-driven hallucination and raises overall VLLM reliability while avoiding costly human preference annotation, enabling faster, cheaper deployment of multimodal assistants.
Summary TLDR
POVID fine-tunes vision-language models by creating dispreferred (intentionally wrong) answers with AI and noisy images, then applying Direct Preference Optimization (DPO). Dispreferred examples come from GPT-4V editing of ground-truth text and from model responses to slightly distorted images. Applied to LLaVA-1.5 (7B), POVID cuts object-hallucination scores (CHAIR S) from 66.8 to 31.8 and raises overall LLaVA-Bench from 63.4 to 68.7, while training on a single A100 GPU in hours. The method is scalable, needs no human labeling, and works best when both text-hallucination and image-distortion steps are used.
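The pairing described above (ground-truth answer as preferred, AI-edited answer as dispreferred) can be sketched as a small data structure. This is a minimal illustration, not the authors' code: `PreferencePair`, `build_pairs`, and `toy_hallucinate` are hypothetical names, and `toy_hallucinate` is a stand-in for the GPT-4V editing step, which is not called here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference example: same image and prompt, two competing answers."""
    image_path: str
    prompt: str
    preferred: str      # ground-truth answer
    dispreferred: str   # edited answer containing an injected hallucination


def build_pairs(records, hallucinate):
    """Turn (image_path, prompt, answer) records into preference pairs.

    `hallucinate` is a hypothetical callable standing in for the editing
    model; it maps a correct answer to a plausible but wrong one.
    """
    pairs = []
    for image_path, prompt, answer in records:
        edited = hallucinate(answer)
        if edited == answer:
            continue  # an unchanged answer carries no preference signal
        pairs.append(PreferencePair(image_path, prompt, answer, edited))
    return pairs


def toy_hallucinate(answer):
    # Placeholder edit (assumption): append an object that is not in the image.
    return answer.replace("in the park", "in the park next to a bench")


records = [
    ("img_001.jpg", "Describe the image.", "A man walks a dog in the park."),
    ("img_002.jpg", "Describe the image.", "Two cats sleep on a sofa."),
]
pairs = build_pairs(records, toy_hallucinate)
```

Dropping unchanged pairs is a design choice of this sketch: a pair whose two answers are identical gives the optimizer nothing to prefer.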
Problem Statement
Vision LLMs hallucinate, describing objects or relations that are not in the image, because image and text representations are poorly aligned. Human feedback helps but is costly. We need a scalable way to teach VLLMs to prefer image-grounded answers over plausible but incorrect text-driven answers.
Main Contribution
POVID: a preference fine-tuning pipeline that uses only AI-generated dispreferences plus ground-truth preferred answers.
Two automated dispreference sources: (1) GPT-4V edits that inject plausible hallucinations into correct answers; (2) on-the-fly image distortion that triggers the model's inherent hallucination patterns.
Integration of both dispreferences into Direct Preference Optimization (DPO) for stable, lightweight fine-tuning.
Empirical gains across multiple hallucination and comprehensive VLLM benchmarks with modest compute; code and data released.
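For readers unfamiliar with DPO, the per-pair loss can be sketched in a few lines. This is the standard DPO loss, not necessarily the exact objective POVID optimizes when it combines both dispreference sources; the function names and the default `beta=0.1` are assumptions made for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_pref, policy_logp_disp,
             ref_logp_pref, ref_logp_disp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # How much more the policy favors each answer than the reference does.
    pref_margin = policy_logp_pref - ref_logp_pref
    disp_margin = policy_logp_disp - ref_logp_disp
    logits = beta * (pref_margin - disp_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)


# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy shifted toward the preferred answer: the loss drops below log(2).
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss falls as the policy raises the preferred answer's likelihood relative to the reference and lowers the dispreferred one's, which is the behavior the pipeline relies on.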
Key Findings
POVID substantially reduces object hallucination on captioning benchmarks: on LLaVA-1.5 (7B), CHAIR S drops from 66.8 to 31.8.
Combining AI-text hallucination and image distortion works better than either alone.
POVID improves general VLLM performance beyond hallucination metrics: overall LLaVA-Bench rises from 63.4 to 68.7.
Training is low-cost in hardware terms: it runs on a single A100 GPU in hours.
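To put the reported scores in relative terms, the arithmetic can be worked through as below. The input numbers are the ones quoted in the summary (CHAIR S 66.8 to 31.8, LLaVA-Bench 63.4 to 68.7); `relative_change` is a hypothetical helper name.

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after`."""
    return (after - before) / before


chair_s_change = relative_change(66.8, 31.8)      # lower is better
llava_bench_change = relative_change(63.4, 68.7)  # higher is better

print(f"CHAIR S: {chair_s_change:+.1%}")
print(f"LLaVA-Bench: {llava_bench_change:+.1%}")
```

That is roughly a 52% relative reduction in CHAIR S and about an 8% relative gain on LLaVA-Bench.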
Results
CHAIR S (lower is better): 66.8 to 31.8 on LLaVA-1.5 (7B)
CHAIR I (lower is better)
POPE (higher is better)
LLaVA-Bench (higher is better): 63.4 to 68.7 on LLaVA-1.5 (7B)
Who Should Care
What To Try In 7 Days
Generate ~10k AI-edited dispreferred captions from a small ground-truth set using GPT-4V.
Implement DPO fine-tuning on your VLLM backbone for a few epochs on one A100-equivalent GPU.
Add lightweight image distortion during training to capture the model's inherent hallucinations.
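The image-distortion step in the list above can be prototyped with simple additive noise. This is a sketch under stated assumptions: the summary only says images are "slightly distorted", so the Gaussian noise, the `sigma` parameter, and the nested-list grayscale image format are all illustrative choices rather than the method's actual settings.

```python
import random


def add_gaussian_noise(image, sigma, seed=None):
    """Return a noisy copy of `image`, a nested list of pixels in [0, 1].

    `sigma` is the noise standard deviation (assumed knob); larger values
    distort the image more. The input image is not modified.
    """
    rng = random.Random(seed)
    noisy = []
    for row in image:
        # Add zero-mean noise to each pixel and clamp back into [0, 1].
        noisy.append([min(1.0, max(0.0, px + rng.gauss(0.0, sigma)))
                      for px in row])
    return noisy


image = [[0.0, 0.5, 1.0],
         [0.2, 0.4, 0.6]]
slightly_distorted = add_gaussian_noise(image, sigma=0.05, seed=0)
```

In a training loop, the model's answer to the distorted image would serve as the dispreferred response. Starting with a small `sigma` and increasing it gradually is a reasonable way to look for the point where the benefit starts to fall off.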
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on access to a strong multimodal editor (GPT-4V); quality depends on that model.
- Evaluations are on the LLaVA-1.5 backbone; results may vary on other architectures.
- Choice and magnitude of image noise matter; poor settings can reduce benefit.
When Not To Use
- You lack access to GPT-4V or similar multimodal editors.
- You need human-verified labels for legal or medical auditing.
- Your deployment cannot tolerate AI-generated negative examples without review.
Failure Modes
- Overfitting to the synthetic dispreferences, harming unrelated behaviors.
- Generated dispreferences could introduce systemic biases if GPT-4V is biased.
- Image-noise settings that are too strong may push the model away from valid visual cues.
Core Entities
Models
- LLaVA-1.5 (7B)
- Vicuna (7B)
- GPT-4V
- ViT-L
Metrics
- CHAIR S
- CHAIR I
- POPE
- MMHal
- SciQA-IMG
- LLaVA-Bench
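As a reading aid for the two CHAIR variants listed above: CHAIR S is the fraction of captions that contain at least one hallucinated object, and CHAIR I is the fraction of all mentioned objects that are hallucinated. The sketch below assumes object names have already been extracted from each caption and matched to the annotation vocabulary; real evaluations also handle synonym mapping, which is omitted here. `chair_scores` is a hypothetical helper name.

```python
def chair_scores(captions):
    """Compute (CHAIR_S, CHAIR_I) over a list of captions.

    Each caption is a pair (mentioned_objects, ground_truth_objects),
    both given as collections of object names.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, truth in captions:
        truth = set(truth)
        wrong = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(wrong)
        if wrong:
            captions_with_hallucination += 1
    chair_s = captions_with_hallucination / len(captions) if captions else 0.0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    return chair_s, chair_i


captions = [
    (["dog", "frisbee"], ["dog", "person"]),  # "frisbee" is hallucinated
    (["cat", "sofa"], ["cat", "sofa"]),       # fully grounded
]
chair_s, chair_i = chair_scores(captions)
```

In this toy input, one of two captions contains a hallucination and one of four mentioned objects is wrong, so the function returns 0.5 and 0.25 as fractions.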
Datasets
- LLaVA-Instruct-150K (17K subset for dispreferences)
- SciQA-IMG
- MMBench
- MM-Vet
- LLaVA-Bench
- MME
Benchmarks
- CHAIR (S and I)
- POPE
- MMHal
- SciQA-IMG
- MM-Vet
- MMBench
- LLaVA-Bench

