Overview
POVID is practical: it uses GPT-4V to generate intentionally wrong (dispreferred) answers and DPO to teach the model to prefer image-grounded ones. It needs GPT-4V access and a few hours on a single A100 GPU, and on LLaVA-1.5 (7B) it lowers CHAIR S from 66.8 to 31.8 and raises LLaVA-Bench from 63.4 to 68.7.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
POVID reduces image-driven hallucination and raises overall VLLM reliability while avoiding costly human preference annotation, enabling faster, cheaper deployment of multimodal assistants.
Summary TLDR
POVID fine-tunes vision-language models by creating dispreferred (intentionally wrong) answers with AI and noisy images, then applying Direct Preference Optimization (DPO). Dispreferred examples come from GPT-4V editing of ground-truth text and from model responses to slightly distorted images. Applied to LLaVA-1.5 (7B), POVID cuts object-hallucination scores (CHAIR S) from 66.8 to 31.8 and raises overall LLaVA-Bench from 63.4 to 68.7, while training on a single A100 GPU in hours. The method is scalable, needs no human labeling, and works best when both text-hallucination and image-distortion steps are used.
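The DPO step described above can be sketched for a single preference pair as follows. This is a minimal illustration assuming the standard DPO formulation, not POVID's actual training code; the function name, the argument names, and the `beta` value are assumptions made for this example.

```python
import math


def dpo_loss(policy_logp_pref, policy_logp_disp,
             ref_logp_pref, ref_logp_disp, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under a frozen reference model.
    beta=0.1 is an illustrative value, not one taken from the summary.
    """
    # Implicit reward margin: how much more the policy favors the preferred
    # answer over the dispreferred one, relative to the reference model.
    margin = beta * ((policy_logp_pref - ref_logp_pref)
                     - (policy_logp_disp - ref_logp_disp))
    # -log(sigmoid(margin)), written as softplus(-margin) in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference model agree, the margin is zero and the loss equals log 2. The loss falls as the policy assigns relatively more probability to the image-grounded answer.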
Problem Statement
Vision LLMs hallucinate: they describe objects or relations that are not in the image, because image and text representations are poorly aligned. Human feedback helps but is costly. A scalable method is needed to teach VLLMs to prefer image-grounded answers over plausible but incorrect text-driven answers.
Main Contribution
POVID: a preference fine-tuning pipeline that uses only AI-generated dispreferences plus ground-truth preferred answers.
Two automated dispreference sources: (1) GPT-4V edits that inject plausible hallucinations into correct answers; (2) on-the-fly image distortion that triggers the model's inherent hallucination patterns.
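The second dispreference source can be illustrated with a small distortion helper. This is a sketch only: additive Gaussian noise is an assumed stand-in for the distortion step, the image is simplified to a flat list of floats in [0, 1], and `distort_image` and `noise_std` are hypothetical names rather than anything taken from the POVID code.

```python
import random


def distort_image(pixels, noise_std=0.1, seed=None):
    """Return a noisy copy of an image given as a flat list of floats in [0, 1].

    Additive Gaussian noise is an assumed stand-in for the image-distortion
    step; noise_std is a hypothetical knob, not a value from the paper.
    """
    rng = random.Random(seed)
    noisy = []
    for p in pixels:
        value = p + rng.gauss(0.0, noise_std)
        noisy.append(min(1.0, max(0.0, value)))  # clamp back to the valid range
    return noisy
```

In a pipeline of this kind, the model's response to the distorted image would serve as the dispreferred answer, while the ground-truth answer for the clean image remains the preferred one.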
Key Findings
POVID roughly halves object hallucination on captioning benchmarks: CHAIR S drops from 66.8 to 31.8 and CHAIR I from 12.7 to 5.4 (Table 1).
Combining AI-text hallucination and image distortion works better than either alone.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| CHAIR S (lower is better) | 31.8 | LLaVA-1.5 66.8 | -35.0 | Hallucination benchmarks | Table 1 reports CHAIR S for LLaVA-1.5 and POVID | Table 1 |
| CHAIR I (lower is better) | 5.4 | LLaVA-1.5 12.7 | -7.3 | Hallucination benchmarks | Table 1 reports CHAIR I for LLaVA-1.5 and POVID | Table 1 |
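
The Delta column can be checked, and turned into a relative change, with a few lines of arithmetic. The helper name below is illustrative; the input numbers are the ones reported in the table.

```python
def delta_and_relative(baseline, value):
    """Return (absolute delta, relative change in percent), rounded to 1 decimal."""
    delta = value - baseline
    return round(delta, 1), round(100.0 * delta / baseline, 1)


chair_s = delta_and_relative(66.8, 31.8)  # CHAIR S: LLaVA-1.5 vs POVID
chair_i = delta_and_relative(12.7, 5.4)   # CHAIR I: LLaVA-1.5 vs POVID
print(chair_s, chair_i)
```

Both absolute deltas match the table (-35.0 and -7.3), which corresponds to relative reductions of about 52% for CHAIR S and 57% for CHAIR I.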
What To Try In 7 Days
Generate ~10k AI-edited dispreferred captions from a small ground-truth set using GPT-4V.
Implement DPO fine-tuning on your VLLM backbone for a few epochs on one A100-equivalent GPU.
Add lightweight image distortion during training to capture the model's inherent hallucinations.
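To make the steps above concrete, the preference data they produce could be stored as one record per pair. The schema below is an assumption for illustration: the field names, the `source` labels, the file path, and the example captions are all hypothetical and not taken from the POVID release.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PreferencePair:
    image_path: str    # path to the training image
    prompt: str        # instruction given to the model
    preferred: str     # ground-truth, image-grounded answer
    dispreferred: str  # answer containing a hallucination
    source: str        # "text_edit" or "image_distortion" (assumed labels)


def to_jsonl(pairs):
    """Serialize preference pairs, one JSON object per line."""
    return "\n".join(json.dumps(asdict(p)) for p in pairs)


example = PreferencePair(
    image_path="images/0001.jpg",  # hypothetical path
    prompt="Describe the image.",
    preferred="A dog lies on a couch.",
    dispreferred="A dog lies on a couch next to a cat.",  # made-up hallucination
    source="text_edit",
)
print(to_jsonl([example]))
```

Keeping a `source` field makes it straightforward to ablate either dispreference type later, which matters given that the two sources are reported to work best in combination.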
Risks & Boundaries
Limitations
Relies on access to a strong multimodal editor (GPT-4V); quality depends on that model.
Evaluations are on LLaVA-1.5 backbone; results may vary on other architectures.
When Not To Use
You lack access to GPT-4V or similar multimodal editors.
You need human-verified labels for legal or medical auditing.
Failure Modes
Overfitting to the synthetic dispreferences, harming unrelated behaviors.
Generated dispreferences could introduce systemic biases if GPT-4V is biased.

