Reduce VLLM hallucinations by fine-tuning with AI-generated 'wrong' answers

February 18, 2024 · 6 min

Overview

Decision Snapshot: Ready For Pilot

POVID is practical. It uses AI to create intentionally wrong answers and DPO to teach the model to prefer image-grounded answers. It needs GPT-4V access and modest GPU time, and it improves multiple benchmarks (for example, CHAIR S drops from 66.8 to 31.8 and LLaVA-Bench rises from 63.4 to 68.7 on LLaVA-1.5 7B).

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao

Why It Matters For Business

POVID reduces image-driven hallucination and raises overall VLLM reliability while avoiding costly human preference annotation, enabling faster, cheaper deployment of multimodal assistants.

Summary TLDR

POVID fine-tunes vision-language models by creating dispreferred (intentionally wrong) answers with AI and noisy images, then applying Direct Preference Optimization (DPO). Dispreferred examples come from GPT-4V editing of ground-truth text and from model responses to slightly distorted images. Applied to LLaVA-1.5 (7B), POVID cuts object-hallucination scores (CHAIR S) from 66.8 to 31.8 and raises overall LLaVA-Bench from 63.4 to 68.7, while training on a single A100 GPU in hours. The method is scalable, needs no human labeling, and works best when both text-hallucination and image-distortion steps are used.
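For reference, the generic DPO objective can be written as below, with POVID's roles mapped onto it: the ground-truth answer plays the preferred response y_w and the AI-generated or distortion-triggered answer plays the dispreferred response y_l. This is the standard DPO formulation, not a formula taken from this summary. The symbols (x for the image plus prompt, π_ref for the frozen starting model, β for the preference temperature) are assumptions of this sketch, and any modification the authors make to the loss is not stated here.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the policy's relative likelihood of the image-grounded answer over the hallucinated one, measured against the reference model.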

Problem Statement

Vision LLMs hallucinate, describing objects or relations that are not in the image, because image and text representations are not well aligned. Human feedback helps but is costly. We need a scalable way to teach VLLMs to prefer image-grounded answers over plausible but incorrect text-driven answers.

Main Contribution

POVID: a preference fine-tuning pipeline that uses only AI-generated dispreferences plus ground-truth preferred answers.

Two automated dispreference sources: (1) GPT-4V edits that inject plausible hallucinations into correct answers; (2) on-the-fly image distortion that triggers the model's inherent hallucination patterns.
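The second source, image distortion, can be illustrated with a minimal sketch. It assumes plain additive Gaussian noise on a grayscale image stored as rows of floats in [0, 1]. The summary only says the images are "slightly distorted", so the noise type, the `sigma` value, and the function name `add_gaussian_noise` are all hypothetical. A real pipeline would more likely operate on NumPy or PyTorch tensors.

```python
import random


def add_gaussian_noise(image, sigma=0.1, seed=None):
    """Return a noisy copy of a grayscale image (rows of floats in [0, 1]).

    Assumption: additive zero-mean Gaussian noise; the actual distortion
    used by the method may differ.
    """
    rng = random.Random(seed)  # local generator keeps the result reproducible
    noisy = []
    for row in image:
        # Perturb each pixel, then clamp back into the valid [0, 1] range.
        noisy.append([min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row])
    return noisy


# Toy 3x4 mid-gray image; the input list is left unmodified.
image = [[0.5] * 4 for _ in range(3)]
noisy = add_gaussian_noise(image, sigma=0.1, seed=0)
```

In a preference pipeline of this kind, the model's answer to `noisy` would be stored as the dispreferred response, paired with the ground-truth answer for the clean image.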

Key Findings

POVID substantially reduces object-hallucination on captioning benchmarks.

Numbers: CHAIR S: 66.8 → 31.8 (absolute -35.0)

Practical Use: Apply POVID to cut object-hallucination roughly in half on evaluated captioning tasks.

Evidence Ref: Table 1 (Hallucination Benchmark CHAIR S)

Combining AI-text hallucination and image distortion works better than either alone.

Numbers: CHAIR S: text-only 39.6, image-only 50.4, combined 31.8

Practical Use: Use both GPT-4V-based text edits and image-noising during DPO fine-tuning for best gains.

Evidence Ref: Table 3 (Ablation)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
CHAIR S (lower is better) | 31.8 | LLaVA-1.5: 66.8 | -35.0 | Hallucination benchmarks | Table 1 reports CHAIR S for LLaVA-1.5 and POVID | Table 1
CHAIR I (lower is better) | 5.4 | LLaVA-1.5: 12.7 | -7.3 | Hallucination benchmarks | Table 1 reports CHAIR I for LLaVA-1.5 and POVID | Table 1
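The deltas in the results above can be re-derived with a few lines of arithmetic. The sketch below uses only the baseline and POVID values reported in this section. The relative reduction is not stated in the source and is computed here for convenience, and the helper name `deltas` is illustrative.

```python
# Baseline (LLaVA-1.5) and POVID scores as reported above; lower is better.
results = {
    "CHAIR S": {"baseline": 66.8, "povid": 31.8},
    "CHAIR I": {"baseline": 12.7, "povid": 5.4},
}


def deltas(baseline, value):
    """Return (absolute change, relative change in percent), rounded to 1 decimal."""
    absolute = round(value - baseline, 1)
    relative = round(100 * (value - baseline) / baseline, 1)
    return absolute, relative


summary = {name: deltas(r["baseline"], r["povid"]) for name, r in results.items()}
for name, (abs_d, rel_d) in summary.items():
    print(f"{name}: {abs_d:+.1f} absolute, {rel_d:+.1f}% relative")
# CHAIR S: -35.0 absolute, -52.4% relative
# CHAIR I: -7.3 absolute, -57.5% relative
```

Both metrics fall by a little more than half relative to the baseline, which matches the "roughly in half" wording used for CHAIR S.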

What To Try In 7 Days

Generate ~10k AI-edited dispreferred captions from a small ground-truth set using GPT-4V.

Implement DPO fine-tuning on your VLLM backbone for a few epochs on one A100-equivalent GPU.

Add lightweight image distortion during training to capture the model's inherent hallucinations.
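As a starting point for the DPO step, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes you already have summed sequence log-probabilities for the preferred (ground-truth) and dispreferred (AI-edited or distortion-triggered) answers under both the trainable policy and a frozen reference model. The function name `dpo_loss`, its argument names, and `beta=0.1` are illustrative assumptions, not values from the source. A real training loop would compute this on batched tensors in a deep learning framework.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-example DPO loss from sequence log-probabilities.

    beta is an assumed temperature; tune it for your own setup.
    """
    # How much more (in log space) the policy likes each answer than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, so the loss is log(2).
loss_equal = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy favors the preferred answer more than the reference does: lower loss.
loss_better = dpo_loss(-8.0, -14.0, -10.0, -10.0)
# Policy favors the dispreferred answer: higher loss.
loss_worse = dpo_loss(-14.0, -8.0, -10.0, -10.0)
```

The loss only depends on the difference between the two log-ratios, so the model is rewarded for widening the gap between grounded and hallucinated answers rather than for raising absolute likelihoods.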

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Relies on access to a strong multimodal editor (GPT-4V); quality depends on that model.

Evaluations are on LLaVA-1.5 backbone; results may vary on other architectures.

When Not To Use

You lack access to GPT-4V or similar multimodal editors.

You need human-verified labels for legal or medical auditing.

Failure Modes

Overfitting to the synthetic dispreferences, harming unrelated behaviors.

Generated dispreferences could introduce systemic biases if GPT-4V is biased.

Core Entities

Models

LLaVA-1.5 (7B), Vicuna (7B), GPT-4V, ViT-L

Metrics

CHAIR S, CHAIR I, POPE, MMHal, SciQA-IMG, LLaVA-Bench

Datasets

LLaVA-Instruct-150K (17K subset for dispreferences), SciQA-IMG, MMBench, MM-Vet, LLaVA-Bench, MME

Benchmarks

CHAIR (S and I), POPE, MMHal, SciQA-IMG, MM-Vet, MMBench, LLaVA-Bench