Reduce VLLM hallucinations by fine-tuning with AI-generated 'wrong' answers

Overview

Decision Snapshot: Ready For Pilot

POVID is practical: it uses AI to create intentionally wrong answers and DPO to teach the model to prefer image-grounded answers. It needs GPT-4V access and modest GPU time, and it shows clear gains on multiple benchmarks.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao

Why It Matters For Business

POVID reduces image-driven hallucination and raises overall VLLM reliability while avoiding costly human preference annotation, enabling faster, cheaper deployment of multimodal assistants.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

POVID fine-tunes vision-language models by creating dispreferred (intentionally wrong) answers with AI and noisy images, then applying Direct Preference Optimization (DPO). Dispreferred examples come from GPT-4V editing of ground-truth text and from model responses to slightly distorted images. Applied to LLaVA-1.5 (7B), POVID cuts object-hallucination scores (CHAIR S) from 66.8 to 31.8 and raises overall LLaVA-Bench from 63.4 to 68.7, while training on a single A100 GPU in hours. The method is scalable, needs no human labeling, and works best when both text-hallucination and image-distortion steps are used.

Problem Statement

Vision LLMs hallucinate, describing objects or relations that are not in the image, because image and text representations are not well aligned. Human feedback helps but is costly. We need a scalable way to teach VLLMs to prefer image-grounded answers over plausible but incorrect text-driven answers.

Main Contribution

POVID: a preference fine-tuning pipeline that uses only AI-generated dispreferences plus ground-truth preferred answers.

Two automated dispreference sources: (1) GPT-4V edits that inject plausible hallucinations into correct answers; (2) on-the-fly image distortion that triggers the model's inherent hallucination patterns.
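The preference step can be sketched for a single pair as follows. This is a minimal sketch of the standard DPO loss, not the authors' training code: the function name `dpo_loss`, the `beta` default of 0.1, and the example log-probabilities are assumptions chosen for illustration. Here the preferred response is the ground-truth answer and the dispreferred response is the AI-generated hallucinated one.

```python
import math


def dpo_loss(policy_logp_pref, policy_logp_disp,
             ref_logp_pref, ref_logp_disp, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    Each argument is the summed token log-probability of a response under
    the policy being trained or under the frozen reference model.
    beta=0.1 is an illustrative value, not taken from the paper.
    """
    pref_margin = policy_logp_pref - ref_logp_pref
    disp_margin = policy_logp_disp - ref_logp_disp
    logits = beta * (pref_margin - disp_margin)
    # -log(sigmoid(logits)), written in a numerically stable form
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical log-probabilities: the policy has moved toward the
# grounded answer and away from the hallucinated one.
untrained = dpo_loss(-12.0, -12.0, -12.0, -12.0)
improved = dpo_loss(-10.0, -15.0, -12.0, -12.0)
print(untrained, improved)
```

When the policy equals the reference model the loss is log 2, and it falls as the policy assigns relatively more probability to the grounded answer than to the hallucinated one.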

Key Findings

POVID substantially reduces object-hallucination on captioning benchmarks.

Numbers: CHAIR S: 66.8 → 31.8 (absolute -35.0)

Practical Use: Apply POVID to cut object-hallucination roughly in half on evaluated captioning tasks.

Evidence Ref: Table 1 (Hallucination Benchmark CHAIR S)

Combining AI-text hallucination and image distortion works better than either alone.

Numbers: CHAIR S: text-only 39.6, image-only 50.4, combined 31.8

Practical Use: Use both GPT-4V-based text edits and image-noising during DPO fine-tuning for best gains.

Evidence Ref: Table 3 (Ablation)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
CHAIR S (lower is better)	31.8	LLaVA-1.5 66.8	-35.0	Hallucination benchmarks	Table 1 reports CHAIR S for LLaVA-1.5 and POVID	Table 1
CHAIR I (lower is better)	5.4	LLaVA-1.5 12.7	-7.3	Hallucination benchmarks	Table 1 reports CHAIR I for LLaVA-1.5 and POVID	Table 1
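The deltas in the table can be checked, and expressed as relative reductions, with a few lines of arithmetic. This is a small illustrative sketch: the helper name `relative_reduction` is hypothetical, and the numbers are copied from the table rows above.

```python
def relative_reduction(baseline, value):
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


# (baseline LLaVA-1.5, POVID) values copied from the results table.
results = {
    "CHAIR S": (66.8, 31.8),
    "CHAIR I": (12.7, 5.4),
}

for metric, (baseline, povid) in results.items():
    delta = povid - baseline
    reduction = relative_reduction(baseline, povid)
    print(f"{metric}: delta {delta:+.1f}, relative reduction {reduction:.1%}")
```

Both metrics drop by a little more than half relative to the baseline, which matches the "roughly in half" reading of the CHAIR S result.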

What To Try In 7 Days

Generate ~10k AI-edited dispreferred captions from a small ground-truth set using GPT-4V.

Implement DPO fine-tuning on your VLLM backbone for a few epochs on one A100-equivalent GPU.

Add lightweight image distortion during training to capture the model's inherent hallucinations.
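The data side of the steps above can be sketched as a preference record plus a simple noising helper. This is a rough sketch under stated assumptions, not the POVID implementation: the `PreferencePair` class, the `add_gaussian_noise` helper, the `sigma` value, and the example captions are all hypothetical, and plain Gaussian pixel noise stands in for whatever distortion the authors actually apply.

```python
import random
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    preferred: str     # ground-truth, image-grounded answer
    dispreferred: str  # AI-edited answer with an injected hallucination


def add_gaussian_noise(pixels, sigma=0.1, seed=None):
    """Return a noisy copy of a flat list of pixel intensities in [0, 1].

    Gaussian noise and sigma=0.1 are assumptions for illustration only.
    """
    rng = random.Random(seed)
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


# Made-up example pair: the dispreferred caption adds an object that is
# not in the image.
pair = PreferencePair(
    prompt="Describe the image.",
    preferred="A dog lies on a sofa.",
    dispreferred="A dog lies on a sofa next to a cat.",
)

# Toy three-pixel "image"; a real pipeline would noise the full image tensor.
noisy = add_gaussian_noise([0.2, 0.5, 0.9], sigma=0.1, seed=0)
print(pair.dispreferred, len(noisy))
```

In a real pipeline the dispreferred text would come from the GPT-4V edit, and the noised image would be fed to the model during training so that its own response to the distorted input serves as a further dispreferred answer.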

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/YiyangZhou/POVID

Data URLs

https://github.com/YiyangZhou/POVID

Risks & Boundaries

Limitations

Relies on access to a strong multimodal editor (GPT-4V); quality depends on that model.

Evaluations are on LLaVA-1.5 backbone; results may vary on other architectures.

When Not To Use

You lack access to GPT-4V or similar multimodal editors.

You need human-verified labels for legal or medical auditing.

Failure Modes

Overfitting to the synthetic dispreferences, harming unrelated behaviors.

Generated dispreferences could introduce systemic biases if GPT-4V is biased.

Core Entities

Models

LLaVA-1.5 (7B), Vicuna (7B), GPT-4V, ViT-L

Metrics

CHAIR S, CHAIR I, POPE, MMHal, SciQA-IMG, LLaVA-Bench

Datasets

LLaVA-Instruct-150K (17K subset for dispreferences), SciQA-IMG, MMBench, MM-Vet, LLaVA-Bench, MME

Benchmarks

CHAIR (S and I), POPE, MMHal, SciQA-IMG, MM-Vet, MMBench, LLaVA-Bench
