Reduce VLLM hallucinations by fine-tuning with AI-generated 'wrong' answers

February 18, 2024 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

2

Authors

Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao

Why It Matters For Business

POVID reduces image-driven hallucination and raises overall VLLM reliability while avoiding costly human preference annotation, enabling faster, cheaper deployment of multimodal assistants.

Summary TLDR

POVID fine-tunes vision-language models by creating dispreferred (intentionally wrong) answers with AI and noisy images, then applying Direct Preference Optimization (DPO). Dispreferred examples come from GPT-4V editing of ground-truth text and from model responses to slightly distorted images. Applied to LLaVA-1.5 (7B), POVID cuts object-hallucination scores (CHAIR S) from 66.8 to 31.8 and raises overall LLaVA-Bench from 63.4 to 68.7, while training on a single A100 GPU in hours. The method is scalable, needs no human labeling, and works best when both text-hallucination and image-distortion steps are used.
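The data side of this pipeline can be pictured as a set of (image, prompt, preferred answer, dispreferred answer) records. The sketch below is illustrative only: `PreferencePair`, `build_pairs`, and the `corrupt` callable are hypothetical names, and `corrupt` merely stands in for the GPT-4V editing step, whose actual prompts and API calls are not described here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class PreferencePair:
    image_path: str
    prompt: str
    chosen: str    # ground-truth answer (preferred)
    rejected: str  # edited answer with an injected hallucination (dispreferred)


def build_pairs(
    examples: Iterable[Tuple[str, str, str]],
    corrupt: Callable[[str, str], str],
) -> List[PreferencePair]:
    """Turn (image_path, prompt, answer) triples into preference pairs.

    `corrupt` is a hypothetical stand-in for the multimodal editor that
    rewrites a correct answer into a plausible but wrong one.
    """
    pairs = []
    for image_path, prompt, answer in examples:
        rejected = corrupt(prompt, answer)
        # Skip degenerate pairs where the editor returned the answer unchanged,
        # since they carry no preference signal.
        if rejected.strip() == answer.strip():
            continue
        pairs.append(PreferencePair(image_path, prompt, answer, rejected))
    return pairs
```

Passing the editor in as a callable keeps the record format independent of whichever model produces the dispreferred text.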

Problem Statement

Vision LLMs hallucinate, describing objects or relations that are not in the image, because image and text representations are not well aligned. Human feedback helps but is costly. We need a scalable way to teach VLLMs to prefer image-grounded answers over plausible but incorrect text-driven answers.

Main Contribution

POVID: a preference fine-tuning pipeline that uses only AI-generated dispreferences plus ground-truth preferred answers.

Two automated dispreference sources: (1) GPT-4V edits that inject plausible hallucinations into correct answers; (2) on-the-fly image distortion that triggers the model's inherent hallucination patterns.

Integration of both dispreferences into Direct Preference Optimization (DPO) for stable, lightweight fine-tuning.

Empirical gains across multiple hallucination and comprehensive VLLM benchmarks with modest compute; code and data released.
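For readers unfamiliar with DPO, the per-example loss can be sketched as below. This is the standard DPO formulation, not necessarily the exact objective POVID uses once image distortion is folded in. The function name, the `beta=0.1` default, and the assumption that inputs are summed sequence log-probabilities from a trainable policy and a frozen reference model are all illustrative choices.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-example DPO loss from sequence log-probabilities.

    beta (assumed value) controls how far the policy may drift from the
    reference model.
    """
    # Implicit reward margin: how much more the policy favours the chosen
    # answer over the rejected one, relative to the reference model.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy assigns relatively more probability to the preferred answer.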

Key Findings

POVID substantially reduces object-hallucination on captioning benchmarks.

Numbers: CHAIR S: 66.8 → 31.8 (absolute -35.0)

Combining AI-text hallucination and image distortion works better than either alone.

Numbers: CHAIR S: text-only 39.6, image-only 50.4, combined 31.8

POVID improves general VLLM performance beyond hallucination metrics.

Numbers: LLaVA-Bench: 63.4 → 68.7 (+5.3)

Training is low-cost in hardware terms.

Numbers: Single A100 80GB, reported time 6–20 hours

Results

Metric                            Value    Baseline (LLaVA-1.5)
CHAIR S (lower is better)         31.8     66.8
CHAIR I (lower is better)         5.4      12.7
POPE (higher is better)           86.90    85.90
LLaVA-Bench (higher is better)    68.7     63.4
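To compare these metrics on a common footing, the absolute and relative changes against the baseline can be computed directly from the reported figures. The helper name `change` and the rounding choices below are illustrative.

```python
# Reported (baseline, value) pairs from the results above.
results = {
    "CHAIR S": (66.8, 31.8),
    "CHAIR I": (12.7, 5.4),
    "POPE": (85.90, 86.90),
    "LLaVA-Bench": (63.4, 68.7),
}


def change(baseline, value):
    """Return (absolute change, relative change in percent) vs. the baseline."""
    absolute = value - baseline
    relative_pct = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative_pct, 1)


for name, (baseline, value) in results.items():
    absolute, relative_pct = change(baseline, value)
    print(f"{name}: {absolute:+.2f} absolute, {relative_pct:+.1f}% relative")
```

On these figures the CHAIR S drop of 35.0 points is a reduction of roughly 52% relative to the baseline, while the POPE gain of 1.0 point is a little over 1%.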

Who Should Care

What To Try In 7 Days

Generate ~10k AI-edited dispreferred captions from a small ground-truth set using GPT-4V.

Implement DPO fine-tuning on your VLLM backbone for a few epochs on one A100-equivalent GPU.

Add lightweight image distortion during training to capture the model's inherent hallucinations.
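The image-distortion step can be prototyped with a simple Gaussian blend. This is a minimal sketch under stated assumptions: pixels are normalised to [0, 1] and passed as a flat list, the square-root mixing weights mirror a diffusion-style noising step, and `distort_pixels` and `strength` are hypothetical names. The noise schedule actually used by POVID is not given here, and a real pipeline would apply the same idea to image tensors.

```python
import math
import random


def distort_pixels(pixels, strength, seed=None):
    """Blend pixel values in [0, 1] with Gaussian noise.

    strength=0 returns the pixels unchanged; strength=1 yields pure
    (clipped) noise. Both the parameter and its range are assumptions.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    rng = random.Random(seed)
    keep = math.sqrt(1.0 - strength)  # weight on the original signal
    mix = math.sqrt(strength)         # weight on the noise
    noisy = []
    for value in pixels:
        sample = keep * value + mix * rng.gauss(0.0, 1.0)
        # Clip back into the valid pixel range.
        noisy.append(min(1.0, max(0.0, sample)))
    return noisy
```

Sweeping `strength` over a few small values on a held-out set is one way to check that the noise provokes hallucinated answers without erasing the visual content entirely.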

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on access to a strong multimodal editor (GPT-4V); quality depends on that model.
  • Evaluations are on LLaVA-1.5 backbone; results may vary on other architectures.
  • Choice and magnitude of image noise matter; poor settings can reduce benefit.

When Not To Use

  • You lack access to GPT-4V or similar multimodal editors.
  • You need human-verified labels for legal or medical auditing.
  • Your deployment cannot tolerate AI-generated negative examples without review.

Failure Modes

  • Overfitting to the synthetic dispreferences, harming unrelated behaviors.
  • Generated dispreferences could introduce systemic biases if GPT-4V is biased.
  • Image-noise settings that are too strong may push the model away from valid visual cues.

Core Entities

Models

  • LLaVA-1.5 (7B)
  • Vicuna (7B)
  • GPT-4V
  • ViT-L

Metrics

  • CHAIR S
  • CHAIR I
  • POPE
  • MMHal
  • SciQA-IMG
  • LLaVA-Bench

Datasets

  • LLaVA-Instruct-150K (17K subset for dispreferences)
  • SciQA-IMG
  • MMBench
  • MM-Vet
  • LLaVA-Bench
  • MME

Benchmarks

  • CHAIR (S and I)
  • POPE
  • MMHal
  • SciQA-IMG
  • MM-Vet
  • MMBench
  • LLaVA-Bench