Fixing DPO for images by training image-conditioned preferences and anchoring chosen answers

June 17, 2024 · 6 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

1

Authors

Fei Wang, Wenxuan Zhou, James Y. Huang, Nan Xu, Sheng Zhang, Hoifung Poon, Muhao Chen

Why It Matters For Business

mDPO cuts image-based hallucinations and raises answer quality, lowering risk in user-facing multimodal features and reducing rework from incorrect outputs.

Summary TLDR

Standard Direct Preference Optimization (DPO) for multimodal LLMs often ignores images and overfits language cues, increasing hallucinations. mDPO adds (1) a conditional preference loss that contrasts the chosen image with a hard negative image and (2) an anchor that forces chosen answers to keep positive reward. Across two base models and three hallucination-focused benchmarks, mDPO reduces hallucination and improves overall quality, with strong ablation evidence that the conditional image objective is the main driver.

Problem Statement

Applying DPO to multimodal models can lead the model to ignore the image (an "unconditional preference"), producing language-only answers and more hallucinations. DPO can also reduce the likelihood of preferred (chosen) responses while enlarging preference gaps.

Main Contribution

Identify "unconditional preference": multimodal DPO can ignore the image input and learn language-only preferences.

Propose mDPO: add conditional image preference pairs and a reward anchor that keeps the reward of chosen responses positive.

Show consistent improvements on three hallucination-focused benchmarks across two base models (Bunny-3B and LLaVA-7B); the conditional preference objective provides the largest gain.
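A minimal sketch of how the three terms could fit together, assuming each one takes the standard DPO log-sigmoid form over implicit rewards β·log(π/π_ref). The function names, dictionary keys, β = 0.1, and the equal weighting of the terms are illustrative assumptions, not details taken from this summary; δ = 0 is the default noted under Limitations.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """DPO-style implicit reward: beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def mdpo_loss(policy: dict, ref: dict, beta: float = 0.1, delta: float = 0.0) -> dict:
    """Combine answer, image, and anchor terms (hypothetical helper).

    `policy` and `ref` hold sequence log-probabilities for:
      "chosen"           - chosen answer given the original image
      "rejected"         - rejected answer given the original image
      "chosen_neg_image" - chosen answer given the hard negative image
    """
    r_chosen = implicit_reward(policy["chosen"], ref["chosen"], beta)
    r_rejected = implicit_reward(policy["rejected"], ref["rejected"], beta)
    r_neg_image = implicit_reward(
        policy["chosen_neg_image"], ref["chosen_neg_image"], beta
    )

    answer_term = -log_sigmoid(r_chosen - r_rejected)   # standard DPO term
    image_term = -log_sigmoid(r_chosen - r_neg_image)   # conditional image preference
    anchor_term = -log_sigmoid(r_chosen - delta)        # pushes chosen reward above delta
    return {
        "answer": answer_term,
        "image": image_term,
        "anchor": anchor_term,
        "total": answer_term + image_term + anchor_term,  # equal weights assumed
    }


# Made-up log-probabilities, for illustration only.
ref = {"chosen": -12.0, "rejected": -12.5, "chosen_neg_image": -12.2}
policy = {"chosen": -10.0, "rejected": -14.0, "chosen_neg_image": -13.0}
at_init = mdpo_loss(ref, ref)
after_training = mdpo_loss(policy, ref)
```

When the policy equals the reference, every implicit reward is zero and each term equals log 2. If a policy gives the chosen answer the same implicit reward under both images, the image term stays at log 2 however well the answers are separated, which is the "unconditional preference" signal the extra term is meant to penalize.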

Key Findings

mDPO improves overall MMHalBench score for Bunny-3B vs DPO.

Numbers: MMHalBench score +0.68 (DPO 2.28 → mDPO 2.96)

mDPO reduces measured hallucination rate on MMHalBench for Bunny-3B.

Numbers: HalRate −0.14 (DPO 0.56 → mDPO 0.42)

Human experts preferred or tied mDPO responses far more often.

Numbers: mDPO better or tied 89% vs DPO better 11%

Conditional image preference is the critical component.

Numbers: Ablation: removing the conditional objective drops the score 2.96 → 2.36 and raises HalRate 0.42 → 0.53

Standard DPO can perform similarly even when the images are removed from its training data, which suggests it is learning language-only preferences.

Results

MMHalBench overall score (Bunny-3B)

Value: mDPO 2.96 vs DPO 2.28

Baseline: DPO 2.28

MMHalBench HalRate (Bunny-3B)

Value: mDPO 0.42 vs DPO 0.56

Baseline: DPO 0.56

MMHalBench overall score (LLaVA-7B)

Value: mDPO 2.39 vs DPO 2.14

Baseline: DPO 2.14

Human pairwise preference (Bunny-3B)

Value: mDPO preferred or tied in 89% of cases

Baseline: DPO preferred in 11%
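The absolute and relative changes implied by the MMHalBench numbers above can be checked with a few lines of arithmetic. The relative percentages are derived here from the listed values and are not figures reported in this summary.

```python
# (DPO, mDPO) pairs copied from the Results entries above.
results = {
    "MMHalBench score (Bunny-3B)": (2.28, 2.96),
    "MMHalBench HalRate (Bunny-3B)": (0.56, 0.42),
    "MMHalBench score (LLaVA-7B)": (2.14, 2.39),
}


def change(dpo: float, mdpo: float) -> tuple:
    """Return (absolute change, relative change in percent), rounded."""
    absolute = mdpo - dpo
    relative_pct = 100.0 * absolute / dpo
    return round(absolute, 2), round(relative_pct, 1)


summary = {name: change(dpo, mdpo) for name, (dpo, mdpo) in results.items()}
for name, (absolute, relative_pct) in summary.items():
    print(f"{name}: {absolute:+.2f} ({relative_pct:+.1f}%)")
```

For the score, higher is better; for HalRate, lower is better, so the negative change on that row is the improvement.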

Who Should Care

What To Try In 7 Days

Fine-tune with standard DPO on 10K preference pairs (using LoRA) to establish a baseline.

Add mDPO image-conditioned pairs, creating hard negative images by cropping 0–20% of each original image.

Include a simple reward anchor (δ=0) for chosen responses and compare hallucination rates on a held-out set.
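One way the hard-negative step might be prototyped is sketched below. It assumes "cropping 0–20%" means discarding a random 0–20% of the image area while keeping the aspect ratio; that reading, the function name `crop_negative`, and the nested-list image representation are assumptions made for illustration, so check the paper's exact recipe before relying on it.

```python
import math
import random


def crop_negative(image, max_removed=0.20, rng=None):
    """Return a random crop that discards at most `max_removed` of the area.

    `image` is a list of pixel rows (any pixel type); hypothetical helper.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])

    removed = rng.uniform(0.0, max_removed)   # fraction of area to discard
    scale = math.sqrt(1.0 - removed)          # same scale on both sides
    new_h = max(1, round(height * scale))
    new_w = max(1, round(width * scale))

    # Pick a random position for the retained window.
    top = rng.randint(0, height - new_h)
    left = rng.randint(0, width - new_w)
    return [row[left:left + new_w] for row in image[top:top + new_h]]


# Toy 100x100 "image" whose pixels record their own coordinates.
original = [[(y, x) for x in range(100)] for y in range(100)]
negative = crop_negative(original, rng=random.Random(0))
```

Each training example would then pair the chosen answer with both `original` and `negative`, so the model is rewarded only when the answer is more likely given the intact image.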

Optimization Features

Training Optimization

  • preference fine-tuning (DPO variant)
  • LoRA

Reproducibility

Data URLs

  • Silkie (Li et al., 2023) - sampled 10K preference pairs
  • MMHalBench, Object HalBench, AMBER (public benchmarks referenced)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Tested on two base models and three benchmarks; broader model-architecture coverage is missing.
  • Default negative-image strategy (crop 0–20%) may not suit all image types or tasks.
  • Anchors and hyperparameters need tuning per dataset; paper reports a single default (δ=0).

When Not To Use

  • If your task is text-only (no images), mDPO is unnecessary.
  • When you cannot construct or afford preference pairs that include image-conditioned labels.
  • When real-time inference constraints prevent any finetuning or LoRA adaptation.

Failure Modes

  • Too-strong anchoring could bias outputs or reduce diversity of valid responses.
  • Hard-negative images created by naive cropping may be insufficient for complex scenes.
  • mDPO may slightly reduce object coverage while lowering hallucinations, which can matter for tasks needing exhaustive mentions.

Core Entities

Models

  • Bunny-v1.0-3B
  • LLaVA-v1.5-7B

Metrics

  • MMHalBench overall score (0–6)
  • HalRate (hallucination rate)
  • CHAIR_s (response-level hallucination)
  • CHAIR_i (object-level hallucination)
  • object coverage
  • human preference (pairwise)
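The object-level metrics above can be illustrated with a short sketch that follows the common CHAIR formulation: CHAIR_s is the share of responses containing at least one hallucinated object, CHAIR_i is the share of mentioned objects that are hallucinated, and object coverage is the share of ground-truth objects that get mentioned. It assumes objects have already been extracted from each response; the benchmarks' own extraction and matching rules are not modeled, and the function name is hypothetical.

```python
def chair_metrics(responses):
    """Compute CHAIR_s, CHAIR_i, and object coverage.

    `responses` is a list of (mentioned_objects, ground_truth_objects)
    pairs, each given as a set of object names.
    """
    responses_with_hallucination = 0
    mentioned_total = 0
    hallucinated_total = 0
    truth_total = 0
    covered_total = 0

    for mentioned, truth in responses:
        hallucinated = mentioned - truth          # mentioned but not in the image
        if hallucinated:
            responses_with_hallucination += 1
        mentioned_total += len(mentioned)
        hallucinated_total += len(hallucinated)
        truth_total += len(truth)
        covered_total += len(mentioned & truth)   # correctly mentioned objects

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "chair_s": ratio(responses_with_hallucination, len(responses)),
        "chair_i": ratio(hallucinated_total, mentioned_total),
        "coverage": ratio(covered_total, truth_total),
    }


# Toy example: the second response hallucinates a "lamp".
toy = [
    ({"dog", "ball"}, {"dog", "ball", "tree"}),
    ({"cat", "sofa", "lamp"}, {"cat", "sofa"}),
]
metrics = chair_metrics(toy)
```

A model can lower both CHAIR values simply by mentioning fewer objects, which is why coverage is worth tracking alongside them, as the failure mode about reduced object coverage notes.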

Datasets

  • Silkie (LLaVA-Instruct subset, 10K sampled)

Benchmarks

  • MMHalBench
  • Object HalBench
  • AMBER

Context Entities

Models

  • GPT-4V
  • LLaVA-v1.5-13B
  • Qwen-VL-Chat

Metrics

  • CHAIR
  • GPT-4 scoring

Datasets

  • Silkie (full)
  • LLaVA-Instruct-150K

Benchmarks

  • MMHalBench
  • Object HalBench
  • AMBER