Fixing DPO for images by training image-conditioned preferences and anchoring chosen answers

June 17, 2024 · 6 min

Overview

Decision Snapshot: Ready For Pilot

Method is a modest objective change with low compute overhead (LoRA). Evidence spans two base models, three benchmarks, and human eval but remains limited to selected datasets and architectures.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Fei Wang, Wenxuan Zhou, James Y. Huang, Nan Xu, Sheng Zhang, Hoifung Poon, Muhao Chen

Why It Matters For Business

mDPO cuts image-based hallucinations and raises answer quality, lowering risk in user-facing multimodal features and reducing rework from incorrect outputs.

Summary TLDR

Standard Direct Preference Optimization (DPO) for multimodal LLMs often ignores images and overfits language cues, increasing hallucinations. mDPO adds (1) a conditional preference loss that contrasts the chosen image with a hard negative image and (2) an anchor that forces chosen answers to keep positive reward. Across two base models and three hallucination-focused benchmarks, mDPO reduces hallucination and improves overall quality, with strong ablation evidence that the conditional image objective is the main driver.

Problem Statement

Applying DPO to multimodal models can lead the model to ignore the image (an "unconditional preference"), producing language-only answers and more hallucinations. DPO can also reduce the likelihood of preferred (chosen) responses while enlarging preference gaps.
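The second failure mode can be illustrated with a small numeric sketch. The DPO loss depends only on the gap between the implicit rewards of the chosen and rejected responses, so it can fall even while the chosen response becomes less likely. The log-probabilities and the value of `beta` below are made-up illustrative values, not numbers from the paper.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Hypothetical reference log-probs for the chosen (w) and rejected (l) answers.
ref_w, ref_l = -10.0, -11.0

# Before training the policy equals the reference, so the margin is zero.
loss_before = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# After training: the chosen log-prob fell (-10 -> -12), but the rejected
# one fell much further (-11 -> -20), so the preference gap widened.
loss_after = dpo_loss(-12.0, -20.0, ref_w, ref_l)

print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

The loss improves in this toy case although the chosen answer lost probability mass, which is the behavior a reward anchor is meant to discourage.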

Main Contribution

Identify "unconditional preference": multimodal DPO can ignore the image input and learn language-only preferences.

Propose mDPO: add image-conditioned preference pairs and a reward anchor that keeps the reward of chosen responses positive.
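A minimal sketch of how the two additions could sit next to the standard DPO term, assuming per-sequence log-probabilities are already available from the policy and a frozen reference model. The dictionary keys, the default `beta`, and the equal-weight sum of the three terms are assumptions made for illustration; the released code may organize or weight them differently.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def reward(logp, ref_logp, beta):
    """DPO-style implicit reward: beta * log(pi / pi_ref)."""
    return beta * (logp - ref_logp)

def mdpo_loss(lp, ref, beta=0.1, delta=0.0):
    """lp / ref: sequence log-probs under the policy / reference model.

    Keys (hypothetical names):
      'w_img'     chosen answer, original image
      'l_img'     rejected answer, original image
      'w_neg_img' chosen answer, hard negative image
    """
    r_w = reward(lp["w_img"], ref["w_img"], beta)
    r_l = reward(lp["l_img"], ref["l_img"], beta)
    r_w_neg = reward(lp["w_neg_img"], ref["w_neg_img"], beta)

    dpo = -log_sigmoid(r_w - r_l)              # chosen vs rejected answer
    conditional = -log_sigmoid(r_w - r_w_neg)  # original vs negative image
    anchor = -log_sigmoid(r_w - delta)         # push chosen reward above delta
    return dpo + conditional + anchor          # equal weights: an assumption
```

The conditional term only decreases when the chosen answer is scored higher with the original image than with the hard negative, so a model that ignores the image cannot reduce it.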

Key Findings

mDPO improves overall MMHalBench score for Bunny-3B vs DPO.

Numbers: MMHalBench score +0.68 (DPO 2.28 → mDPO 2.96)

Practical Use: If you finetune a 3B multimodal model with mDPO, you can noticeably raise answer quality on this benchmark versus vanilla DPO.

Evidence Ref: Table 1

mDPO reduces measured hallucination rate on MMHalBench for Bunny-3B.

Numbers: HalRate −0.14 (DPO 0.56 → mDPO 0.42)

Practical Use: Use mDPO to cut hallucinated responses in image QA tasks; this reduces risky, image-incorrect outputs.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
MMHalBench overall score (Bunny-3B) | mDPO 2.96 vs DPO 2.28 | DPO 2.28 | +0.68 | MMHalBench | Table 1 reports scores | Table 1
MMHalBench HalRate (Bunny-3B) | mDPO 0.42 vs DPO 0.56 | DPO 0.56 | −0.14 | MMHalBench | Table 1 reports hallucination rates | Table 1
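The deltas reported for Bunny-3B can be re-derived, and put in relative terms, with a few lines of arithmetic. The helper name `deltas` is illustrative; the four input numbers are the ones listed above.

```python
def deltas(baseline, new):
    """Return (absolute change, change relative to the baseline)."""
    absolute = new - baseline
    return absolute, absolute / baseline

score_abs, score_rel = deltas(2.28, 2.96)  # MMHalBench overall score
hal_abs, hal_rel = deltas(0.56, 0.42)      # MMHalBench HalRate

print(f"score:   {score_abs:+.2f} ({score_rel:+.1%})")
print(f"HalRate: {hal_abs:+.2f} ({hal_rel:+.1%})")
```

On these numbers the score gain is roughly 30% relative to the DPO baseline, and the hallucination rate drops by about a quarter.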

What To Try In 7 Days

Run standard DPO fine-tuning with LoRA on 10K preference pairs to establish a baseline.

Add mDPO image-conditioned pairs by cropping 0–20% to create hard negative images.

Include a simple reward anchor (δ=0) for chosen responses and compare hallucination rates on a held-out set.
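The hard-negative step above can be sketched as a crop-box sampler. This assumes "cropping 0–20%" means discarding up to 20% of the image area while keeping the aspect ratio; that reading, the function name, and its parameters are assumptions, so check the released code for the exact procedure.

```python
import math
import random

def random_crop_box(width, height, max_removed=0.20, rng=random):
    """Return (left, upper, right, lower) for a crop that drops up to
    `max_removed` of the image area (assumed reading of "crop 0-20%")."""
    removed = rng.uniform(0.0, max_removed)
    scale = math.sqrt(1.0 - removed)       # same scale on both sides
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    left = rng.randint(0, width - new_w)   # random placement of the window
    upper = rng.randint(0, height - new_h)
    return left, upper, left + new_w, upper + new_h

# Example: sample a box for a 640x480 image with a seeded generator.
box = random_crop_box(640, 480, rng=random.Random(0))
print(box)
```

The returned 4-tuple follows the (left, upper, right, lower) convention that Pillow's `Image.crop` accepts, so it could be applied to the original image to produce the negative. The chosen answer is then scored against both the original and the cropped image.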

Optimization Features

Training Optimization
preference fine-tuning (DPO variant), LoRA

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Silkie (Li et al., 2023) - sampled 10K preference pairs
MMHalBench, Object HalBench, AMBER (public benchmarks referenced)

Risks & Boundaries

Limitations

Tested on two base models and three benchmarks; broader model-architecture coverage is missing.

Default negative-image strategy (crop 0–20%) may not suit all image types or tasks.

When Not To Use

If your task is text-only (no images), mDPO is unnecessary.

If you cannot construct or afford preference pairs that include image-conditioned labels, mDPO is not applicable.

Failure Modes

Too-strong anchoring could bias outputs or reduce diversity of valid responses.

Hard-negative images created by naive cropping may be insufficient for complex scenes.

Core Entities

Models

Bunny-v1.0-3B, LLaVA-v1.5-7B

Metrics

MMHalBench overall score (0–6), HalRate (hallucination rate), CHAIR_s (response-level hallucination), CHAIR_i (object-level hallucination), object coverage, human preference (pairwise)

Datasets

Silkie (LLaVA-Instruct subset, 10K sampled)

Benchmarks

MMHalBench, Object HalBench, AMBER

Context Entities

Models

GPT-4V, LLaVA-v1.5-13B, Qwen-VL-Chat

Metrics

CHAIR, GPT-4 scoring

Datasets

Silkie (full), LLaVA-Instruct-150K

Benchmarks

MMHalBench, Object HalBench, AMBER