Fixing DPO for images by training image-conditioned preferences and anchoring chosen answers

Overview

Decision Snapshot: Ready For Pilot

Method is a modest objective change with low compute overhead (LoRA). Evidence spans two base models, three benchmarks, and human eval but remains limited to selected datasets and architectures.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Fei Wang, Wenxuan Zhou, James Y. Huang, Nan Xu, Sheng Zhang, Hoifung Poon, Muhao Chen

Why It Matters For Business

mDPO cuts image-based hallucinations and raises answer quality, lowering risk in user-facing multimodal features and reducing rework from incorrect outputs.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

Standard Direct Preference Optimization (DPO) for multimodal LLMs often ignores images and overfits language cues, increasing hallucinations. mDPO adds (1) a conditional preference loss that contrasts the chosen image with a hard negative image and (2) an anchor that forces chosen answers to keep positive reward. Across two base models and three hallucination-focused benchmarks, mDPO reduces hallucination and improves overall quality, with strong ablation evidence that the conditional image objective is the main driver.

Problem Statement

Applying DPO to multimodal models can lead the model to ignore the image (an "unconditional preference"), producing language-only answers and more hallucinations. DPO can also reduce the likelihood of preferred (chosen) responses while enlarging preference gaps.
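The second effect can be seen with a small worked example. This is a minimal sketch with made-up numbers, not figures from the paper: the log-ratios and `beta=0.1` are hypothetical. It shows that the standard DPO loss depends only on the gap between chosen and rejected responses, so it can fall while the chosen response becomes less likely than under the reference model.

```python
import math

def dpo_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """Standard DPO loss for one pair, given log(pi / pi_ref) per response."""
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-ratios before and after some training steps.
before = dpo_loss(chosen_logratio=0.0, rejected_logratio=0.0)
after = dpo_loss(chosen_logratio=-5.0, rejected_logratio=-20.0)

# The loss falls even though the chosen log-ratio dropped from 0.0 to -5.0,
# because only the gap between the two responses enters the loss.
print(f"loss before: {before:.4f}, loss after: {after:.4f}")
# loss before: 0.6931, loss after: 0.2014
```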

Main Contribution

Identify 'unconditional preference': multimodal DPO can ignore image input and learn language-only preferences.

Propose mDPO: add conditional image preference pairs and a reward anchor that keeps the reward of chosen responses positive.
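One way to write the objective described above is sketched below. The notation is an assumption of this sketch rather than a quotation from the paper: $q$ is the question, $m_w$ the original image, $m_l$ the hard negative image, $y_w$ and $y_l$ the chosen and rejected responses, $\pi_\theta$ the policy, $\pi_{\text{ref}}$ the reference model, $\beta$ the usual DPO temperature, and $\delta$ the anchor value. Combining the three terms as an unweighted sum is also an assumption.

```latex
\begin{align}
\mathcal{L}_{\text{DPO}} &= -\log \sigma\!\Big(
  \beta \log \tfrac{\pi_\theta(y_w \mid m_w, q)}{\pi_{\text{ref}}(y_w \mid m_w, q)}
  - \beta \log \tfrac{\pi_\theta(y_l \mid m_w, q)}{\pi_{\text{ref}}(y_l \mid m_w, q)}\Big) \\
\mathcal{L}_{\text{cond}} &= -\log \sigma\!\Big(
  \beta \log \tfrac{\pi_\theta(y_w \mid m_w, q)}{\pi_{\text{ref}}(y_w \mid m_w, q)}
  - \beta \log \tfrac{\pi_\theta(y_w \mid m_l, q)}{\pi_{\text{ref}}(y_w \mid m_l, q)}\Big) \\
\mathcal{L}_{\text{anchor}} &= -\log \sigma\!\Big(
  \beta \log \tfrac{\pi_\theta(y_w \mid m_w, q)}{\pi_{\text{ref}}(y_w \mid m_w, q)}
  - \delta\Big) \\
\mathcal{L}_{\text{mDPO}} &= \mathcal{L}_{\text{DPO}} + \mathcal{L}_{\text{cond}} + \mathcal{L}_{\text{anchor}}
\end{align}
```

The conditional term holds the response fixed and varies only the image, so the model cannot lower it through language cues alone. The anchor term penalizes a chosen-response reward below $\delta$.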

Key Findings

mDPO improves overall MMHalBench score for Bunny-3B vs DPO.

Numbers: MMHalBench score +0.68 (DPO 2.28 → mDPO 2.96)

Practical Use: Fine-tuning a 3B multimodal model with mDPO instead of vanilla DPO raised the MMHalBench overall score (0–6 scale) by 0.68 points.

Evidence Ref: Table 1

mDPO reduces measured hallucination rate on MMHalBench for Bunny-3B.

Numbers: HalRate −0.14 (DPO 0.56 → mDPO 0.42)

Practical Use: Use mDPO to cut hallucinated responses in image QA tasks, reducing risky outputs that contradict the image.

Evidence Ref: Table 1

Results

Metric	Value (mDPO)	Baseline (DPO)	Delta	Split / Dataset	Evidence	Evidence Ref
MMHalBench overall score (Bunny-3B)	2.96	2.28	+0.68	MMHalBench	Table 1 reports scores	Table 1
MMHalBench HalRate (Bunny-3B)	0.42	0.56	−0.14	MMHalBench	Table 1 reports hallucination rates	Table 1
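The deltas in the table are absolute differences. As a quick arithmetic check, the sketch below recomputes them from the reported values and also derives the relative change; the relative figures are computed here and are not reported in the source. The helper name `deltas` is illustrative.

```python
def deltas(baseline, value):
    """Return (absolute delta, relative delta in percent) against a baseline."""
    absolute = value - baseline
    relative = absolute / baseline
    return round(absolute, 2), round(relative * 100, 1)

score = deltas(baseline=2.28, value=2.96)    # MMHalBench overall score
halrate = deltas(baseline=0.56, value=0.42)  # MMHalBench HalRate

print(score)    # (0.68, 29.8)
print(halrate)  # (-0.14, -25.0)
```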

What To Try In 7 Days

Fine-tune with standard DPO (LoRA) on 10k preference pairs to establish a baseline.

Build image-conditioned pairs for mDPO: create a hard negative image for each example by cropping 0–20% of the original.

Include a simple reward anchor (δ=0) for chosen responses and compare hallucination rates on a held-out set.
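The second and third steps above can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions, not the authors' implementation: `make_hard_negative` and `mdpo_loss` are hypothetical names, images are plain lists of pixel rows, "crop 0–20%" is read as removing up to 20% of the image area, `beta=0.1` is a placeholder, and the three loss terms are summed without weights. The baseline run and the held-out evaluation are not covered.

```python
import math
import random

def make_hard_negative(image, max_fraction=0.2, rng=random):
    """Return a crop that drops between 0 and max_fraction of the image area.

    `image` is a list of pixel rows. Reading "crop 0-20%" as "remove up to
    20% of the area" is an assumption of this sketch.
    """
    height, width = len(image), len(image[0])
    removed = rng.uniform(0.0, max_fraction)
    side_scale = math.sqrt(1.0 - removed)  # same scale on both axes
    new_h = max(1, round(height * side_scale))
    new_w = max(1, round(width * side_scale))
    top = rng.randint(0, height - new_h)
    left = rng.randint(0, width - new_w)
    return [row[left:left + new_w] for row in image[top:top + new_h]]

def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x))."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

def mdpo_loss(lr_chosen, lr_rejected, lr_chosen_neg_image, beta=0.1, delta=0.0):
    """Sum of the three loss terms for one example.

    Each `lr_*` is log(pi / pi_ref) of a response, summed over its tokens:
      lr_chosen            chosen response, original image
      lr_rejected          rejected response, original image
      lr_chosen_neg_image  chosen response, hard negative image
    """
    dpo = neg_log_sigmoid(beta * (lr_chosen - lr_rejected))
    conditional = neg_log_sigmoid(beta * (lr_chosen - lr_chosen_neg_image))
    anchor = neg_log_sigmoid(beta * lr_chosen - delta)  # delta=0: keep reward positive
    return dpo + conditional + anchor

rng = random.Random(0)
image = [[(r, c) for c in range(100)] for r in range(100)]  # toy 100x100 "image"
negative = make_hard_negative(image, rng=rng)
print(len(negative), len(negative[0]))
print(mdpo_loss(lr_chosen=2.0, lr_rejected=-1.0, lr_chosen_neg_image=0.5))
```

In a real training loop the three log-ratios would come from forward passes of the policy and the frozen reference model, with the chosen response scored once on the original image and once on the hard negative.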

Optimization Features

Training Optimization

Preference fine-tuning (DPO variant)

LoRA

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://feiwang96.github.io/mDPO

Data URLs

Silkie (Li et al., 2023): sampled 10K preference pairs

MMHalBench, Object HalBench, AMBER (public benchmarks referenced)

Risks & Boundaries

Limitations

Tested on two base models and three benchmarks; broader model-architecture coverage is missing.

Default negative-image strategy (crop 0–20%) may not suit all image types or tasks.

When Not To Use

If your task is text-only (no images), mDPO is unnecessary.

If you cannot construct or afford preference pairs that include image-conditioned labels, mDPO is not applicable.

Failure Modes

Too-strong anchoring could bias outputs or reduce diversity of valid responses.

Hard-negative images created by naive cropping may be insufficient for complex scenes.

Core Entities

Models

Bunny-v1.0-3B

LLaVA-v1.5-7B

Metrics

MMHalBench overall score (0–6)

HalRate (hallucination rate)

CHAIR_s (response-level hallucination)

CHAIR_i (object-level hallucination)

Object coverage

Human preference (pairwise)

Datasets

Silkie (LLaVA-Instruct subset, 10K sampled)

Benchmarks

MMHalBench

Object HalBench

AMBER

Context Entities

Models

GPT-4V

LLaVA-v1.5-13B

Qwen-VL-Chat

Metrics

CHAIR

GPT-4 scoring

Datasets

Silkie (full)

LLaVA-Instruct-150K

Benchmarks

MMHalBench

Object HalBench

AMBER
