Overview
mDPO is a modest change to the DPO training objective with low compute overhead (LoRA). Evidence spans two base models, three benchmarks, and a human evaluation, but remains limited to the selected datasets and architectures.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
mDPO cuts image-based hallucinations and raises answer quality, lowering risk in user-facing multimodal features and reducing rework from incorrect outputs.
Who Should Care
Summary TLDR
Standard Direct Preference Optimization (DPO) for multimodal LLMs often ignores images and overfits language cues, increasing hallucinations. mDPO adds (1) a conditional preference loss that contrasts the chosen image with a hard negative image and (2) an anchor that forces chosen answers to keep positive reward. Across two base models and three hallucination-focused benchmarks, mDPO reduces hallucination and improves overall quality, with strong ablation evidence that the conditional image objective is the main driver.
Problem Statement
Applying DPO to multimodal models can lead the model to ignore the image (an "unconditional preference"), producing language-only answers and more hallucinations. DPO can also reduce the likelihood of preferred (chosen) responses while enlarging preference gaps.
Main Contribution
Identify 'unconditional preference': multimodal DPO can ignore image input and learn language-only preferences.
Propose mDPO: add conditional image preference pairs and a reward anchor to keep chosen-response likelihood positive.
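The two additions named above can be illustrated numerically. This is a minimal sketch, not the authors' implementation. It assumes the usual DPO implicit reward, beta * (log pi_policy - log pi_ref), and assumes the three terms are summed with equal weight. The function names, dictionary keys, and default `beta` are hypothetical; inputs are sequence log-probabilities that would come from the policy and a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """DPO-style implicit reward: beta * (log pi_policy - log pi_ref)."""
    return beta * (logp_policy - logp_ref)


def mdpo_loss(policy: dict, ref: dict, beta: float = 0.1, delta: float = 0.0) -> float:
    """Sum of three terms (equal weighting is an assumption of this sketch).

    Hypothetical keys, each holding a sequence log-probability:
      "chosen"         : chosen answer given the original image
      "rejected"       : rejected answer given the original image
      "chosen_neg_img" : chosen answer given the hard negative image
    """
    r_chosen = implicit_reward(policy["chosen"], ref["chosen"], beta)
    r_rejected = implicit_reward(policy["rejected"], ref["rejected"], beta)
    r_neg_img = implicit_reward(policy["chosen_neg_img"], ref["chosen_neg_img"], beta)

    # Standard DPO term: chosen answer vs rejected answer, same image.
    dpo_term = -log_sigmoid(r_chosen - r_rejected)
    # Conditional image preference: same answer, original vs negative image.
    conditional_term = -log_sigmoid(r_chosen - r_neg_img)
    # Reward anchor: push the chosen answer's reward above delta.
    anchor_term = -log_sigmoid(r_chosen - delta)
    return dpo_term + conditional_term + anchor_term
```

When the policy equals the reference model, every implicit reward is zero and each term equals log 2, which is a convenient sanity check before wiring the terms into a real training loop.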
Key Findings
mDPO raises the overall MMHalBench score for Bunny-3B from 2.28 (DPO) to 2.96.
mDPO lowers the measured MMHalBench hallucination rate for Bunny-3B from 0.56 (DPO) to 0.42.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| MMHalBench overall score (Bunny-3B) | mDPO 2.96 vs DPO 2.28 | DPO 2.28 | +0.68 | MMHalBench | Table 1 reports scores | Table 1 |
| MMHalBench HalRate (Bunny-3B) | mDPO 0.42 vs DPO 0.56 | DPO 0.56 | −0.14 | MMHalBench | Table 1 reports hallucination rates | Table 1 |
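
The Delta column can be re-derived from the Value and Baseline columns. The small helper below is a hypothetical convenience, not part of the paper. It recomputes the absolute deltas and also a relative change, which the table does not report and is derived here only from the two rows shown.

```python
def deltas(value: float, baseline: float) -> tuple[float, float]:
    """Return (absolute delta, relative delta), rounded for display."""
    absolute = value - baseline
    relative = absolute / baseline
    return round(absolute, 2), round(relative, 3)


# Rows from the table: (mDPO value, DPO baseline) for Bunny-3B on MMHalBench.
overall_score = deltas(2.96, 2.28)
hal_rate = deltas(0.42, 0.56)
```

For the rows above this gives absolute deltas of +0.68 and -0.14, matching the table, and relative changes of roughly +30% in overall score and -25% in hallucination rate.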
What To Try In 7 Days
Fine-tune with standard DPO (LoRA) on 10k preference pairs to establish a baseline.
Add mDPO image-conditioned pairs: crop 0–20% of each chosen image to create a hard negative image.
Include a simple reward anchor (δ=0) for chosen responses and compare hallucination rates on a held-out set.
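The cropping step in the list above could be prototyped as follows. This is a sketch under stated assumptions: the summary does not say how the 0–20% is measured, so here a fraction drawn uniformly from [0, 0.2] is trimmed from each side length and the kept window is placed at random. The function name `crop_hard_negative` and the nested-list image representation are hypothetical, chosen to keep the example dependency-free.

```python
import random


def crop_hard_negative(image, max_frac=0.2, rng=None):
    """Return a randomly placed crop that keeps at least (1 - max_frac) of each side.

    `image` is a list of rows (each row a list of pixels). How the 0-20%
    is measured is an assumption of this sketch, not taken from the paper.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    frac = rng.uniform(0.0, max_frac)  # fraction of each side to trim
    new_h = max(1, round(height * (1.0 - frac)))
    new_w = max(1, round(width * (1.0 - frac)))
    top = rng.randint(0, height - new_h)  # random window position
    left = rng.randint(0, width - new_w)
    return [row[left:left + new_w] for row in image[top:top + new_h]]


# Toy 10x10 "image" whose pixels record their own coordinates.
toy = [[(r, c) for c in range(10)] for r in range(10)]
negative = crop_hard_negative(toy, rng=random.Random(0))
```

With a real image library the same idea maps onto its crop call. The point of the small crop range is that the negative image stays close to the original, which makes it a hard negative rather than an obviously different picture.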
Optimization Features
Training Optimization
Risks & Boundaries
Limitations
Tested on two base models and three benchmarks; broader model-architecture coverage is missing.
Default negative-image strategy (crop 0–20%) may not suit all image types or tasks.
When Not To Use
If your task is text-only (no images), mDPO is unnecessary.
If you cannot construct or afford preference pairs that include image-conditioned labels, mDPO is not practical.
Failure Modes
Too-strong anchoring could bias outputs or reduce diversity of valid responses.
Hard-negative images created by naive cropping may be insufficient for complex scenes.

