Overview
Production Readiness: 0.7
Novelty Score: 0.6
Cost Impact Score: 0.6
Citation Count: 1
Why It Matters For Business
mDPO cuts image-based hallucinations and raises answer quality, lowering risk in user-facing multimodal features and reducing rework from incorrect outputs.
Summary TLDR
Standard Direct Preference Optimization (DPO) for multimodal LLMs often ignores images and overfits language cues, increasing hallucinations. mDPO adds (1) a conditional preference loss that contrasts the chosen image with a hard negative image and (2) an anchor that forces chosen answers to keep positive reward. Across two base models and three hallucination-focused benchmarks, mDPO reduces hallucination and improves overall quality, with strong ablation evidence that the conditional image objective is the main driver.
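The two added terms can be written next to the standard DPO loss. This is a sketch of the objective as we understand it, not a formula quoted from the summary. The notation is introduced here for illustration: q is the question, m_w the original (chosen) image, m_l the hard negative image, y_w and y_l the chosen and rejected responses, \pi_\theta the policy, \pi_{\mathrm{ref}} the frozen reference model, \beta the usual DPO temperature, and \delta the anchor. Equal weighting of the three terms is an assumption.

```latex
\begin{aligned}
r_\theta(y \mid m, q) &= \beta \log \frac{\pi_\theta(y \mid m, q)}{\pi_{\mathrm{ref}}(y \mid m, q)} \\
\mathcal{L}_{\mathrm{DPO}} &= -\log \sigma\big( r_\theta(y_w \mid m_w, q) - r_\theta(y_l \mid m_w, q) \big) \\
\mathcal{L}_{\mathrm{cond}} &= -\log \sigma\big( r_\theta(y_w \mid m_w, q) - r_\theta(y_w \mid m_l, q) \big) \\
\mathcal{L}_{\mathrm{anchor}} &= -\log \sigma\big( r_\theta(y_w \mid m_w, q) - \delta \big) \\
\mathcal{L}_{\mathrm{mDPO}} &= \mathcal{L}_{\mathrm{DPO}} + \mathcal{L}_{\mathrm{cond}} + \mathcal{L}_{\mathrm{anchor}}
\end{aligned}
```

The conditional term keeps the response fixed and varies only the image, so it can be reduced only by making the response depend on the image. With \delta = 0, the anchor term pushes the chosen response's reward above zero.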
Problem Statement
Applying DPO to multimodal models can lead the model to ignore the image (an "unconditional preference"), producing language-only answers and more hallucinations. Because the DPO loss depends only on the gap between the chosen and rejected responses, it can also widen that gap while lowering the likelihood of the chosen response itself.
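A small numeric illustration of the second problem, using made-up reward values (not numbers from the paper). Here a "reward" is the scaled log-ratio between the policy and the reference model, so a negative reward means the response has become less likely than it was under the reference.

```python
import math


def dpo_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(r_chosen - r_rejected))."""
    margin = chosen_reward - rejected_reward
    return math.log1p(math.exp(-margin))


# Illustrative values only.
before = dpo_loss(chosen_reward=0.5, rejected_reward=0.0)   # margin 0.5
after = dpo_loss(chosen_reward=-1.0, rejected_reward=-3.0)  # margin 2.0

# The loss went down even though the chosen response's reward fell below zero.
print(f"before={before:.3f} after={after:.3f}")
```

The loss sees only the margin, so nothing in it prevents both rewards from falling together.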
Main Contribution
Identify 'unconditional preference': multimodal DPO can ignore image input and learn language-only preferences.
Propose mDPO: add conditional image preference pairs and a reward anchor that keeps the reward of the chosen response positive.
Show consistent improvements on three hallucination-focused benchmarks and across two model sizes; conditional preference provides the largest gain.
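The contributions listed here can be sketched as a per-example loss. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the names `LogProbs` and `mdpo_loss` are hypothetical, the inputs are assumed to be summed token log-probabilities already computed by the policy and a frozen reference model, `beta=0.1` is an illustrative default, and the three terms are assumed to be summed with equal weight.

```python
import math
from dataclasses import dataclass


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


@dataclass
class LogProbs:
    """Summed token log-probs of one response (hypothetical container)."""
    policy: float
    reference: float

    def reward(self, beta: float) -> float:
        # Implicit DPO reward: scaled log-ratio of policy to reference.
        return beta * (self.policy - self.reference)


def mdpo_loss(chosen, rejected, chosen_on_negative_image, beta=0.1, delta=0.0):
    """Return the three loss terms and their (assumed equal-weight) sum."""
    r_w = chosen.reward(beta)                        # chosen answer, real image
    r_l = rejected.reward(beta)                      # rejected answer, real image
    r_w_neg = chosen_on_negative_image.reward(beta)  # chosen answer, negative image
    terms = {
        "dpo": neg_log_sigmoid(r_w - r_l),
        "conditional": neg_log_sigmoid(r_w - r_w_neg),
        "anchor": neg_log_sigmoid(r_w - delta),
    }
    terms["total"] = sum(terms.values())
    return terms
```

In a real training loop the same arithmetic would run on batched tensors, with the chosen response scored twice: once with the original image and once with the hard negative.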
Key Findings
mDPO improves overall MMHalBench score for Bunny-3B vs DPO.
mDPO reduces measured hallucination rate on MMHalBench for Bunny-3B.
Human experts preferred or tied mDPO responses far more often.
Conditional image preference is the critical component.
Standard DPO can perform similarly even when images are removed from the training data, which indicates that it is learning language-only preferences.
Results
MMHalBench overall score (Bunny-3B)
MMHalBench HalRate (Bunny-3B)
MMHalBench overall score (LLaVA-7B)
Human pairwise preference (Bunny-3B)
What To Try In 7 Days
Fine-tune with standard DPO on 10K preference pairs (using LoRA) to establish a baseline.
Add mDPO's image-conditioned pairs: for each example, create a hard negative image by randomly cropping the original (the default setting is "crop 0–20%").
Include a simple reward anchor (δ=0) for chosen responses and compare hallucination rates on a held-out set.
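The cropping step above can be sketched as follows. This is an illustration only: the function name `hard_negative_crop_box` is hypothetical, and the summary does not say whether "0–20%" is the share of the image kept or removed. The sketch assumes it is the share of the area kept, and exposes the range as a parameter so the other reading is easy to try.

```python
import random


def hard_negative_crop_box(width, height, frac_range=(0.0, 0.2), rng=None):
    """Pick a random crop box covering a fraction of the image area.

    ASSUMPTION: frac_range is the fraction of the original area that the
    crop keeps. Returns (left, upper, right, lower) in pixels.
    """
    rng = rng or random.Random()
    low, high = frac_range
    area_frac = rng.uniform(low, high)
    # Keep the original aspect ratio: scale each side by sqrt(area fraction).
    scale = area_frac ** 0.5
    crop_w = max(1, round(width * scale))   # never smaller than one pixel
    crop_h = max(1, round(height * scale))
    left = rng.randint(0, width - crop_w)
    upper = rng.randint(0, height - crop_h)
    return (left, upper, left + crop_w, upper + crop_h)


box = hard_negative_crop_box(640, 480, rng=random.Random(0))
print(box)
```

The tuple follows the (left, upper, right, lower) convention that Pillow's `Image.crop` accepts, so it can be passed directly to an image library. The cropped image then stands in as the negative image paired with the unchanged chosen response.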
Optimization Features
Training Optimization
- preference fine-tuning (DPO variant)
- LoRA
Reproducibility
Data Urls
- Silkie (Li et al., 2023) - sampled 10K preference pairs
- MMHalBench, Object HalBench, AMBER (public benchmarks referenced)
Open Source Status
- partial
Risks & Boundaries
Limitations
- Tested on two base models and three benchmarks; broader model-architecture coverage is missing.
- Default negative-image strategy (crop 0–20%) may not suit all image types or tasks.
- Anchors and hyperparameters need tuning per dataset; paper reports a single default (δ=0).
When Not To Use
- If your task is text-only (no images), mDPO is unnecessary.
- When you cannot construct or afford preference pairs that include image-conditioned labels.
- When deployment constraints rule out any fine-tuning or LoRA adaptation of the model; mDPO is applied at training time only.
Failure Modes
- Too-strong anchoring could bias outputs or reduce diversity of valid responses.
- Hard-negative images created by naive cropping may be insufficient for complex scenes.
- mDPO may slightly reduce object coverage while lowering hallucinations, which can matter for tasks needing exhaustive mentions.
Core Entities
Models
- Bunny-v1.0-3B
- LLaVA-v1.5-7B
Metrics
- MMHalBench overall score (0–6)
- HalRate (hallucination rate)
- CHAIR_s (response-level hallucination)
- CHAIR_i (object-level hallucination)
- object coverage
- human preference (pairwise)
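For orientation, the object-level metrics in this list can be computed as below once the objects mentioned in each response have been extracted. This is a sketch under assumptions: the function name `chair_scores` is hypothetical, object extraction and synonym matching (the hard part in practice) are assumed to be done already, and coverage is pooled over the whole set, whereas a given benchmark may average per response instead.

```python
def chair_scores(responses):
    """Compute CHAIR_s, CHAIR_i and object coverage.

    responses: list of (mentioned_objects, ground_truth_objects) pairs,
    each element a set of object names.
    """
    if not responses:
        raise ValueError("need at least one response")
    total_mentions = 0
    hallucinated_mentions = 0
    responses_with_hallucination = 0
    covered = 0
    total_truth = 0
    for mentioned, truth in responses:
        hallucinated = mentioned - truth  # mentioned but not in the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        responses_with_hallucination += bool(hallucinated)
        covered += len(mentioned & truth)
        total_truth += len(truth)
    return {
        # Share of responses containing at least one hallucinated object.
        "chair_s": responses_with_hallucination / len(responses),
        # Share of all mentioned objects that are hallucinated.
        "chair_i": hallucinated_mentions / total_mentions if total_mentions else 0.0,
        # Share of ground-truth objects that were mentioned.
        "coverage": covered / total_truth if total_truth else 0.0,
    }


scores = chair_scores([
    ({"dog", "frisbee"}, {"dog", "grass"}),  # "frisbee" is hallucinated
    ({"cat"}, {"cat", "sofa"}),              # no hallucination
])
print(scores)
```

Lower CHAIR values are better and higher coverage is better. A model can lower CHAIR simply by mentioning fewer objects, which is why coverage is tracked alongside it.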
Datasets
- Silkie (LLaVA-Instruct subset, 10K sampled)
Benchmarks
- MMHalBench
- Object HalBench
- AMBER
Context Entities
Models
- GPT-4V
- LLaVA-v1.5-13B
- Qwen-VL-Chat
Metrics
- CHAIR
- GPT-4 scoring
Datasets
- Silkie (full)
- LLaVA-Instruct-150K
Benchmarks
- MMHalBench
- Object HalBench
- AMBER

