Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
CHiP reduces image-driven hallucinations with a modest amount of additional fine-tuning, so vision–language products can make fewer incorrect claims without rebuilding their models.
Summary TLDR
CHiP extends Direct Preference Optimization (DPO) to multimodal models by adding two modules: visual preference optimization and hierarchical textual preference optimization (at the response, segment, and token levels). Trained on the 5k-example RLHF-V dataset and tested on four hallucination benchmarks, CHiP substantially lowers object hallucination versus base models and standard DPO, improves image–text representation alignment, and keeps general capabilities largely intact. Training cost is modest: 3–5 hours on 4 H100 GPUs in the authors' runs. Code and data are released.
Problem Statement
Multimodal LLMs still hallucinate objects or details that are not present in the image, for two reasons: text and image representations are misaligned, and coarse response-level preference signals do not indicate which words are wrong. Simply applying DPO to multimodal models fixes neither the misalignment nor the fine-grained errors.
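For reference, the standard DPO objective applied to a multimodal model can be written as below. This is the well-known DPO form with the image added to the conditioning context. The symbols are assumptions introduced here and are not taken from this summary: m is the image, x the prompt, y_w and y_l the chosen and rejected responses, π_θ the policy, π_ref the frozen reference model, and β the temperature.

```latex
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(m,\,x,\,y_w,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid m, x)}{\pi_{\mathrm{ref}}(y_w \mid m, x)}
        \;-\;
        \beta \log \frac{\pi_\theta(y_l \mid m, x)}{\pi_{\mathrm{ref}}(y_l \mid m, x)}
      \right)
    \right]
```

Both terms condition on the same image m, so the preference signal comes only from the difference between the two responses, and it is a single scalar per response pair. That is consistent with the two gaps described above: nothing in the objective contrasts images, and nothing localizes the error to particular words.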
Main Contribution
Identify that multimodal DPO fails to align image and text representations and cannot clearly separate hallucinated from non-hallucinated text.
Propose CHiP, which combines visual preference optimization with hierarchical textual preference optimization at the response, segment, and token levels.
Show across ObjHal, MMHal, HallusionBench, and AMBER that CHiP reduces hallucinations versus base models and DPO and improves image–text alignment; release code and datasets.
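The combination described above can be sketched as a weighted sum of four terms. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the visual term is assumed to be a DPO-style contrast of the same response under the original image versus a rejection image, the visual and response terms are assumed to have unit weight, and the segment and token terms are passed in as precomputed numbers because their exact forms are not given in this summary.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_term(pol_chosen, ref_chosen, pol_rejected, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return -log_sigmoid(margin)


def chip_style_loss(response_lp, visual_lp, segment_term, token_term,
                    lam=1.0, gamma=0.1, beta=0.1):
    """Hypothetical combined objective (assumed weighting, see lead-in).

    response_lp: (pol_w, ref_w, pol_l, ref_l) for chosen vs rejected response,
                 both conditioned on the original image.
    visual_lp:   same layout, but for the chosen response conditioned on the
                 original image vs the rejection image (assumption).
    segment_term, token_term: precomputed scalars; their definitions are
                 outside the scope of this sketch.
    """
    response_term = dpo_term(*response_lp, beta=beta)
    visual_term = dpo_term(*visual_lp, beta=beta)
    return visual_term + response_term + lam * segment_term + gamma * token_term
```

With all log-probabilities equal, each DPO term evaluates to log 2, which is the loss of a model that expresses no preference; training should push both terms below that value.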
Key Findings
CHiP reduces the response-level object-hallucination rate on ObjHal relative to standard DPO.
The reductions are larger when CHiP is compared with base models that have no preference alignment.
CHiP improves image–text representation alignment qualitatively.
Choice of rejection-image construction matters; diffusion-based negatives work best.
Training cost is modest in the authors' experiments.
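One plausible reading of "diffusion-based negatives" is a rejection image produced by the forward (noising) step of a diffusion process applied to the original image. The sketch below shows that step in pure Python on a nested list of pixel values. It is an assumption about the construction, not the authors' pipeline: the linear beta schedule, the step counts, the function names, and the convention that pixels are scaled to [-1, 1] are all illustrative choices.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s < t under a linear schedule."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def noisy_negative(pixels, t, seed=0):
    """Forward-diffuse an image: x_t = sqrt(a) * x_0 + sqrt(1 - a) * noise.

    pixels: list of rows of floats, assumed scaled to [-1, 1].
    t:      number of noising steps; larger t moves further from the original.
    """
    rng = random.Random(seed)
    a = alpha_bar(t)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    return [[signal * p + noise * rng.gauss(0.0, 1.0) for p in row]
            for row in pixels]
```

The step count t controls how far the negative drifts from the original image, so it is the natural knob to validate on your own data when checking how the choice of rejection image affects results.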
Results
ObjHal response-level hallucination (R.)
ObjHal response-level hallucination (R.) vs DPO
AMBER Cog (human-cognition hallucination)
Who Should Care
What To Try In 7 Days
Run CHiP-style training on a copy of your MLLM using a small set (~5k) of preference pairs to see hallucination change.
Create rejection images with modest corruptions (diffusion or crop) rather than random blanks; validate which works for your data.
Tune the segment weight λ and the token weight γ; the authors found that λ≈1–3 and γ=0.1 work well.
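A small sweep around the reported values could be organized as follows. This is a sketch with hypothetical names: the grid values beyond λ in 1–3 and γ=0.1 are assumptions, and `pick_best` assumes you have already measured a hallucination rate (lower is better) and a general-capability score (higher is better) for each run.

```python
from itertools import product


def build_sweep(lambdas=(1.0, 2.0, 3.0), gammas=(0.05, 0.1, 0.2)):
    """Enumerate (segment weight, token weight) configurations to train."""
    return [{"segment_weight": lam, "token_weight": gam}
            for lam, gam in product(lambdas, gammas)]


def pick_best(results, baseline_capability, max_drop=0.01):
    """Choose the run with the lowest hallucination rate among runs whose
    general-capability score stays within max_drop of the baseline.

    results: list of dicts with keys "config", "hallucination_rate",
             and "capability" (all measured by your own evaluation).
    Returns None when no run satisfies the capability constraint.
    """
    eligible = [r for r in results
                if r["capability"] >= baseline_capability - max_drop]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r["hallucination_rate"])
```

Filtering on a capability score before ranking by hallucination rate reflects the trade-off noted in this summary: the goal is fewer hallucinations while keeping general capabilities largely intact.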
Optimization Features
Model Optimization
- preference-based fine-tuning
Training Optimization
- hierarchical textual preference (response/segment/token)
- visual preference pairs (near-image negatives)
Reproducibility
Code Urls
Data Urls
- RLHF-V-Dataset (referenced in paper)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Requires paired preference data (authors used RLHF-V 5k).
- Performance depends on how rejection images are built (diffusion/crop recommended).
- Combining multiple objectives can dilute gains when training the visual encoder end-to-end.
When Not To Use
- You lack any preference-labeled multimodal data.
- You cannot afford extra fine-tuning steps or validation runs.
- Your application prioritizes maximal object coverage (recall) over conservatively omitting uncertain objects.
Failure Modes
- May lower some logical-consistency metrics (fA) due to multi-objective trade-offs.
- If rejection images differ too much from the original image, the visual preference signal becomes noisy and harms learning.
- Jointly training the visual encoder with CHiP can dilute image-text alignment under some settings.
Core Entities
Models
- LLaVA-1.6 (7B)
- Muffin (13B)
- Vicuna-1.5-7B
- Vicuna-13B
Metrics
- response-level hallucination (R.)
- mention-level hallucination (M.)
- CHAIR
- Cover
- Cog
- Overall (MMHal)
- qA / fA / aA
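To make the first two metrics concrete, the sketch below computes a response-level rate (the share of responses that mention at least one object absent from the image) and a mention-level rate (the share of all object mentions that are absent from the image). It is a simplification with a hypothetical function name: it assumes object mentions have already been extracted and normalized into sets, whereas the real benchmark pipeline handles extraction and synonym matching.

```python
def objhal_rates(responses):
    """Compute simplified response-level and mention-level hallucination rates.

    responses: list of (mentioned_objects, ground_truth_objects) pairs,
               where each element is a set of object names.
    Returns (response_rate, mention_rate) as fractions in [0, 1].
    """
    hallucinated_responses = 0
    total_mentions = 0
    hallucinated_mentions = 0
    for mentioned, truth in responses:
        wrong = mentioned - truth  # objects named but not in the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(wrong)
        if wrong:
            hallucinated_responses += 1
    response_rate = hallucinated_responses / len(responses) if responses else 0.0
    mention_rate = hallucinated_mentions / total_mentions if total_mentions else 0.0
    return response_rate, mention_rate
```

For example, with two responses where the first mentions "dog" and "frisbee" for an image containing only a dog, and the second mentions "cat" for an image containing a cat and a sofa, one of two responses is hallucinated and one of three mentions is hallucinated. Omitting the sofa does not count against either rate, which is why a coverage-style metric is reported alongside them.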
Datasets
- RLHF-V-Dataset (5k)
- COCO-2017
Benchmarks
- Object HalBench (ObjHal)
- MMHal-Bench (MMHal)
- HallusionBench
- AMBER

