Overview
The authors report large hallucination reductions on four standard benchmarks, supported by ablations and human checks; results rely on a 5k-pair preference dataset and moderate compute (4 H100s).
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 50%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
CHiP meaningfully lowers image-driven hallucinations with modest additional training, so vision–language products can make fewer incorrect claims without rebuilding their models.
Summary TLDR
CHiP extends Direct Preference Optimization (DPO) to multimodal models by adding two modules: visual preference optimization and hierarchical textual preference optimization (response, segment, token levels). Trained on a 5k RLHF-V dataset and tested on four hallucination benchmarks, CHiP substantially lowers object hallucination versus base models and standard DPO, improves image–text representation alignment, and keeps general capabilities largely intact. Training is modest (3–5 hours on 4 H100s in authors' runs). Code and data are released.
Problem Statement
Multimodal LLMs still hallucinate objects or details not present in images because text and image representations are misaligned and coarse response-level preference signals miss which words are wrong. Simply applying DPO to multimodal models does not fix this misalignment or fine-grained errors.
Main Contribution
Identify that multimodal DPO fails to align image and text representations and cannot clearly separate hallucinated vs non-hallucinated text.
Propose CHiP, which combines visual preference optimization with hierarchical textual preference optimization at the response, segment, and token levels.
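To make the structure of the objective concrete, here is a minimal sketch in plain Python. The `dpo_loss` function is the standard DPO loss for a single preference pair. The `chip_style_loss` function is an assumption about how the four terms might be combined (visual, response, segment weighted by λ, token weighted by γ); the function names, the additive form, and the scalar placeholders for the segment and token terms are illustrative and are not taken from the authors' code.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, given summed sequence log-probabilities.

    pol_* come from the policy being trained, ref_* from the frozen reference.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def chip_style_loss(visual: float, response: float, segment: float,
                    token: float, lam: float = 1.0, gamma: float = 0.1) -> float:
    """Hypothetical combination of the four terms (assumed additive form).

    Each argument is an already-computed scalar loss term; how the segment
    and token terms are computed is not shown here.
    """
    return visual + response + lam * segment + gamma * token


# Response-level term: same image, chosen vs rejected text.
response_term = dpo_loss(-12.0, -15.0, -13.0, -14.0)
# Visual term: same chosen text, scored with the original vs the rejection image.
visual_term = dpo_loss(-12.0, -13.5, -13.0, -13.2)
total = chip_style_loss(visual_term, response_term, segment=0.4, token=0.2,
                        lam=2.0, gamma=0.1)
print(round(total, 4))
```

The visual term reuses the same pairwise form but contrasts two images for one response, which is how a preference signal can reach the image–text alignment rather than only the text.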
Key Findings
CHiP reduces object-hallucination rate on ObjHal vs DPO.
CHiP yields larger reductions versus base (no alignment) models.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ObjHal response-level hallucination (R.) | Muffin +CHiP 6.2% | Muffin base 21.5% | -71.2% relative vs base | ObjHal (300 instances) | Table 1, Muffin rows | Table 1 |
| ObjHal response-level hallucination (R.) | LLaVA-1.6 +CHiP 4.9% | LLaVA-1.6 base 14.1% | -65.3% relative vs base | ObjHal | Table 1, LLaVA rows | Table 1 |
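The Delta column is a relative change against the base model. A short sketch of that arithmetic, using only the rounded values shown in the table (the function name is illustrative):

```python
def relative_change(baseline: float, value: float) -> float:
    """Percentage change of value relative to baseline (negative = reduction)."""
    return (value - baseline) / baseline * 100.0


# (base rate %, +CHiP rate %) taken from the two table rows above.
rows = {"Muffin": (21.5, 6.2), "LLaVA-1.6": (14.1, 4.9)}
for name, (base, chip) in rows.items():
    print(f"{name}: {relative_change(base, chip):.1f}%")
# Muffin: -71.2%
# LLaVA-1.6: -65.2%
```

Recomputing from the rounded table entries reproduces the Muffin delta exactly and gives -65.2% for LLaVA-1.6, one unit in the last digit away from the listed -65.3%. A gap of that size is what rounding of the inputs would produce, though the source does not state how the listed figure was derived.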
What To Try In 7 Days
Run CHiP-style training on a copy of your MLLM using a small set (~5k) of preference pairs and measure the change in hallucination rate.
Create rejection images with modest corruptions (diffusion or crop) rather than random blanks; validate which works for your data.
Tune segment-weight λ and token-weight γ; authors found λ≈1–3 and γ=0.1 work well.
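A small sweep around the reported values is one way to organize the tuning step. The sketch below only builds the grid and picks the best configuration; `evaluate` is a hypothetical callable you would supply (for example, train with the given weights and return a hallucination rate on a held-out set), and the key names are illustrative.

```python
from itertools import product
from typing import Callable, Dict, List

# Segment weight: the range the authors report as working well.
LAMBDAS = [1.0, 2.0, 3.0]
# Token weight: the reported value; add neighbours to probe sensitivity.
GAMMAS = [0.1]


def build_grid(lambdas: List[float], gammas: List[float]) -> List[Dict[str, float]]:
    """Cartesian product of candidate weights as a list of config dicts."""
    return [{"lambda_segment": lam, "gamma_token": gam}
            for lam, gam in product(lambdas, gammas)]


def select_best(grid: List[Dict[str, float]],
                evaluate: Callable[[Dict[str, float]], float]) -> Dict[str, float]:
    """Return the config with the lowest score (lower = fewer hallucinations)."""
    return min(grid, key=evaluate)


grid = build_grid(LAMBDAS, GAMMAS)
print(len(grid), grid[0])
```

Keeping the grid this small matters when each point is a multi-hour fine-tuning run.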
Risks & Boundaries
Limitations
Requires paired preference data (authors used RLHF-V 5k).
Performance depends on how rejection images are built (diffusion/crop recommended).
When Not To Use
You lack any preference-labeled multimodal data.
You cannot afford extra fine-tuning steps or validation runs.
Failure Modes
May lower some logical-consistency metrics (fA) due to multi-objective trade-offs.
If rejection images differ too much from the originals, the visual preference signal becomes noisy and harms learning.
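One way to control how far a rejection image drifts from the original is a forward-diffusion style corruption with a single strength parameter. The sketch below is an illustration in plain Python on a toy image, not the authors' pipeline: the function names, the `alpha_bar` parameter, and the use of mean squared error as a distance check are all assumptions.

```python
import math
import random
from typing import List

Image = List[List[float]]


def add_diffusion_noise(pixels: Image, alpha_bar: float, seed: int = 0) -> Image:
    """Corrupt an image as sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * eps.

    eps is standard Gaussian noise. alpha_bar = 1.0 returns the image
    unchanged; alpha_bar = 0.0 returns pure noise.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must be in [0, 1]")
    rng = random.Random(seed)
    keep = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [[keep * p + noise * rng.gauss(0.0, 1.0) for p in row]
            for row in pixels]


def mean_squared_error(a: Image, b: Image) -> float:
    """Average squared per-pixel difference between two same-shaped images."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b)
             for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


image = [[0.0, 0.5], [0.5, 1.0]]  # toy 2x2 grayscale image
for alpha_bar in (0.9, 0.5, 0.1):
    corrupted = add_diffusion_noise(image, alpha_bar)
    print(alpha_bar, round(mean_squared_error(image, corrupted), 3))
```

Logging a distance like this alongside downstream results makes it easier to see whether a given corruption level is in the "modest" regime or has drifted into the noisy one described above.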

