CHiP: reduce image-driven hallucinations by learning preferences over images and fine-grained text

January 28, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Jinlan Fu, Shenzhen Huangfu, Hao Fei, Xiaoyu Shen, Bryan Hooi, Xipeng Qiu, See-Kiong Ng

Links

Abstract / PDF

Why It Matters For Business

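On ObjHal, CHiP cuts response-level object hallucination by more than half relative to standard DPO (for example, 11.0% to 4.9% on LLaVA-1.6) after only a few hours of additional fine-tuning, so vision–language products can make fewer incorrect claims without rebuilding models.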

Summary TLDR

CHiP extends Direct Preference Optimization (DPO) to multimodal models by adding two modules: visual preference optimization and hierarchical textual preference optimization (at the response, segment, and token levels). Trained on the RLHF-V dataset (about 5k preference pairs) and tested on four hallucination benchmarks, CHiP substantially lowers object hallucination versus base models and standard DPO, improves image–text representation alignment, and keeps general capabilities largely intact. Training cost is modest (3–5 hours on 4 H100s in the authors' runs). Code and data are released.

Problem Statement

Multimodal LLMs still hallucinate objects or details not present in images because text and image representations are misaligned and coarse response-level preference signals miss which words are wrong. Simply applying DPO to multimodal models does not fix this misalignment or fine-grained errors.

Main Contribution

Identify that multimodal DPO fails to align image and text representations and cannot clearly separate hallucinated vs non-hallucinated text.

Propose CHiP, which combines visual preference optimization with hierarchical textual preference optimization at the response, segment, and token levels.

Show across ObjHal, MMHal, HallusionBench, and AMBER that CHiP reduces hallucinations versus base models and DPO and improves image–text alignment; release code and datasets.
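To make the two preference signals concrete, here is a minimal sketch of the standard DPO loss on sequence log-probabilities, reused for a visual preference term. It assumes the visual term keeps the same response and swaps only the image (original versus rejection image); that reading, the function names, and `beta=0.1` are illustrative assumptions, not the authors' released code.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair of sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def visual_preference_loss(policy_orig_img: float, policy_rej_img: float,
                           ref_orig_img: float, ref_rej_img: float,
                           beta: float = 0.1) -> float:
    """Assumed form: same response, scored under the original image
    (preferred) and under the rejection image (dispreferred)."""
    return dpo_loss(policy_orig_img, policy_rej_img,
                    ref_orig_img, ref_rej_img, beta)


# Textual pair: the policy already prefers the chosen response more than
# the reference does, so the loss falls below log(2).
text_loss = dpo_loss(-1.0, -5.0, -2.0, -2.0)
# Visual pair: log-probability of the same response under each image.
image_loss = visual_preference_loss(-1.0, -1.2, -1.0, -1.0)
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2, which is a handy sanity check at the start of training.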

Key Findings

CHiP reduces object-hallucination rate on ObjHal vs DPO.

Numbers: Muffin R. 13.1 -> 6.2 (52.7% relative drop vs DPO); LLaVA R. 11.0 -> 4.9 (55.5% relative drop vs DPO).

CHiP yields larger reductions versus base (no alignment) models.

Numbers: Muffin base R. 21.5 -> +CHiP 6.2 (≈71% relative reduction); LLaVA base R. 14.1 -> +CHiP 4.9 (≈65% relative reduction).
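The relative reductions quoted above follow directly from the ObjHal R. values. A small sketch of the arithmetic (the helper name is illustrative):

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


# (label, rate before, rate after) taken from the ObjHal R. figures above.
comparisons = [
    ("Muffin, DPO -> CHiP", 13.1, 6.2),
    ("LLaVA, DPO -> CHiP", 11.0, 4.9),
    ("Muffin, base -> CHiP", 21.5, 6.2),
    ("LLaVA, base -> CHiP", 14.1, 4.9),
]

for label, before, after in comparisons:
    print(f"{label}: {relative_reduction(before, after):.1f}%")
# Prints 52.7%, 55.5%, 71.2%, and 65.2% respectively.
```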

CHiP improves image–text representation alignment qualitatively.

Choice of rejection-image construction matters; diffusion-based negatives work best.

Numbers: ObjHal R. with diffusion negatives 4.9 vs 9.4 with black images and 10.9 with random images.
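As a rough illustration of the two ends of this comparison, the sketch below builds a noise-corrupted negative and a black negative from a flat list of pixel values. It assumes pixels scaled to [-1, 1] and uses the textbook forward-diffusion mix of signal and Gaussian noise; the `alpha_bar` parameter and both function names are hypothetical, and the authors' actual noising schedule may differ.

```python
import math
import random


def diffusion_negative(pixels, alpha_bar, rng):
    """Mix the image with Gaussian noise: sqrt(a) * x + sqrt(1 - a) * eps.

    alpha_bar near 1 keeps the image almost intact; near 0 it is mostly noise.
    """
    keep = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [keep * p + noise_scale * rng.gauss(0.0, 1.0) for p in pixels]


def black_negative(pixels):
    """All-zero image of the same size (the 'black' baseline)."""
    return [0.0 for _ in pixels]


rng = random.Random(0)           # seeded for reproducibility
image = [0.2, -0.5, 0.9, 0.0]    # toy 4-pixel "image"
mild = diffusion_negative(image, alpha_bar=0.9, rng=rng)
heavy = diffusion_negative(image, alpha_bar=0.1, rng=rng)
blank = black_negative(image)
```

A mildly noised negative stays close to the original image, which is the "near-image" property the finding above points to; the noise level is worth validating on your own data.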

Training cost is modest in author experiments.

Numbers: LLaVA+CHiP ~3 hours on 4 H100s; Muffin+CHiP ~5 hours on 4 H100s.

Results

ObjHal response-level hallucination (R.)

Value: Muffin +CHiP 6.2%

Baseline: Muffin base 21.5%

ObjHal response-level hallucination (R.)

Value: LLaVA-1.6 +CHiP 4.9%

Baseline: LLaVA-1.6 base 14.1%

ObjHal response-level hallucination (R.) vs DPO

Value: Muffin +CHiP 6.2% (DPO 13.1%)

Baseline: Muffin +DPO 13.1%

ObjHal response-level hallucination (R.) vs DPO

Value: LLaVA +CHiP 4.9% (DPO 11.0%)

Baseline: LLaVA +DPO 11.0%

AMBER Cog (human-cognition hallucination)

Value: Muffin +CHiP 1.5

Baseline: Muffin base 3.5

What To Try In 7 Days

Run CHiP-style training on a copy of your MLLM using a small set (~5k) of preference pairs, then measure the change in hallucination rate against the untuned copy.

Create rejection images with modest corruptions (diffusion or crop) rather than black or random images; validate which works for your data.

Tune segment-weight λ and token-weight γ; authors found λ≈1–3 and γ=0.1 work well.
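One way to organize the λ/γ tuning is a small grid over the reported ranges. The sketch below assumes the hierarchical text objective is a weighted sum of response-, segment-, and token-level terms, with λ on the segment term and γ on the token term; that additive form and all names here are assumptions for illustration, so check the released code for the exact definition.

```python
from itertools import product


def hierarchical_text_loss(response_term: float, segment_term: float,
                           token_term: float, lam: float, gamma: float) -> float:
    """Assumed combination: response + lam * segment + gamma * token."""
    return response_term + lam * segment_term + gamma * token_term


def sweep_grid(lams=(1.0, 2.0, 3.0), gammas=(0.1,)):
    """Candidate settings around the ranges reported to work well."""
    return [{"lambda": lam, "gamma": gamma}
            for lam, gamma in product(lams, gammas)]


# Toy per-level loss values, only to show how the weights interact.
for config in sweep_grid():
    total = hierarchical_text_loss(1.0, 0.5, 2.0,
                                   lam=config["lambda"], gamma=config["gamma"])
    print(config, round(total, 3))
```

Each grid entry would correspond to one fine-tuning run, scored on a held-out hallucination benchmark.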

Optimization Features

Model Optimization

  • preference-based fine-tuning

Training Optimization

  • hierarchical textual preference (response/segment/token)
  • visual preference pairs (near-image negatives)

Reproducibility

Data Urls

  • RLHF-V-Dataset (referenced in paper)

Code Available

  • yes

Data Available

  • yes

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Requires paired preference data (authors used RLHF-V 5k).
  • Performance depends on how rejection images are built (diffusion/crop recommended).
  • Combining multiple objectives can dilute gains when training the visual encoder end-to-end.

When Not To Use

  • You lack any preference-labeled multimodal data.
  • You cannot afford extra fine-tuning steps or validation runs.
  • Your application requires maximal object coverage/recall over conservative omission.

Failure Modes

  • May lower some logical-consistency metrics (fA) due to multi-objective trade-offs.
  • If rejection images are too different, visual preference signal is noisy and harms learning.
  • Jointly training the visual encoder with CHiP can dilute image-text alignment under some settings.

Core Entities

Models

  • LLaVA-1.6 (7B)
  • Muffin (13B)
  • Vicuna-1.5-7B
  • Vicuna-13B

Metrics

  • response-level hallucination (R.)
  • mention-level hallucination (M.)
  • CHAIR
  • Cover
  • Cog
  • Overall (MMHal)
  • qA / fA / aA
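For readers tracking R. and M. on their own data, the sketch below shows one common way to compute them, assuming the usual CHAIR-style definitions: R. is the share of responses with at least one hallucinated object, and M. is the share of object mentions that are hallucinated. Object extraction from the response text is assumed to happen upstream, and the function name is hypothetical.

```python
def hallucination_rates(responses):
    """Return (response_rate, mention_rate) as percentages.

    `responses` is a list of (mentioned_objects, ground_truth_objects) pairs.
    """
    hallucinated_responses = 0
    hallucinated_mentions = 0
    total_mentions = 0
    for mentioned, truth in responses:
        wrong = [obj for obj in mentioned if obj not in truth]
        hallucinated_responses += 1 if wrong else 0
        hallucinated_mentions += len(wrong)
        total_mentions += len(mentioned)
    response_rate = 100.0 * hallucinated_responses / len(responses)
    mention_rate = (100.0 * hallucinated_mentions / total_mentions
                    if total_mentions else 0.0)
    return response_rate, mention_rate


examples = [
    (["dog", "frisbee"], {"dog", "frisbee", "grass"}),
    (["cat", "sofa", "lamp"], {"cat", "sofa"}),   # "lamp" is hallucinated
    (["car"], {"car"}),
    (["bus", "person"], {"person"}),              # "bus" is hallucinated
]
print(hallucination_rates(examples))  # (50.0, 25.0)
```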

Datasets

  • RLHF-V-Dataset (5k)
  • COCO-2017

Benchmarks

  • Object HalBench (ObjHal)
  • MMHal-Bench (MMHal)
  • HallusionBench
  • AMBER