CHiP: reduce image-driven hallucinations by learning preferences over images and fine-grained text

January 28, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Authors demonstrate large hallucination drops on four standard benchmarks with ablations and human checks; results rely on a 5k preference dataset and medium compute (4 H100s).

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Jinlan Fu, Shenzhen Huangfu, Hao Fei, Xiaoyu Shen, Bryan Hooi, Xipeng Qiu, See-Kiong Ng

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CHiP meaningfully lowers image-driven hallucinations with modest additional training, so vision–language products can make fewer incorrect claims without rebuilding their models.

Summary TLDR

CHiP extends Direct Preference Optimization (DPO) to multimodal models by adding two modules: visual preference optimization and hierarchical textual preference optimization (response, segment, token levels). Trained on a 5k RLHF-V dataset and tested on four hallucination benchmarks, CHiP substantially lowers object hallucination versus base models and standard DPO, improves image–text representation alignment, and keeps general capabilities largely intact. Training is modest (3–5 hours on 4 H100s in authors' runs). Code and data are released.

Problem Statement

Multimodal LLMs still hallucinate objects or details not present in images because text and image representations are misaligned and coarse response-level preference signals miss which words are wrong. Simply applying DPO to multimodal models does not fix this misalignment or fine-grained errors.

Main Contribution

Identify that multimodal DPO fails to align image and text representations and cannot clearly separate hallucinated vs non-hallucinated text.

Propose CHiP: combines visual preference optimization with hierarchical textual preference optimization at response, segment, and token levels.
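A minimal sketch of how these pieces could be combined, assuming the four preference terms are simply added, with the segment and token terms scaled by weights (called λ and γ elsewhere in this summary). The function names, the additive form, and the default `beta` are illustrative assumptions, not the authors' implementation. Only `dpo_loss` follows the standard, published DPO formulation.

```python
import math


def dpo_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Each log-ratio is log pi_theta(y | x) - log pi_ref(y | x).
    The loss is -log(sigmoid(beta * (chosen - rejected))).
    beta=0.1 is an assumed default, not a value taken from the paper.
    """
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def chip_style_loss(visual, response, segment, token,
                    segment_weight=1.0, token_weight=0.1):
    """Hypothetical combination of the four preference terms.

    visual:   preference loss over (preferred image, rejected image)
    response: response-level textual preference loss
    segment:  segment-level textual preference loss
    token:    token-level textual preference loss
    The additive form is an assumption made for illustration.
    """
    return visual + response + segment_weight * segment + token_weight * token
```

With equal log-ratios for the chosen and rejected responses, `dpo_loss` returns log 2, and the value shrinks as the chosen response becomes more likely than the rejected one relative to the reference model.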

Key Findings

CHiP reduces object-hallucination rate on ObjHal vs DPO.

Numbers: Muffin R. 13.1 -> 6.2 (52.7% relative drop vs DPO); LLaVA R. 11.0 -> 4.9 (55.5% relative drop vs DPO).

Practical Use: If you already use DPO, add CHiP to cut object hallucination roughly in half on ObjHal-like tasks.

Evidence Ref: Table 1 (ObjHal rows for Muffin and LLaVA; comparisons +DPO vs +CHiP)

CHiP yields larger reductions versus base (no alignment) models.

Numbers: Muffin base R. 21.5 -> +CHiP 6.2 (≈71% relative reduction); LLaVA base R. 14.1 -> +CHiP 4.9 (≈65% relative reduction).

Practical Use: Applying CHiP to off-the-shelf MLLMs substantially lowers hallucinations without replacing the model.

Evidence Ref: Table 1 (base vs +CHiP rows under ObjHal)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ObjHal response-level hallucination (R.) | Muffin +CHiP 6.2% | Muffin base 21.5% | -71.2% relative vs base | ObjHal (300 instances) | Table 1, Muffin rows | Table 1
ObjHal response-level hallucination (R.) | LLaVA-1.6 +CHiP 4.9% | LLaVA-1.6 base 14.1% | -65.3% relative vs base | ObjHal | Table 1, LLaVA rows | Table 1
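The relative deltas quoted in this summary follow from the response-level rates by simple arithmetic. The helper below is an illustrative sketch (the function name is hypothetical) that recomputes them from the rounded rates listed above.

```python
def relative_reduction(baseline, new):
    """Percentage drop from `baseline` to `new`, relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - new) / baseline


# Response-level hallucination rates (R., in %) as quoted in this summary.
comparisons = {
    "Muffin: +DPO -> +CHiP": (13.1, 6.2),
    "LLaVA: +DPO -> +CHiP": (11.0, 4.9),
    "Muffin: base -> +CHiP": (21.5, 6.2),
    "LLaVA: base -> +CHiP": (14.1, 4.9),
}

for label, (before, after) in comparisons.items():
    print(f"{label}: {relative_reduction(before, after):.1f}% relative reduction")
```

Recomputed from the rounded inputs, the LLaVA base comparison comes out near 65.2%, slightly below the 65.3% listed; the small gap may come from rounding in the reported rates.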

What To Try In 7 Days

Run CHiP-style training on a copy of your MLLM using a small set (~5k) of preference pairs to see hallucination change.

Create rejection images with modest corruptions (diffusion or crop) rather than random blanks; validate which works for your data.

Tune segment-weight λ and token-weight γ; authors found λ≈1–3 and γ=0.1 work well.
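A small sketch of a sweep over the two weights, assuming a grid that covers the reported λ range with γ fixed at 0.1. The config keys and the `build_sweep` helper are hypothetical and would need to be mapped onto whatever training script you use.

```python
from itertools import product

# Candidate values based on the ranges quoted above (lambda ≈ 1–3, gamma = 0.1).
SEGMENT_WEIGHTS = [1.0, 2.0, 3.0]  # lambda: weight on the segment-level term
TOKEN_WEIGHTS = [0.1]              # gamma: weight on the token-level term


def build_sweep(segment_weights=SEGMENT_WEIGHTS, token_weights=TOKEN_WEIGHTS):
    """Return one config dict per (lambda, gamma) combination."""
    return [
        {"segment_weight": lam, "token_weight": gam}
        for lam, gam in product(segment_weights, token_weights)
    ]


for config in build_sweep():
    print(config)
```

Each config would correspond to one short fine-tuning run, evaluated on the same hallucination benchmark so the runs are comparable.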

Optimization Features

Model Optimization
preference-based fine-tuning
Training Optimization
hierarchical textual preference (response/segment/token)
visual preference pairs (near-image negatives)

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

RLHF-V-Dataset (referenced in paper)

Risks & Boundaries

Limitations

Requires paired preference data (authors used RLHF-V 5k).

Performance depends on how rejection images are built (diffusion/crop recommended).
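As an illustration of the crop option, the sketch below builds a rejection image by taking a random crop that keeps most of the original. It is library-agnostic: the image is a plain list of pixel rows, and the `keep_frac` default of 0.8 is an assumption chosen only to keep the negative close to the original, not a value from the paper. In practice the same idea would be applied with an image library.

```python
import random


def random_crop(pixels, keep_frac=0.8, seed=None):
    """Return a random rectangular crop keeping `keep_frac` of each side.

    `pixels` is a list of rows, each row a list of pixel values.
    A crop that keeps most of the image stays close to the original,
    which is the property wanted for a near-image negative.
    """
    if not 0.0 < keep_frac <= 1.0:
        raise ValueError("keep_frac must be in (0, 1]")
    rng = random.Random(seed)
    height, width = len(pixels), len(pixels[0])
    crop_h = max(1, int(height * keep_frac))
    crop_w = max(1, int(width * keep_frac))
    # Pick the top-left corner so the crop fits inside the image.
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in pixels[top:top + crop_h]]


# Toy 10x10 "image" whose pixel value encodes its position.
image = [[r * 10 + c for c in range(10)] for r in range(10)]
rejected = random_crop(image, keep_frac=0.8, seed=0)
print(len(rejected), len(rejected[0]))
```

Whether crop or diffusion-based corruption works better should be validated on your own data.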

When Not To Use

You lack any preference-labeled multimodal data.

You cannot afford extra fine-tuning steps or validation runs.

Failure Modes

May lower some logical-consistency metrics (fA) due to multi-objective trade-offs.

If rejection images are too different, visual preference signal is noisy and harms learning.

Core Entities

Models

LLaVA-1.6 (7B), Muffin (13B), Vicuna-1.5-7B, Vicuna-13B

Metrics

response-level hallucination (R.), mention-level hallucination (M.), CHAIR, Cover, Cog, Overall (MMHal), qA / fA / aA

Datasets

RLHF-V-Dataset (5k), COCO-2017

Benchmarks

Object HalBench (ObjHal), MMHal-Bench (MMHal), HallusionBench, AMBER