CHiP: reduce image-driven hallucinations by learning preferences over images and fine-grained text

Overview

Decision Snapshot: Ready For Pilot

Authors demonstrate large hallucination drops on four standard benchmarks with ablations and human checks; results rely on a 5k preference dataset and medium compute (4 H100s).

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Jinlan Fu, Shenzhen Huangfu, Hao Fei, Xiaoyu Shen, Bryan Hooi, Xipeng Qiu, See-Kiong Ng

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CHiP meaningfully lowers image-driven hallucinations with modest additional training, so vision–language products can make fewer incorrect claims without replacing the underlying model.

Who Should Care

ML Engineer, Product Manager, CTO, Data Scientist, Founder

Summary TLDR

CHiP extends Direct Preference Optimization (DPO) to multimodal models by adding two modules: visual preference optimization and hierarchical textual preference optimization (response, segment, token levels). Trained on a 5k RLHF-V dataset and tested on four hallucination benchmarks, CHiP substantially lowers object hallucination versus base models and standard DPO, improves image–text representation alignment, and keeps general capabilities largely intact. Training is modest (3–5 hours on 4 H100s in authors' runs). Code and data are released.

Problem Statement

Multimodal LLMs still hallucinate objects or details not present in images because text and image representations are misaligned and coarse response-level preference signals miss which words are wrong. Simply applying DPO to multimodal models does not fix this misalignment or fine-grained errors.

Main Contribution

Identify that multimodal DPO fails to align image and text representations and cannot clearly separate hallucinated vs non-hallucinated text.

Propose CHiP, which combines visual preference optimization with hierarchical textual preference optimization at the response, segment, and token levels.
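To make the combination of objectives concrete, here is a minimal sketch, not the authors' implementation. The `dpo_loss` function is the standard DPO loss on sequence log-probabilities. Everything else is an assumption made for illustration: the function names, `beta=0.1`, the additive form of `chip_style_loss`, and the example log-probabilities. The segment-level and token-level terms are passed in as already-computed numbers because this summary does not give their formulas. The visual term reuses `dpo_loss` by scoring the same chosen response under the original image versus a corrupted one.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_w: float, ref_w: float, pol_l: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) pair.

    pol_* are sequence log-probs under the policy being trained,
    ref_* are the same log-probs under the frozen reference model.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)


def chip_style_loss(response: float, segment: float, token: float,
                    visual: float, lam: float = 1.0,
                    gamma: float = 0.1) -> float:
    """ASSUMED additive combination of the four terms.

    lam weights the segment-level term and gamma the token-level term.
    """
    return response + lam * segment + gamma * token + visual


# Response-level term: chosen vs rejected response, same image.
resp = dpo_loss(pol_w=-12.0, ref_w=-14.0, pol_l=-15.0, ref_l=-13.0)

# Visual term: the SAME chosen response, scored with the original image
# (preferred) vs a corrupted rejection image (rejected).
vis = dpo_loss(pol_w=-12.0, ref_w=-14.0, pol_l=-20.0, ref_l=-16.0)

# Placeholder values stand in for the segment- and token-level terms.
total = chip_style_loss(resp, segment=0.6, token=0.2, visual=vis,
                        lam=1.0, gamma=0.1)
print(f"response={resp:.3f} visual={vis:.3f} total={total:.3f}")
```

In a real training loop the log-probabilities would come from the policy and reference MLLM forward passes rather than hand-written constants.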

Key Findings

CHiP reduces object-hallucination rate on ObjHal vs DPO.

Numbers: Muffin R. 13.1 -> 6.2 (52.7% relative drop vs DPO); LLaVA R. 11.0 -> 4.9 (55.5% relative drop vs DPO).

Practical Use: If you already use DPO, add CHiP to cut object hallucination roughly in half on ObjHal-like tasks.

Evidence Ref: Table 1 (ObjHal rows for Muffin and LLaVA; comparisons +DPO vs +CHiP)

CHiP yields larger reductions versus base (no alignment) models.

Numbers: Muffin base R. 21.5 -> +CHiP 6.2 (≈71% relative reduction); LLaVA base R. 14.1 -> +CHiP 4.9 (≈65% relative reduction).

Practical Use: Applying CHiP to off-the-shelf MLLMs substantially lowers hallucinations without wholesale model replacement.

Evidence Ref: Table 1 (base vs +CHiP rows under ObjHal)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ObjHal response-level hallucination (R.)	Muffin +CHiP 6.2%	Muffin base 21.5%	-71.2% relative vs base	ObjHal (300 instances)	Table 1, Muffin rows	Table 1
ObjHal response-level hallucination (R.)	LLaVA-1.6 +CHiP 4.9%	LLaVA-1.6 base 14.1%	-65.2% relative vs base	ObjHal	Table 1, LLaVA rows	Table 1
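The relative deltas quoted in the findings and the table can be recomputed from the reported hallucination rates. The short sketch below does that arithmetic; the function name `relative_reduction` and the row labels are illustrative. Because the inputs are the rounded rates shown on this page, the last digit can differ slightly from a delta computed on unrounded values.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Percent drop of `new` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - new) / baseline


# (before, after) response-level hallucination rates on ObjHal, in percent.
rows = {
    "Muffin, +DPO -> +CHiP": (13.1, 6.2),
    "LLaVA, +DPO -> +CHiP": (11.0, 4.9),
    "Muffin, base -> +CHiP": (21.5, 6.2),
    "LLaVA, base -> +CHiP": (14.1, 4.9),
}

for name, (before, after) in rows.items():
    print(f"{name}: {relative_reduction(before, after):.1f}% relative drop")
```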

What To Try In 7 Days

Run CHiP-style training on a copy of your MLLM using a small set (~5k) of preference pairs to see hallucination change.

Create rejection images with modest corruptions (diffusion or crop) rather than random blanks; validate which works for your data.

Tune segment-weight λ and token-weight γ; authors found λ≈1–3 and γ=0.1 work well.
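For the rejection-image item above, the following is a rough sketch of the two corruption styles mentioned (diffusion-style noise and cropping), written in plain Python on a nested list of pixel values so it runs without image libraries. It is not the authors' pipeline. The function names, the `keep` parameter (which plays the role of the cumulative signal fraction in the standard forward-diffusion formula x_t = sqrt(keep) * x_0 + sqrt(1 - keep) * noise), and the seeds are all assumptions for illustration.

```python
import math
import random


def noisy_copy(pixels, keep=0.7, seed=0):
    """Blend an image with Gaussian noise, forward-diffusion style.

    pixels: 2D list of floats. keep in (0, 1]: a smaller value means more
    noise, i.e. a rejection image that is further from the original.
    """
    if not 0.0 < keep <= 1.0:
        raise ValueError("keep must be in (0, 1]")
    rng = random.Random(seed)
    a, b = math.sqrt(keep), math.sqrt(1.0 - keep)
    return [[a * p + b * rng.gauss(0.0, 1.0) for p in row] for row in pixels]


def random_crop(pixels, crop_h, crop_w, seed=0):
    """Return a random crop_h x crop_w window of the image."""
    h, w = len(pixels), len(pixels[0])
    if not (0 < crop_h <= h and 0 < crop_w <= w):
        raise ValueError("crop size must fit inside the image")
    rng = random.Random(seed)
    top = rng.randint(0, h - crop_h)
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in pixels[top:top + crop_h]]


# Toy 4x6 "image"; a real pipeline would load pixel arrays instead.
image = [[0.1 * (r + c) for c in range(6)] for r in range(4)]
rejected_noise = noisy_copy(image, keep=0.7, seed=42)
rejected_crop = random_crop(image, crop_h=3, crop_w=4, seed=42)
print(len(rejected_noise), len(rejected_noise[0]))
print(len(rejected_crop), len(rejected_crop[0]))
```

Sweeping `keep` (or the crop size) on a held-out set is one way to validate which corruption strength works for your data, in line with the caveat that rejection images which differ too much give a noisy signal.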

Optimization Features

Model Optimization

preference-based fine-tuning

Training Optimization

hierarchical textual preference (response/segment/token)

visual preference pairs (near-image negatives)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/LVUGAI/CHiP

Data URLs

RLHF-V-Dataset (referenced in paper)

Risks & Boundaries

Limitations

Requires paired preference data (authors used RLHF-V 5k).

Performance depends on how rejection images are built (diffusion/crop recommended).

When Not To Use

You lack any preference-labeled multimodal data.

You cannot afford extra fine-tuning steps or validation runs.

Failure Modes

May lower some logical-consistency metrics (fA) due to multi-objective trade-offs.

If rejection images are too different, visual preference signal is noisy and harms learning.

Core Entities

Models

LLaVA-1.6 (7B), Muffin (13B), Vicuna-1.5-7B, Vicuna-13B

Metrics

response-level hallucination (R.), mention-level hallucination (M.), CHAIR, Cover, Cog, Overall (MMHal), qA / fA / aA

Datasets

RLHF-V-Dataset (5k), COCO-2017

Benchmarks

Object HalBench (ObjHal), MMHal-Bench (MMHal), HallusionBench, AMBER

