Reduce multimodal model hallucinations by learning from segment-level human corrections

December 1, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper shows consistent hallucination reductions across several benchmarks and in human evaluation, and makes an efficiency case (1.4k preferences, under 1 hour of training), but larger-scale effects and long-tail failure modes need further testing.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, Tat-Seng Chua

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RLHF-V makes multimodal models more trustworthy with far less labeled data and short retrain time, lowering risk when deploying vision-language assistants in customer-facing or safety-critical products.

Who Should Care

Summary TLDR

RLHF-V teaches multimodal LLMs to avoid image-based hallucinations by collecting fine-grained, segment-level human corrections and optimizing the model directly on those corrections via Dense Direct Preference Optimization (DDPO). With only 1.4k corrected samples, RLHF-V reduces the base model's hallucination rate by 34.8% on a human-eval benchmark and matches or beats other open-source MLLMs while keeping helpfulness. The method is data-efficient, fast to train, and open-sourced.

Problem Statement

Multimodal LLMs often produce confident but wrong text about images (hallucinations). Coarse ranking feedback used in standard RLHF is ambiguous and data-hungry. The authors propose collecting fine-grained segment corrections and a direct optimization method to efficiently teach models what to change and what to keep.

Main Contribution

Collect a fine-grained human preference dataset of segment-level corrections for hallucinated output (1.4k prompts annotated).

Introduce DDPO (Dense Direct Preference Optimization), a DPO variant that weights corrected segments to exploit dense feedback.
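To make the weighting idea concrete, here is a minimal sketch of a segment-weighted sequence score plugged into a DPO-style loss. It assumes per-token log-probabilities are already computed and that each token carries a flag saying whether it lies in a human-corrected segment. The function names, the normalization by weighted length, and `beta=0.1` are illustrative assumptions rather than the authors' released code; `gamma=5` is the value reported for the paper's runs.

```python
import math
from typing import Sequence


def weighted_logprob(token_logps: Sequence[float],
                     corrected: Sequence[bool],
                     gamma: float = 5.0) -> float:
    """Segment-weighted sequence score.

    Tokens inside corrected segments count gamma times; the sum is
    normalized by the weighted length (an assumption of this sketch).
    """
    if len(token_logps) != len(corrected):
        raise ValueError("token_logps and corrected must have equal length")
    if not token_logps:
        raise ValueError("empty sequence")
    weights = [gamma if c else 1.0 for c in corrected]
    total = sum(w * lp for w, lp in zip(weights, token_logps))
    return total / sum(weights)


def dpo_style_loss(policy_w: float, policy_l: float,
                   ref_w: float, ref_l: float,
                   beta: float = 0.1) -> float:
    """-log sigmoid(beta * ((policy_w - ref_w) - (policy_l - ref_l)))."""
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    # softplus(-margin) equals -log(sigmoid(margin)), computed stably
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy numbers: "w" is the human-corrected response, "l" the original one.
mask = [False, False, True, True]  # last two tokens sit in the edited segment
pol_w = weighted_logprob([-0.2, -0.3, -0.5, -0.4], mask)
pol_l = weighted_logprob([-0.2, -0.3, -2.5, -2.0], mask)
ref_w = weighted_logprob([-0.2, -0.3, -1.5, -1.2], mask)
ref_l = weighted_logprob([-0.2, -0.3, -1.0, -0.9], mask)
loss = dpo_style_loss(pol_w, pol_l, ref_w, ref_l)
```

Because the unchanged tokens are shared by both responses, up-weighting the edited tokens concentrates the preference signal on exactly the text the annotator changed.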

Key Findings

Fine-grained corrections cut hallucinations on a human-eval benchmark

Numbers: 34.8% reduction on MHumanEval (object hallucination, 1.4k prefs)

Practical Use: Collecting segment-level corrections yields large trust gains with small labeled sets; prefer corrections over coarse rankings when you can afford annotation.

Evidence Ref: Abstract; Section 4 main text; Figure 2

RLHF-V needs much less preference data than a concurrent RLHF approach

Numbers: Outperforms LLaVA-RLHF trained on 10k prefs using only 1.4k prefs

Practical Use: Invest in higher-quality, denser feedback instead of scaling coarse ranking labels to save annotation cost.

Evidence Ref: Abstract; Section 4.3

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Object HalBench (response-level hallucination) | 12.2% | Muffin 50.5% | -38.3 pp | Object HalBench (augmented prompts) | Table 1: RLHF-V 12.2 vs Muffin 50.5 | Table 1 |
| Object HalBench (mention-level hallucination) | 7.5% | Muffin 24.5% | -17.0 pp | Object HalBench | Table 1 mention-level: RLHF-V 7.5 vs Muffin 24.5 | Table 1 |
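The Delta column is an absolute difference in percentage points, which is not the same as a relative reduction. The short sketch below computes both from the Table 1 values quoted above; the helper names are illustrative.

```python
def delta_pp(value: float, baseline: float) -> float:
    """Absolute change in percentage points (negative means fewer hallucinations)."""
    return round(value - baseline, 1)


def relative_reduction(value: float, baseline: float) -> float:
    """Reduction expressed as a percentage of the baseline rate."""
    return round(100.0 * (baseline - value) / baseline, 1)


rows = {
    "response-level": (12.2, 50.5),  # (RLHF-V, Muffin)
    "mention-level": (7.5, 24.5),
}
for name, (value, baseline) in rows.items():
    print(f"{name}: {delta_pp(value, baseline)} pp, "
          f"{relative_reduction(value, baseline)}% relative")
```

From the quoted values this gives -38.3 pp (about 75.8% relative) for response-level and -17.0 pp (about 69.4% relative) for mention-level hallucination.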

What To Try In 7 Days

Collect a small set (~200–1.4k) of segment-level corrections on your worst hallucination cases.

Apply DDPO-style weighted fine-tuning (highlight corrected segments) on your current multimodal model.

Fine-tune on a trusted VQA dataset (e.g., VQAv2) to calibrate hallucination-prone behaviors before large-scale deployment.
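For the first two steps, one possible way to turn a pair (original response, human-corrected response) into per-word flags is a word-level diff. The sketch below uses Python's standard `difflib`; whitespace tokenization and the name `corrected_mask` are simplifying assumptions, and a real pipeline would align on the model's own tokenizer.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def corrected_mask(original: str, corrected: str) -> Tuple[List[str], List[bool]]:
    """Return the corrected response's words plus one flag per word,
    True where the annotator replaced or inserted text."""
    src, dst = original.split(), corrected.split()
    flags = [False] * len(dst)
    matcher = SequenceMatcher(a=src, b=dst, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # Pure deletions have j1 == j2, so they leave no word to flag.
        if tag in ("replace", "insert"):
            for j in range(j1, j2):
                flags[j] = True
    return dst, flags


words, flags = corrected_mask(
    "A man holds a red umbrella near a dog",
    "A man holds a black umbrella near a bench",
)
changed = [w for w, f in zip(words, flags) if f]
print(changed)  # → ['black', 'bench']
```

The resulting flags are the kind of per-token marker a segment-weighted objective needs; how deletions should be represented is left open here.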

Optimization Features

Infra Optimization
Fast training: <1 hour on 8 A100s for reported runs
Training Optimization
DDPO uses segment-weighted likelihood with γ=5
Fine-tune on VQAv2 to counter noisy pretraining text

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Preference data is small (1.4k); broader coverage of scenes and languages is untested.

Method relies on human correction quality; noisy or inconsistent corrections would hurt results.

When Not To Use

When you need a model to produce richer, highly detailed descriptions beyond the base model's capacity (distillation may worsen hallucinations).

If you cannot supply reliable segment-level human corrections for your domain.

Failure Modes

Over-correction: the model may omit plausible details to avoid hallucination.

Distribution shift: corrections collected on one dataset may not generalize to other visual domains.

Core Entities

Models

RLHF-V (this paper), Muffin (base), LLaVA-RLHF, LLaVA, InstructBLIP, Qwen-VL-Chat, GPT-4V

Metrics

response-level hallucination rate, mention-level hallucination rate, informativeness (GPT-4 score), Accuracy
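As a reading aid, the two hallucination rates can be sketched as follows, assuming CHAIR-style definitions: response-level is the share of responses that mention at least one object not present in the image, and mention-level is the share of all object mentions that are not present. Object extraction from free text is assumed to have happened upstream; the function name and the toy data are illustrative.

```python
from typing import List, Set, Tuple


def hallucination_rates(mentions: List[List[str]],
                        present: List[Set[str]]) -> Tuple[float, float]:
    """Return (response_level, mention_level) as fractions in [0, 1]."""
    if len(mentions) != len(present):
        raise ValueError("mentions and present must have equal length")
    bad_responses = 0
    bad_mentions = 0
    total_mentions = 0
    for objs, truth in zip(mentions, present):
        wrong = [o for o in objs if o not in truth]
        bad_responses += 1 if wrong else 0
        bad_mentions += len(wrong)
        total_mentions += len(objs)
    response_level = bad_responses / len(mentions) if mentions else 0.0
    mention_level = bad_mentions / total_mentions if total_mentions else 0.0
    return response_level, mention_level


# Toy data: three responses, one of which mentions a lamp that is not in the image.
mentions = [["dog", "frisbee"], ["cat", "sofa", "lamp"], ["car"]]
present = [{"dog", "frisbee", "grass"}, {"cat", "sofa"}, {"car", "road"}]
resp_rate, ment_rate = hallucination_rates(mentions, present)
```

Tracking both rates alongside a coverage or informativeness score helps spot over-correction, where hallucinations fall only because the model says less.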

Datasets

Object HalBench, MMHal-Bench, MHumanEval (constructed), VQAv2

Benchmarks

Object HalBench, MMHal-Bench, MHumanEval, LLaVA Bench, VQAv2

Context Entities

Models

OmniLMM-12B (applied with RLHF-V pipeline)

Datasets

COCO (used for Object HalBench sampling)