Reduce multimodal model hallucinations by learning from segment-level human corrections

Overview

Decision Snapshot: Ready For Pilot

The paper shows consistent hallucination drops across several benchmarks, backed by human labels and an efficiency story (1.4k preferences, <1 hour of training), but larger-scale effects and long-tail failure modes need further testing.

Citations: 5

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, Tat-Seng Chua

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RLHF-V makes multimodal models more trustworthy with far less labeled data and short retrain time, lowering risk when deploying vision-language assistants in customer-facing or safety-critical products.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

RLHF-V teaches multimodal LLMs to avoid image-based hallucinations by collecting fine-grained, segment-level human corrections and optimizing the model directly on those corrections via Dense Direct Preference Optimization (DDPO). With only 1.4k corrected samples, RLHF-V sharply cuts hallucination rates (a 34.8% reduction on a human-eval benchmark) and matches or beats other open-source MLLMs while preserving helpfulness. The method is data-efficient, fast to train, and open-sourced.

Problem Statement

Multimodal LLMs often produce confident but wrong text about images (hallucinations). Coarse ranking feedback used in standard RLHF is ambiguous and data-hungry. The authors propose collecting fine-grained segment corrections and a direct optimization method to efficiently teach models what to change and what to keep.

Main Contribution

Collect a fine-grained human preference dataset of segment-level corrections for hallucinated output (1.4k prompts annotated).

Introduce DDPO (Dense Direct Preference Optimization), a DPO variant that weights corrected segments to exploit dense feedback.
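A minimal sketch of how segment weighting could enter a DPO-style loss, assuming per-token log-probabilities are already available from the policy and from a frozen reference model. The function names, the normalization by total weight, and beta=0.5 are illustrative assumptions rather than details stated in this summary; gamma=5 mirrors the value reported under Training Optimization.

```python
import math
from typing import Sequence


def weighted_sequence_score(
    token_logps: Sequence[float],
    corrected_mask: Sequence[bool],
    gamma: float = 5.0,
) -> float:
    """Segment-weighted, normalized log-likelihood of one response.

    Tokens inside human-corrected segments count gamma times, all other
    tokens count once. Dividing by the total weight is an assumption.
    """
    if len(token_logps) != len(corrected_mask):
        raise ValueError("token_logps and corrected_mask must have equal length")
    if not token_logps:
        raise ValueError("response must contain at least one token")
    weights = [gamma if is_corrected else 1.0 for is_corrected in corrected_mask]
    weighted_sum = sum(w * lp for w, lp in zip(weights, token_logps))
    return weighted_sum / sum(weights)


def ddpo_style_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.5,  # assumed value, not taken from this summary
) -> float:
    """Standard DPO loss, -log(sigmoid(beta * margin)), on weighted scores."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy usage: the corrected response is "chosen", the hallucinated one "rejected".
chosen = weighted_sequence_score([-0.2, -0.1, -0.3], [False, True, False])
rejected = weighted_sequence_score([-0.2, -0.9, -0.3], [False, True, False])
loss = ddpo_style_loss(chosen, rejected, ref_chosen=-0.3, ref_rejected=-0.3)
```

Because only the corrected segment differs between the two responses, up-weighting it concentrates the preference margin on the tokens the annotator actually changed.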

Key Findings

Fine-grained corrections cut hallucinations on a human-eval benchmark

Numbers: 34.8% reduction on MHumanEval (object hallucination, 1.4k prefs)

Practical Use: Collecting segment-level corrections yields large trust gains with small labeled sets; prefer corrections over coarse rankings when you can afford annotation.

Evidence Ref: Abstract; Section 4 main text; Figure 2

RLHF-V needs much less preference data than a concurrent RLHF approach

Numbers: Outperforms LLaVA-RLHF trained on 10k prefs using only 1.4k prefs

Practical Use: Invest in higher-quality, denser feedback instead of scaling coarse ranking labels to save annotation cost.

Evidence Ref: Abstract; Section 4.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Object HalBench (response-level hallucination)	12.2%	Muffin 50.5%	-38.3 pp	Object HalBench (augmented prompts)	Table 1: RLHF-V 12.2 vs Muffin 50.5	Table 1
Object HalBench (mention-level hallucination)	7.5%	Muffin 24.5%	-17.0 pp	Object HalBench	Table 1 mention-level: RLHF-V 7.5 vs Muffin 24.5	Table 1
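The Delta column is the value minus the baseline, in percentage points. The small helper below (a hypothetical name, written only for illustration) makes that arithmetic explicit and also derives the relative reduction, which the table does not report.

```python
from typing import Tuple


def hallucination_delta(value_pct: float, baseline_pct: float) -> Tuple[float, float]:
    """Return (absolute change in percentage points, relative reduction in %)."""
    if baseline_pct <= 0:
        raise ValueError("baseline_pct must be positive")
    absolute_pp = value_pct - baseline_pct
    relative_reduction = (baseline_pct - value_pct) / baseline_pct * 100.0
    return round(absolute_pp, 1), round(relative_reduction, 1)


# (RLHF-V value, Muffin baseline) as listed in the table above
rows = {
    "response-level": (12.2, 50.5),
    "mention-level": (7.5, 24.5),
}
for name, (value, baseline) in rows.items():
    pp, rel = hallucination_delta(value, baseline)
    print(f"{name}: {pp:+.1f} pp, {rel:.1f}% relative reduction")
```

Reporting both figures avoids confusion between a drop in percentage points and a percentage drop, which can differ widely when the baseline is far from 100%.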

What To Try In 7 Days

Collect a small set (~200–1.4k) of segment-level corrections on your worst hallucination cases.

Apply DDPO-style weighted fine-tuning (highlight corrected segments) on your current multimodal model.

Fine-tune on a trusted VQA dataset (e.g., VQAv2) to calibrate hallucination-prone behaviors before large-scale deployment.
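For the first two steps, one way to turn an annotator's corrected response into a per-token "corrected segment" mask is a plain sequence diff. This is a sketch under stated assumptions: whitespace tokenization stands in for the model's real tokenizer, the function name is hypothetical, and the example sentences are invented for illustration.

```python
import difflib
from typing import List, Tuple


def corrected_token_mask(original: str, corrected: str) -> Tuple[List[str], List[bool]]:
    """Return the corrected response's tokens and a mask that is True for
    tokens the annotator replaced or inserted relative to the original."""
    original_tokens = original.split()
    corrected_tokens = corrected.split()
    matcher = difflib.SequenceMatcher(
        a=original_tokens, b=corrected_tokens, autojunk=False
    )
    mask = [False] * len(corrected_tokens)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "delete" spans have no tokens on the corrected side, so only
        # replaced and inserted spans can be marked here.
        if tag in ("replace", "insert"):
            for j in range(j1, j2):
                mask[j] = True
    return corrected_tokens, mask


tokens, mask = corrected_token_mask(
    "A man holding a red umbrella on the beach",
    "A man holding a blue surfboard on the beach",
)
changed = [tok for tok, flagged in zip(tokens, mask) if flagged]
```

The resulting mask can feed a segment-weighted objective directly; in a real pipeline the mask would need to be aligned to subword tokens rather than whitespace-separated words.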

Optimization Features

Infra Optimization

Fast training: <1 hour on 8 A100s for reported runs

Training Optimization

DDPO uses segment-weighted likelihood with γ=5

Fine-tune on VQAv2 to counter noisy pretraining text

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/RLHF-V/RLHF-V

Data URLs

https://github.com/RLHF-V/RLHF-V

Risks & Boundaries

Limitations

Preference data is small (1.4k); broader coverage of scenes and languages is untested.

Method relies on human correction quality; noisy or inconsistent corrections would hurt results.

When Not To Use

When you need a model to produce richer, highly detailed descriptions beyond the base model capacity (distillation may worsen hallucinations).

If you cannot supply reliable segment-level human corrections for your domain.

Failure Modes

Over-correction: the model may omit plausible details to avoid hallucination.

Distribution shift: corrections collected on one dataset may not generalize to other visual domains.

Core Entities

Models

RLHF-V (this paper), Muffin (base), LLaVA-RLHF, LLaVA, InstructBLIP, Qwen-VL-Chat, GPT-4V

Metrics

response-level hallucination rate, mention-level hallucination rate, informativeness (GPT-4 score), accuracy
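The two hallucination rates differ in what they count. The sketch below assumes the CHAIR-style definitions commonly used for object hallucination (this summary does not spell them out): the response-level rate is the share of responses containing at least one object not in the image, and the mention-level rate is the share of all object mentions that are not in the image. The function name and sample data are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def hallucination_rates(
    samples: Iterable[Tuple[List[str], Set[str]]],
) -> Tuple[float, float]:
    """samples: (objects mentioned in a response, ground-truth objects).

    Returns (response_level_rate, mention_level_rate) as fractions in [0, 1].
    """
    n_responses = 0
    n_hallucinated_responses = 0
    n_mentions = 0
    n_hallucinated_mentions = 0
    for mentioned, ground_truth in samples:
        n_responses += 1
        hallucinated = [obj for obj in mentioned if obj not in ground_truth]
        n_mentions += len(mentioned)
        n_hallucinated_mentions += len(hallucinated)
        if hallucinated:
            n_hallucinated_responses += 1
    response_rate = n_hallucinated_responses / n_responses if n_responses else 0.0
    mention_rate = n_hallucinated_mentions / n_mentions if n_mentions else 0.0
    return response_rate, mention_rate


# Invented example: only the second response mentions an absent object ("lamp").
example = [
    (["dog", "frisbee"], {"dog", "frisbee", "grass"}),
    (["cat", "sofa", "lamp"], {"cat", "sofa"}),
    (["car"], {"car"}),
]
response_rate, mention_rate = hallucination_rates(example)
```

Under these definitions the response-level rate is always at least as sensitive to single errors as the mention-level rate, which is why the two columns in a results table can diverge.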

Datasets

Object HalBench, MMHal-Bench, MHumanEval (constructed), VQAv2

Benchmarks

Object HalBench, MMHal-Bench, MHumanEval, LLaVA Bench, VQAv2

Context Entities

Models

OmniLMM-12B (applied with RLHF-V pipeline)

Datasets

COCO (used for Object HalBench sampling)
