Reduce multimodal model hallucinations by learning from segment-level human corrections

December 1, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

5

Authors

Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, Tat-Seng Chua


Why It Matters For Business

RLHF-V makes multimodal models more trustworthy with far less labeled data and a short retraining time, lowering risk when deploying vision-language assistants in customer-facing or safety-critical products.

Summary TLDR

RLHF-V teaches multimodal LLMs to avoid image-based hallucinations by collecting fine-grained, segment-level human corrections and optimizing the model directly on those corrections via Dense Direct Preference Optimization (DDPO). With only 1.4k corrected samples, RLHF-V cuts the hallucination rate by 34.8% on a human-evaluated benchmark and matches or beats other open-source MLLMs while preserving helpfulness. The method is data-efficient, fast to train, and open-sourced.

Problem Statement

Multimodal LLMs often produce confident but wrong text about images (hallucinations). Coarse ranking feedback used in standard RLHF is ambiguous and data-hungry. The authors propose collecting fine-grained segment corrections and a direct optimization method to efficiently teach models what to change and what to keep.

Main Contribution

Collect a fine-grained human preference dataset of segment-level corrections for hallucinated output (1.4k prompts annotated).

Introduce DDPO (Dense Direct Preference Optimization), a DPO variant that weights corrected segments to exploit dense feedback.

Show significant hallucination reduction and robustness across multiple multimodal benchmarks with small annotation budgets and fast training.
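To make "segment-level corrections" concrete, here is a minimal sketch of how the edited spans could be recovered from an (original, corrected) response pair. It assumes whitespace tokenization and uses Python's standard `difflib`; the function name `corrected_segment_masks` and the example sentences are hypothetical and not taken from the paper, whose pipeline may locate corrected segments differently.

```python
import difflib


def corrected_segment_masks(original_tokens, corrected_tokens):
    """Mark which tokens differ between the original and the corrected response.

    Returns two boolean lists: one aligned with the original tokens and one
    aligned with the corrected tokens. True means the token lies inside an
    edited (non-matching) segment.
    """
    matcher = difflib.SequenceMatcher(
        a=original_tokens, b=corrected_tokens, autojunk=False
    )
    original_mask = [False] * len(original_tokens)
    corrected_mask = [False] * len(corrected_tokens)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged segment: the annotator kept it as-is
        for i in range(i1, i2):
            original_mask[i] = True
        for j in range(j1, j2):
            corrected_mask[j] = True
    return original_mask, corrected_mask


# Hypothetical example: the annotator fixes a single hallucinated attribute.
original = "A dog sits on the red sofa".split()
corrected = "A dog sits on the blue sofa".split()
orig_mask, corr_mask = corrected_segment_masks(original, corrected)
```

The masks tell a downstream trainer which tokens carry the human's correction and which were left untouched, which is the "what to change and what to keep" signal described above.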

Key Findings

Fine-grained corrections cut hallucinations on a human-eval benchmark

Numbers: 34.8% reduction on MHumanEval (object hallucination, 1.4k prefs)

RLHF-V needs much less preference data than a concurrent RLHF approach

Numbers: Outperforms LLaVA-RLHF trained on 10k prefs using only 1.4k prefs

RLHF-V is robust to scene-based over-generalization

Numbers: Average scene hallucination change ∆ = 1.7 (RLHF-V) vs larger ∆ for baselines (Table 2)

Training is computationally light

Numbers: <1 hour on 8 A100 GPUs (DDPO, 7 epochs)

Results

Object HalBench (response-level hallucination)

Value: 12.2%

Baseline: Muffin 50.5%

Object HalBench (mention-level hallucination)

Value: 7.5%

Baseline: Muffin 24.5%

MHumanEval (overall response-level hallucination)

Value: 55.5%

Baseline: Muffin 74.7%

Accuracy

Value: 80.0

Scene over-generalization (avg ∆ change)

Value: 1.7

Baseline: GPT-4V ∆ ~5.0
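The value/baseline pairs above are absolute hallucination rates. As a worked example, the relative reduction against the Muffin base model can be derived from them as follows. These derived percentages are simple arithmetic on the listed numbers, not figures reported by the source, and the helper name `relative_reduction` is hypothetical. Note that the 34.8% headline figure refers to object hallucination on MHumanEval, which is a different measurement from the overall MHumanEval row used here.

```python
def relative_reduction(baseline_pct, value_pct):
    """Relative drop from baseline to value, expressed in percent."""
    return (baseline_pct - value_pct) / baseline_pct * 100.0


# (baseline, value) pairs copied from the Results section above.
rows = {
    "Object HalBench, response-level": (50.5, 12.2),
    "Object HalBench, mention-level": (24.5, 7.5),
    "MHumanEval, overall response-level": (74.7, 55.5),
}

reductions = {
    name: round(relative_reduction(baseline, value), 1)
    for name, (baseline, value) in rows.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% relative reduction")
# Object HalBench, response-level: 75.8% relative reduction
# Object HalBench, mention-level: 69.4% relative reduction
# MHumanEval, overall response-level: 25.7% relative reduction
```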

Who Should Care

What To Try In 7 Days

Collect a small set (~200–1.4k) of segment-level corrections on your worst hallucination cases.

Apply DDPO-style weighted fine-tuning (highlight corrected segments) on your current multimodal model.

Fine-tune on a trusted VQA dataset (e.g., VQAv2) to calibrate hallucination-prone behaviors before large-scale deployment.

Optimization Features

Infra Optimization

  • Fast training: <1 hour on 8 A100s for reported runs

Training Optimization

  • DDPO uses segment-weighted likelihood with γ=5
  • Fine-tune on VQAv2 to counter noisy pretraining text
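The segment-weighted likelihood with γ=5 can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' implementation: it assumes per-token log-probabilities are already available, that corrected-segment tokens are weighted by γ and the sum is normalized by the total weight, and that the resulting scores are plugged into a standard DPO-style logistic loss. The function names and the `beta=0.5` default are hypothetical.

```python
import math


def segment_weighted_logprob(token_logps, corrected_mask, gamma=5.0):
    """Weighted average of token log-probs; corrected tokens count gamma times."""
    if len(token_logps) != len(corrected_mask):
        raise ValueError("token_logps and corrected_mask must have equal length")
    weights = [gamma if is_corrected else 1.0 for is_corrected in corrected_mask]
    total_weight = sum(weights)
    return sum(w * lp for w, lp in zip(weights, token_logps)) / total_weight


def dpo_style_loss(policy_win, policy_lose, ref_win, ref_lose, beta=0.5):
    """-log(sigmoid(margin)) for a preferred (win) vs rejected (lose) response.

    Each argument is a sequence-level score, e.g. from segment_weighted_logprob,
    computed under the trainable policy or the frozen reference model.
    beta is an assumed illustrative default.
    """
    margin = beta * ((policy_win - ref_win) - (policy_lose - ref_lose))
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical two-token response whose second token was human-corrected.
score = segment_weighted_logprob([-1.0, -2.0], [False, True], gamma=5.0)
loss_at_init = dpo_style_loss(score, score, score, score)
```

With γ=5, the corrected token contributes five times as much as the unchanged one, so the gradient concentrates on the spans the annotator actually edited. When policy and reference agree (margin of zero), the loss equals log 2.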

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Preference data is small (1.4k); broader coverage of scenes and languages is untested.
  • Method relies on human correction quality; noisy or inconsistent corrections would hurt results.
  • Improving trustworthiness can reduce descriptive detail compared with larger models (distillation from stronger models increased hallucinations).

When Not To Use

  • When you need a model to produce richer, highly detailed descriptions beyond the base model capacity (distillation may worsen hallucinations).
  • If you cannot supply reliable segment-level human corrections for your domain.

Failure Modes

  • Over-correction: the model may omit plausible details to avoid hallucination.
  • Distribution shift: corrections collected on one dataset may not generalize to other visual domains.
  • Distilling from much stronger models can teach risky behaviors and increase hallucinations.

Core Entities

Models

  • RLHF-V (this paper)
  • Muffin (base)
  • LLaVA-RLHF
  • LLaVA
  • InstructBLIP
  • Qwen-VL-Chat
  • GPT-4V

Metrics

  • response-level hallucination rate
  • mention-level hallucination rate
  • informativeness (GPT-4 score)
  • Accuracy
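The difference between the response-level and mention-level rates can be illustrated with a small sketch. It assumes the common object-hallucination convention: a response counts as hallucinated if it mentions at least one object absent from the image, while the mention-level rate is the share of all object mentions that are absent. The function name, input format, and sample data are hypothetical, and the benchmarks listed here may extract and match objects differently.

```python
def hallucination_rates(samples):
    """Compute (response_level_rate, mention_level_rate).

    samples: list of (mentioned_objects, objects_in_image) pairs, where
    mentioned_objects is a list of object names found in the model response
    and objects_in_image is a set of ground-truth object names.
    """
    hallucinated_responses = 0
    hallucinated_mentions = 0
    total_mentions = 0
    for mentioned, present in samples:
        wrong = [obj for obj in mentioned if obj not in present]
        if wrong:
            hallucinated_responses += 1  # at least one bad mention
        hallucinated_mentions += len(wrong)
        total_mentions += len(mentioned)
    response_rate = hallucinated_responses / len(samples) if samples else 0.0
    mention_rate = hallucinated_mentions / total_mentions if total_mentions else 0.0
    return response_rate, mention_rate


# Hypothetical data: the first response invents a "cat", the second is clean.
samples = [
    (["dog", "sofa", "cat"], {"dog", "sofa"}),
    (["car"], {"car", "road"}),
]
response_rate, mention_rate = hallucination_rates(samples)
# response_rate is 0.5 (1 of 2 responses), mention_rate is 0.25 (1 of 4 mentions)
```

A single wrong object flags the whole response, so the response-level rate is typically the higher of the two, as in the Results figures above.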

Datasets

  • Object HalBench
  • MMHal-Bench
  • MHumanEval (constructed)
  • VQAv2

Benchmarks

  • Object HalBench
  • MMHal-Bench
  • MHumanEval
  • LLaVA Bench
  • VQAv2

Context Entities

Models

  • OmniLMM-12B (applied with RLHF-V pipeline)

Datasets

  • COCO (used for Object HalBench sampling)