Make open-source multimodal models far more truthful using AI feedback and self-reward at inference

May 27, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

3

Authors

Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, Xiaocheng Feng, Jun Song, Bo Zheng, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun


Why It Matters For Business

RLAIF-V lets teams reduce multimodal hallucination without expensive human labeling or proprietary APIs, lowering alignment costs and improving product trust where visual accuracy matters.

Summary TLDR

RLAIF-V is a fully open-source pipeline that uses other open multimodal LLMs as labelers to generate high-quality pairwise feedback and then aligns models via direct preference optimization (DPO). Key practical moves: (1) deconfounded candidate generation (sample multiple responses with same decoding settings) to remove style noise; (2) divide-and-conquer claim-level scoring to let weaker labelers produce reliable judgments; (3) iterative feedback collection to avoid distribution shift; (4) use the aligned model itself as a reward at inference with length-normalized token scores and best-of-N selection. On multiple trustworthiness benchmarks, RLAIF-V cuts object hallucination massively (e.g.

Problem Statement

Open-source multimodal LLMs still hallucinate and prior feedback methods rely on costly humans or proprietary models. The community lacks a practical recipe to build human-quality feedback using only open-source MLLMs and to scale feedback to inference time.

Main Contribution

RLAIF-V: a fully open-source feedback alignment pipeline for multimodal LLMs that avoids proprietary labelers.

Deconfounded sampling to generate candidate responses under identical decoding conditions so style-related confounders are reduced.

Divide-and-conquer claim decomposition that converts each response into yes/no claim checks, enabling accurate scoring from weaker open-source labelers.

Self-feedback inference guidance: use the DPO-aligned model as a reward, apply token-level length normalization, and pick best-of-N samples at test time.
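The divide-and-conquer step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sentence-based `split_claims`, the `toy_judge` labeler, and the scoring rule (minus the number of rejected claims) are all assumptions made for the example. In a real pipeline an open MLLM would do the claim splitting and the yes/no checks.

```python
from itertools import combinations
from typing import Callable, List, Tuple

# Hypothetical labeler signature: (image, claim) -> True if the claim is judged supported.
ClaimJudge = Callable[[str, str], bool]

def split_claims(response: str) -> List[str]:
    """Naive stand-in for claim decomposition: one claim per sentence."""
    return [s.strip() for s in response.split(".") if s.strip()]

def score_response(image: str, response: str, judge: ClaimJudge) -> int:
    """Assumed scoring rule: minus the number of claims the labeler rejects."""
    return -sum(1 for claim in split_claims(response) if not judge(image, claim))

def build_pairs(image: str, responses: List[str], judge: ClaimJudge) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs for responses whose scores differ."""
    scored = [(score_response(image, r, judge), r) for r in responses]
    pairs = []
    for (score_a, resp_a), (score_b, resp_b) in combinations(scored, 2):
        if score_a != score_b:
            pairs.append((resp_a, resp_b) if score_a > score_b else (resp_b, resp_a))
    return pairs

# Toy labeler: the "image" is a bag of visible object words, and a claim is
# supported when every known object word it mentions is visible.
KNOWN_OBJECTS = {"dog", "cat", "ball", "frisbee"}

def toy_judge(image: str, claim: str) -> bool:
    visible = set(image.split())
    mentioned = {w for w in claim.lower().split() if w in KNOWN_OBJECTS}
    return mentioned <= visible

candidates = [
    "A dog runs on grass. The dog chases a ball.",
    "A dog runs on grass. A cat watches the dog.",
]
pairs = build_pairs("dog ball grass", candidates, toy_judge)
print(pairs)
```

Scoring per claim means the labeler only ever answers narrow yes/no questions, which is the property the summary credits for making weaker labelers usable.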

Key Findings

RLAIF-V 7B cuts response-level object hallucination on Object HalBench by about 81% relative to the LLaVA 1.5 baseline

Numbers: object hallucination reduced by 80.7% (Rsp. rate 54.5→10.5)

RLAIF-V 12B self-aligned model surpasses GPT-4V on several trust benchmarks

Numbers: OmniLMM 12B Object HalBench Rsp. 19.4→4.5 (76.8% relative reduction); MHumanEval overall hallucination 52.7→35.6 (~32.4% relative reduction)

Divide-and-conquer claim scoring notably raises feedback quality compared with holistic scoring

Numbers: human agreement of constructed pairs: 96.7% (RLAIF-V) vs 66.7% (w/o divide-and-conquer)

Deconfounded sampling improves learning efficiency versus raw human annotations

Numbers: ObjHal. Rsp. 10.1 (RLAIF-V) vs 25.7 (w/o deconfounding)

Self-feedback reward helps improve inference outputs and reduces short-answer bias when length-normalized

Numbers: Best-of-N selection increases average length difference from -7.7 words to +3.9 words (LLaVA 1.5 BoN with RLAIF-V 12B)
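One way to read the length normalization in the last finding is shown below. It assumes the self-feedback reward is the standard DPO implicit reward (dropping the partition term, which is constant across candidates for the same input); the summary does not spell out the formula, so treat the notation as illustrative. Here $x$ is the image plus prompt, $y$ a response of $T$ tokens, $\pi_\theta$ the aligned model, and $\pi_{\mathrm{ref}}$ its reference model.

```latex
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
        = \beta \sum_{t=1}^{T} \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})},
\qquad
\bar{r}(x, y) = \frac{r(x, y)}{T}
```

Because $r$ is a sum over tokens, its magnitude grows with $T$, so ranking candidates by $r$ can favor short answers for reasons unrelated to quality. Ranking by the per-token average $\bar{r}$ removes that dependence on length.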

Results

Object HalBench response-level hallucination

Value: 10.5% (RLAIF-V 7B)

Baseline: 54.5% (LLaVA 1.5)

Object HalBench response-level hallucination

Value: 4.5% (RLAIF-V 12B self)

Baseline: 19.4% (OmniLMM 12B)

MHumanEval response-level hallucination

Value: 35.6% (RLAIF-V 12B)

Baseline: 52.7% (OmniLMM 12B)

Accuracy

Value: 80.5% (RLAIF-V)

Baseline: 72.8% (VL-Feedback)

Human agreement on constructed preference pairs

Value: 96.7% (RLAIF-V with divide-and-conquer)

Baseline: 92.3% (VL-Feedback); 66.7% (self-rewarding w/o divide-and-conquer)
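The relative reductions quoted alongside these results follow directly from the before/after rates. A short check, using only numbers reported above (the helper name `relative_reduction` is ours):

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`, relative to `before`."""
    return (before - after) / before * 100.0

# (baseline rate, aligned rate) in percent, as reported in this summary.
reported = {
    "Object HalBench, LLaVA 1.5 -> RLAIF-V 7B": (54.5, 10.5),
    "Object HalBench, OmniLMM 12B -> RLAIF-V 12B": (19.4, 4.5),
    "MHumanEval, OmniLMM 12B -> RLAIF-V 12B": (52.7, 35.6),
}

for name, (before, after) in reported.items():
    print(f"{name}: {relative_reduction(before, after):.1f}% relative reduction")
```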

Who Should Care

What To Try In 7 Days

Run the RLAIF-V repo and reproduce deconfounded candidate generation on a small instruction set

Implement claim splitting + yes/no claim scoring using an available open MLLM as labeler

Fine-tune with DPO for one iteration on 4k instructions and measure the change on Object HalBench locally (a dev split is described in the paper appendix, viewable on the repo).
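For the DPO step, the per-pair objective is small enough to check by hand before wiring up a trainer. The sketch below is the standard DPO loss in plain Python, not code from the RLAIF-V repo; `beta=0.1` is an assumed value, and the inputs are summed sequence log-probabilities that you would normally obtain from the policy and a frozen reference model.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin)).

    beta=0.1 is an assumption for illustration, not a value from the source.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy has moved toward the chosen response: the loss drops below log(2).
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

A useful sanity check on a real run is that the first batch's loss sits near log(2), since the policy starts as a copy of the reference.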

Agent Features

Tool Use

  • best-of-N selection
  • nucleus sampling

Frameworks

  • Direct Preference Optimization (DPO)
  • iterative feedback learning

Optimization Features

Token Efficiency

  • length normalization to avoid bias toward short outputs

Infra Optimization

  • training and data collection reported using 8x A100 80G; 48–50h collection + 6–8h training

Training Optimization

  • iterative feedback collection to reduce distribution shift
  • deconfounded sampling to improve pairwise signal
  • divide-and-conquer claim scoring to lower labeler capacity needs
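The first two items above combine into a simple loop: in each round, candidates are sampled from the current model under one shared decoding config, labeled, and used for a DPO update. The sketch below shows only that control flow. `generate`, `label_pairs`, and `train_dpo` are hypothetical callables, the decoding values are placeholders, and the toy stand-ins exist purely so the example runs.

```python
import random
from typing import Any, Callable, List, Tuple

# Hypothetical decoding settings; the point is that every candidate shares them.
DECODING = {"temperature": 0.7, "top_p": 0.9}

Pair = Tuple[str, str]  # (chosen, rejected)

def sample_candidates(generate: Callable[..., str], model: Any, prompt: str,
                      n: int, seed: int) -> List[str]:
    """Draw n responses from ONE model under ONE decoding config."""
    rng = random.Random(seed)
    return [generate(model, prompt, rng=rng, **DECODING) for _ in range(n)]

def iterative_alignment(model: Any, prompts: List[str], generate, label_pairs,
                        train_dpo, iterations: int = 2, n: int = 4) -> Any:
    """Each round re-samples from the CURRENT model, then trains on fresh pairs."""
    for it in range(iterations):
        pairs: List[Pair] = []
        for i, prompt in enumerate(prompts):
            cands = sample_candidates(generate, model, prompt, n, seed=1000 * it + i)
            pairs.extend(label_pairs(prompt, cands))
        model = train_dpo(model, pairs)
    return model

# Toy stand-ins: the "model" is just a version number.
def toy_generate(model: int, prompt: str, rng: random.Random,
                 temperature: float, top_p: float) -> str:
    return f"v{model}|{prompt}|{rng.random():.3f}"

def toy_label_pairs(prompt: str, cands: List[str]) -> List[Pair]:
    return [(cands[0], cands[1])]

seen_versions = []  # which model version produced each round's training data

def toy_train_dpo(model: int, pairs: List[Pair]) -> int:
    seen_versions.append({text.split("|")[0] for pair in pairs for text in pair})
    return model + 1

final_model = iterative_alignment(0, ["describe the image"], toy_generate,
                                  toy_label_pairs, toy_train_dpo, iterations=3)
print(final_model, seen_versions)
```

Since both responses in a pair come from the same model and the same decoding settings, differences between them are less likely to come from style, and re-sampling each round keeps the training data close to what the current model actually produces.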

Inference Optimization

  • self-feedback reward from aligned model
  • best-of-N selection with length-normalized token rewards
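A minimal sketch of best-of-N selection with a length-normalized self-reward follows. It assumes the reward is built from per-token log-probability differences between the aligned model and its reference, which is one plausible reading of the summary. The function names, `beta`, and the toy log-probabilities are invented for illustration.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[str, Sequence[float], Sequence[float]]  # (text, policy logps, ref logps)

def self_reward(policy_logps: Sequence[float], ref_logps: Sequence[float],
                beta: float = 0.1, length_normalize: bool = True) -> float:
    """Reward from per-token log-ratios; optionally averaged over tokens."""
    if not policy_logps or len(policy_logps) != len(ref_logps):
        raise ValueError("need equal-length, non-empty token log-probabilities")
    total = sum(p - r for p, r in zip(policy_logps, ref_logps))
    if length_normalize:
        total /= len(policy_logps)
    return beta * total

def best_of_n(candidates: List[Candidate], length_normalize: bool = True) -> str:
    """Return the text of the highest-reward candidate."""
    best = max(candidates,
               key=lambda c: self_reward(c[1], c[2], length_normalize=length_normalize))
    return best[0]

# Invented numbers: the long answer is better per token but has more tokens.
candidates = [
    ("short answer", [-1.2, -1.2], [-1.0, -1.0]),
    ("longer, more detailed answer", [-1.1] * 6, [-1.0] * 6),
]

print(best_of_n(candidates, length_normalize=False))  # summed reward
print(best_of_n(candidates, length_normalize=True))   # per-token reward
```

With these toy values the summed reward picks the short answer and the per-token reward picks the longer one, which is the short-output bias the normalization is meant to counter.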

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires non-trivial training and data-collection compute (authors report 8x A100 and tens of hours).
  • Still leaves residual hallucination; not a complete fix for all errors.
  • Evaluation uses GPT-4 as a comparator with prepared image descriptions, which may introduce judge bias.

When Not To Use

  • If you lack GPU resources to collect iterative feedback and fine-tune (cost-sensitive teams).
  • When you already rely on a trusted proprietary labeling pipeline and prefer to keep that closed loop.

Failure Modes

  • Labeler misjudgment: weaker open MLLMs can still produce incorrect claim scores if claims require external facts.
  • Reward bias toward short outputs without proper length-normalization.
  • Distribution shift if iterative feedback is not refreshed frequently.

Core Entities

Models

  • LLaVA 1.5 (7B)
  • LLaVA-NeXT (34B)
  • OmniLMM (12B)
  • RLAIF-V 7B
  • RLAIF-V 12B
  • GPT-4V

Metrics

  • response-level hallucination rate
  • mention-level hallucination rate
  • Accuracy
  • AMBER F1
  • trustworthiness win rate
  • overall win rate

Datasets

  • Object HalBench
  • MMHal-Bench
  • MHumanEval
  • AMBER
  • RefoMB
  • RLHF-V
  • MSCOCO
  • ShareGPT-4V
  • MovieNet
  • Google Landmark v2
  • VQA v2
  • OKVQA
  • TextVQA

Benchmarks

  • Object HalBench
  • MMHal-Bench
  • MHumanEval
  • AMBER
  • RefoMB
  • MMStar