MM-RLHF: 120k human preference pairs, a critique-based reward model, and dynamic reward scaling to align multimodal LLMs

February 14, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

1

Authors

Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan

Why It Matters For Business

MM-RLHF provides a large, human-annotated preference dataset and practical training recipes that reduce unsafe outputs and improve conversation quality, so teams can make multimodal products more reliable without depending solely on massive closed-source reward models.

Summary TLDR

This paper introduces MM-RLHF, a 120k human-annotated multimodal preference dataset and two alignment contributions: a critique-based reward model (MM-RLHF-Reward-7B) and MM-DPO (DPO with Dynamic Reward Scaling). The dataset comes from 10M raw samples, resampled to ~30k queries and annotated into 120k ranked pairs. On 27 benchmarks, alignment with MM-RLHF plus MM-DPO improves conversational scores (~11% average on evaluated benchmarks) and cuts unsafe behavior (~57% reduction on evaluated safety metrics). The 7B reward model yields strong open-source reward signals (ACC/ACC+ up to 0.85/0.67 overall) and enables instance-level beta scaling during training. Practical outcome: use the dataset +

Problem Statement

State-of-the-art multimodal LLMs rarely receive rigorous alignment to human preferences. Existing alignment work often targets isolated problems (e.g., hallucination) and small datasets (<10k). The field lacks a large, fine-grained multimodal RLHF dataset and practical reward/optimization methods to scale alignment across vision, video, safety, reasoning, and conversation.

Main Contribution

MM-RLHF: a human-annotated multimodal preference dataset with 120k ranked comparison pairs sampled from 10M raw instances and ~30k representative queries.

Critique-Based Reward Model: a reward model trained to first generate a critique (an explanation of each output's strengths and flaws) and then score the outputs, using GPT-4o to expand human rationales for supervision.

MM-DPO: Direct Preference Optimization with Dynamic Reward Scaling that weights training pairs by reward margin to prioritize high-confidence comparisons.

Two benchmarks: MM-RLHF-RewardBench (reward model test set) and MM-RLHF-SafetyBench (safety/adversarial evaluation).

Extensive experiments across 27 benchmarks showing broad gains in conversation, safety, hallucination reduction, reasoning, and video/multi-image tasks.
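
To make the MM-DPO contribution concrete, the following is a minimal sketch of the standard DPO loss for one preference pair, written so that beta is passed in per pair rather than fixed globally. The function name `dpo_pair_loss` and its argument names are hypothetical, and how the per-pair beta is derived from the reward margin is not shown here; this illustrates the idea only and is not the authors' implementation.

```python
import math

def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta):
    """DPO loss for a single (chosen, rejected) pair with a per-pair beta.

    All *_logp arguments are summed log-probabilities of a full response.
    """
    # How much more the policy prefers the chosen response over the rejected
    # one, relative to the frozen reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

With a fixed beta every pair pulls on the policy with the same strength; passing a larger beta for a pair whose reward margin is large makes that pair's gradient steeper, which is the prioritization the summary describes.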

Key Findings

Dataset scale and construction

Numbers: 120k ranked pairs; sampled from 10M raw instances and ~30k queries

Critique-based reward training improves ranking robustness

Numbers: ACC improved to 0.85, ACC+ to 0.67 (overall) for MM-RLHF-Reward

Expanding human rationales with an LLM helps reward training

Numbers: ACC+ up by ~17% versus baseline when using enhanced annotations

Instance-level dynamic beta (MM-DPO) prioritizes informative pairs

Numbers: alignment yields ~11% average conversational gains and ~57% reduction in unsafe behavior on evaluated benchmarks

A compact 7B reward model performs strongly

Numbers: MM-RLHF-Reward-7B averages 50.15 (Table 5 average); outperforms many open-source 72B models on several metrics

Results

Dataset size

Value: 120k ranked pairs (from ~30k queries)

Baseline: prior multimodal RLHF datasets (<10k)

Accuracy

Value: ACC 0.85, ACC+ 0.67

Baseline: LLaVA-Critic overall ACC 0.45, ACC+ 0.17

Reward model average score (multi-dim)

Value: average 50.15

Baseline: many open-source 72B models score in the 36–46 range

Conversational ability (avg on evaluated benchmarks)

Value: ~11% improvement

Baseline: the same models before alignment

Unsafe behavior reduction (safety metrics)

Value: ~57% reduction in unsafe behavior

Baseline: the same models before alignment

Who Should Care

What To Try In 7 Days

Visit the project page and inspect the dataset samples and annotation guidelines (project page URL).

Download a small MM-RLHF sample and train MM-RLHF-Reward-7B or a local 7B critic on a subset.

Run MM-DPO on a small LLaVA-OV-7B checkpoint with dynamic beta and monitor ACC/ACC+ on a held-out set.

Agent Features

Tool Use

  • uses LLMs for annotation expansion
  • uses reward model in training loop

Frameworks

  • DPO
  • MM-DPO

Optimization Features

Infra Optimization

  • Vision encoder frozen during alignment to reduce compute

Model Optimization

  • Critique head + scoring head (joint training) for reward model
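
One way to read "joint training" of a critique head and a scoring head is as a sum of two losses: a language-modeling loss on the critique text and a pairwise ranking loss on the scalar scores. The sketch below assumes exactly that combination; the Bradley-Terry style ranking term, the `critique_weight` parameter, and all function names are assumptions for illustration and are not confirmed by this summary.

```python
import math

def critique_nll(token_logps):
    """Mean negative log-likelihood of the reference critique tokens."""
    return -sum(token_logps) / len(token_logps)

def pairwise_ranking_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)), numerically stable."""
    diff = score_chosen - score_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def joint_reward_loss(token_logps, score_chosen, score_rejected,
                      critique_weight=1.0):
    """Assumed joint objective: weighted critique loss plus ranking loss."""
    return (critique_weight * critique_nll(token_logps)
            + pairwise_ranking_loss(score_chosen, score_rejected))
```

In this sketch the critique term teaches the model to explain, while the ranking term only cares that the preferred response scores higher than the rejected one.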

Training Optimization

  • Dynamic Reward Scaling: per-pair β(δ) bounded in [β_ori, (1+w)β_ori]
  • SFT
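
The bound on β(δ) stated above can be illustrated with a saturating function of the reward margin δ. The sketch below uses β_ori · (1 + w · (1 − e^(−kδ))), which stays in [β_ori, (1+w)β_ori] for δ ≥ 0. The exponential form, the sharpness parameter `k`, the default values, and the clamping of negative margins are all assumptions chosen to satisfy the stated bounds, not details taken from this summary.

```python
import math

def dynamic_beta(reward_margin, beta_ori=0.1, w=0.5, k=1.0):
    """Per-pair beta that grows with the reward margin and saturates.

    reward_margin: reward-model score of the chosen response minus that of
    the rejected response (delta).
    """
    # Assumption: treat negative margins as zero so beta never drops
    # below beta_ori.
    delta = max(reward_margin, 0.0)
    # Rises from beta_ori at delta = 0 toward (1 + w) * beta_ori as
    # delta grows large.
    return beta_ori * (1.0 + w * (1.0 - math.exp(-k * delta)))
```

Because the function saturates, a single noisy pair with an extreme margin cannot push beta past (1+w)β_ori, which limits the instability risk noted under Failure Modes.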

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • High human annotation cost and a two-month annotation effort limit quick scaling
  • Limited ultra-high-resolution image coverage, so high-res benchmarks saw little gain
  • Instance-level dynamic beta requires a strong external reward model to be reliable
  • Small MLLMs (<7B) struggle to self-improve via sampling due to capacity and weak reward signals

When Not To Use

  • When you need alignment targeted at ultra-high-resolution images
  • When you cannot afford human annotation or compute to train reward models
  • If only a tiny labeled preference set is available and no external reward model can be built

Failure Modes

  • Reward model overfitting to conversational domains leading to poor signals on math/chart tasks
  • Incorrect or hallucinated critiques from the critic can mislead scoring
  • Over-aggressive β scaling causing unstable updates if reward margins are noisy
  • Bias from annotator mistakes or inconsistent rankings

Core Entities

Models

  • MM-RLHF-Reward-7B
  • LLaVA-OV-7B
  • LLaVA-OV-0.5B
  • InternVL-1B
  • Qwen2-VL-72B
  • GPT-4o
  • Claude-3.5-sonnet
  • LLaVA-Critic
  • LLaMA3.2-90B-Vision-Instruct

Metrics

  • ACC
  • ACC+
  • win rate
  • ASR (attack success rate)
  • RtA (refusal-to-answer rate)
  • percent improvement
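
The summary lists ACC and ACC+ without defining them. A minimal sketch follows, assuming ACC is the fraction of individual response pairs the reward model ranks correctly and ACC+ is the stricter fraction of queries for which every pair is ranked correctly. These definitions and the input layout are assumptions made for illustration.

```python
def acc_and_acc_plus(samples):
    """Compute (ACC, ACC+) from reward-model scores.

    samples: one entry per query; each entry is a list of
    (score_preferred, score_rejected) tuples for that query's pairs.
    """
    total_pairs = 0
    correct_pairs = 0
    fully_correct_queries = 0
    for pairs in samples:
        # A pair counts as correct when the preferred response scores higher.
        hits = [preferred > rejected for preferred, rejected in pairs]
        total_pairs += len(hits)
        correct_pairs += sum(hits)
        # ACC+ credits a query only if all of its pairs are correct.
        fully_correct_queries += all(hits)
    return correct_pairs / total_pairs, fully_correct_queries / len(samples)

# Two queries with two pairs each; one pair in the second query is misranked.
example = [[(0.9, 0.1), (0.8, 0.2)], [(0.3, 0.6), (0.7, 0.1)]]
acc, acc_plus = acc_and_acc_plus(example)  # acc == 0.75, acc_plus == 0.5
```

Under these assumed definitions ACC+ can never exceed ACC, which matches the pattern in the reported numbers (0.85 versus 0.67).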

Datasets

  • MM-RLHF
  • LLaVA-RLHF
  • VLFeedback
  • LLaVA-OneVision
  • UniMM-Chat
  • SharedGPT-4 video
  • VLGuard

Benchmarks

  • MM-RLHF-RewardBench
  • MM-RLHF-SafetyBench
  • MME
  • MMBench
  • VQAv2
  • POPE
  • MMHal-Bench