Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
MM-RLHF provides large, human-quality preference data and practical training recipes that reduce unsafe outputs and boost conversation quality, so teams can make multimodal products more reliable without depending only on massive closed-source reward models.
Summary TLDR
This paper introduces MM-RLHF, a 120k human-annotated multimodal preference dataset and two alignment contributions: a critique-based reward model (MM-RLHF-Reward-7B) and MM-DPO (DPO with Dynamic Reward Scaling). The dataset comes from 10M raw samples, resampled to ~30k queries and annotated into 120k ranked pairs. On 27 benchmarks, alignment with MM-RLHF plus MM-DPO improves conversational scores (~11% average on evaluated benchmarks) and cuts unsafe behavior (~57% reduction on evaluated safety metrics). The 7B reward model yields strong open-source reward signals (ACC/ACC+ up to 0.85/0.67 overall) and enables instance-level beta scaling during training. Practical outcome: use the dataset +
Problem Statement
State-of-the-art multimodal LLMs rarely receive rigorous alignment to human preferences. Existing alignment work often targets isolated problems (e.g., hallucination) and small datasets (<10k). The field lacks a large, fine-grained multimodal RLHF dataset and practical reward/optimization methods to scale alignment across vision, video, safety, reasoning, and conversation.
Main Contribution
MM-RLHF: a human-annotated multimodal preference dataset with 120k ranked comparison pairs sampled from 10M raw instances and ~30k representative queries.
Critique-Based Reward Model: train a reward model to first generate a critique explaining its judgment and then score the outputs, using GPT-4o to expand human rationales for supervision.
MM-DPO: Direct Preference Optimization with Dynamic Reward Scaling that weights training pairs by reward margin to prioritize high-confidence comparisons.
Two benchmarks: MM-RLHF-RewardBench (reward model test set) and MM-RLHF-SafetyBench (safety/adversarial evaluation).
Extensive experiments across 27 benchmarks showing broad gains in conversation, safety, hallucination reduction, reasoning, and video/multi-image tasks.
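To make the MM-DPO item above concrete, here is a minimal sketch of the standard per-pair DPO loss with β passed in as an argument, so a per-pair value can be supplied instead of a global constant. The function name and argument names are illustrative assumptions, not the paper's code; only the general DPO form is assumed.

```python
import math

def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta):
    """Standard DPO loss for a single preference pair (illustrative sketch).

    Inputs are summed log-probabilities of each response under the policy
    and under the frozen reference model. `beta` may differ per pair.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Same pair, two different beta values (hypothetical numbers).
low = dpo_pair_loss(-1.0, -2.0, -1.5, -1.5, beta=0.10)
high = dpo_pair_loss(-1.0, -2.0, -1.5, -1.5, beta=0.15)
```

Because β multiplies the margin inside the sigmoid, raising β for a pair changes how strongly that pair's margin drives the update, which is the lever that per-pair scaling relies on.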
Key Findings
Dataset scale and construction
Critique-based reward training improves ranking robustness
Expanding human rationales with an LLM helps reward training
Instance-level dynamic beta (MM-DPO) prioritizes informative pairs
A compact 7B reward model performs strongly
Results
Dataset size
Accuracy
Reward model average score (multi-dim)
Conversational ability (avg on evaluated benchmarks)
Unsafe behavior reduction (safety metrics)
Who Should Care
What To Try In 7 Days
Visit the project page and inspect the dataset samples and annotation guidelines (project page URL).
Download a small MM-RLHF sample and train MM-RLHF-Reward-7B or a local 7B critic on a subset.
Run MM-DPO on a small LLaVA-OV-7B checkpoint with dynamic beta and monitor ACC/ACC+ on a held-out set.
Agent Features
Tool Use
- uses LLMs for annotation expansion
- uses reward model in training loop
Frameworks
- DPO
- MM-DPO
Optimization Features
Infra Optimization
- Vision encoder frozen during alignment to reduce compute
Model Optimization
- Critique head + scoring head (joint training) for reward model
Training Optimization
- Dynamic Reward Scaling: per-pair β(δ) bounded in [β_ori, (1+w)β_ori]
- SFT
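The bound on β(δ) listed under Training Optimization can be illustrated with a saturating function of the reward margin δ. The exponential form, the default values of `beta_ori`, `w`, and `k`, and the clamping of negative margins are all assumptions chosen so that the output stays inside [β_ori, (1+w)β_ori]; the summary above states only the bounds.

```python
import math

def dynamic_beta(delta, beta_ori=0.1, w=0.5, k=1.0):
    """Per-pair beta that grows with the reward margin (illustrative sketch).

    delta:    reward-model margin between chosen and rejected responses
    beta_ori: base DPO beta (assumed default)
    w:        maximum relative increase, so beta <= (1 + w) * beta_ori
    k:        how quickly beta saturates as delta grows (assumed parameter)
    """
    # Assumption: treat negative margins as zero so beta never drops
    # below beta_ori.
    delta = max(delta, 0.0)
    # 1 - exp(-k * delta) rises from 0 toward 1, keeping beta in bounds.
    return beta_ori * (1.0 + w * (1.0 - math.exp(-k * delta)))

# Hypothetical margins from a reward model for three training pairs.
betas = [dynamic_beta(d) for d in (0.0, 0.5, 5.0)]
```

A saturating shape is one way to keep a few very large or noisy margins from producing unbounded β values, which relates to the instability risk noted under Failure Modes.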
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- High human annotation cost and a two-month annotation effort limit quick scaling
- Limited ultra-high-resolution image coverage, so high-res benchmarks saw little gain
- Instance-level dynamic beta requires a strong external reward model to be reliable
- Small MLLMs (<7B) struggle to self-improve via sampling due to capacity and weak reward signals
When Not To Use
- When you need alignment targeted at ultra-high-resolution images
- When you cannot afford human annotation or compute to train reward models
- If only a tiny labeled preference set is available and no external reward model can be built
Failure Modes
- Reward model overfitting to conversational domains leading to poor signals on math/chart tasks
- Incorrect or hallucinated critiques from the critic can mislead scoring
- Over-aggressive β scaling causing unstable updates if reward margins are noisy
- Bias from annotator mistakes or inconsistent rankings
Core Entities
Models
- MM-RLHF-Reward-7B
- LLaVA-OV-7B
- LLaVA-OV-0.5B
- InternVL-1B
- Qwen2-VL-72B
- GPT-4o
- Claude-3.5-sonnet
- LLaVA-Critic
- LLaMA3.2-90B-Vision-Instruct
Metrics
- ACC
- ACC+
- win rate
- ASR (attack success rate)
- RtA (refusal-to-answer rate)
- percent improvement
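As a rough illustration of how the pairwise reward-model metrics might be computed, the sketch below scores a set of samples, each with several (preferred, other) response pairs. ACC is taken as the fraction of pairs ranked correctly. The reading of ACC+ as the fraction of samples whose pairs are all ranked correctly is an assumption, since this summary does not define it; the function name and data layout are also hypothetical.

```python
def acc_and_acc_plus(samples):
    """Compute pairwise accuracy metrics for a reward model (sketch).

    samples: list of samples; each sample is a non-empty list of
             (score_preferred, score_other) tuples from the reward model.
    Returns (acc, acc_plus).
    """
    total_pairs = 0
    correct_pairs = 0
    fully_correct_samples = 0
    for pairs in samples:
        # A pair counts as correct when the human-preferred response
        # receives the strictly higher reward score.
        hits = [preferred > other for preferred, other in pairs]
        total_pairs += len(hits)
        correct_pairs += sum(hits)
        # Assumed ACC+ rule: every pair in the sample must be correct.
        fully_correct_samples += all(hits)
    return correct_pairs / total_pairs, fully_correct_samples / len(samples)

# Two hypothetical samples with two pairs each.
scores = [
    [(0.9, 0.1), (0.8, 0.3)],  # both pairs ranked correctly
    [(0.2, 0.7), (0.6, 0.4)],  # first pair ranked incorrectly
]
acc, acc_plus = acc_and_acc_plus(scores)
```

Under this reading ACC+ is never higher than ACC, which is consistent with the 0.85/0.67 figures quoted in the summary.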
Datasets
- MM-RLHF
- LLaVA-RLHF
- VLFeedback
- LLaVA-OneVision
- UniMM-Chat
- ShareGPT4Video
- VLGuard
Benchmarks
- MM-RLHF-RewardBench
- MM-RLHF-SafetyBench
- MME
- MMBench
- VQAv2
- POPE
- MMHal-Bench

