Survey: aligning diffusion models to human preferences — methods, benchmarks, and open problems

Overview

Decision Snapshot: Ready For Pilot

The survey synthesizes many recent methods and datasets and cites empirical comparisons, so its guidance is actionable. However, reported results vary by benchmark and reward model, so validate conclusions with human checks before relying on them.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Buhua Liu, Shitong Shao, Bao Li, Lichen Bai, Zhiqiang Xu, Haoyi Xiong, James Kwok, Sumi Helal, Zeke Xie

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Aligning diffusion models can reduce customer friction and safety risk: aligned models produce outputs that better match user intent, which in turn can lower moderation costs.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

This is a practical survey of how to align diffusion generative models with human intent. It reviews core alignment methods (RLHF, Direct Preference Optimization (DPO), and test-time techniques), catalogs major preference datasets and reward models, compares trade-offs (compute, stability, scalability), and lists open problems such as data scarcity, reward over-optimization, and benchmark bias. The paper highlights that DPO variants and test-time methods are being rapidly adopted for images and other modalities, while RLHF remains powerful but costly and brittle.

Problem Statement

Diffusion models generate high-quality content but often miss user intent or produce undesirable outputs. Standard training optimizes likelihood, not human preference, so models can produce technically plausible but misaligned images. Aligning diffusion models requires new data formats, reward signals, and algorithms because generation is iterative, high-dimensional, and multimodal.

Main Contribution

Comprehensive review of alignment methods applied to diffusion models: RLHF, DPO, and test-time approaches

Catalog and comparison of major human-preference datasets and reward models for T2I

Key Findings

Alignment research is heavily concentrated on language models; diffusion model alignment accounts for only a small fraction of studies.

Numbers: LLMs account for 89.4% of studies; diffusion models for 10.6% (Google Scholar, Jan 15, 2026)

Practical Use: Expect fewer off-the-shelf alignment datasets and tools for diffusion models; plan extra effort to build or adapt preference data when working on image/video alignment.

Evidence Ref: Introduction (Fig. 2b)

Most reward models achieve under 80% pairwise preference prediction accuracy on standard benchmarks.

Numbers: Reward-model accuracies are typically <80% across benchmarks (Table 4)

Practical Use: Reward models are imperfect evaluators; validate with human checks and avoid over-reliance on a single reward model during tuning.

Evidence Ref: Section 5.2.1 (Table 4)
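The pairwise accuracy figure above can be reproduced on your own preference data with a few lines of code. The sketch below is a minimal illustration, not the survey's evaluation code: the record layout, the `score` callable, and the tie-breaking rule are all assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical record layout: (prompt, image_a, image_b, human_choice),
# where human_choice is 0 if annotators preferred image_a and 1 for image_b.
PreferencePair = Tuple[str, object, object, int]


def pairwise_accuracy(
    pairs: Sequence[PreferencePair],
    score: Callable[[str, object], float],
) -> float:
    """Fraction of pairs where the reward model ranks the human-preferred image higher."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = 0
    for prompt, image_a, image_b, human_choice in pairs:
        # Assumption: ties are resolved in favour of image_a.
        model_choice = 0 if score(prompt, image_a) >= score(prompt, image_b) else 1
        correct += int(model_choice == human_choice)
    return correct / len(pairs)


def toy_score(prompt: str, image: object) -> float:
    """Stand-in for a real reward model: the 'image' is just a float."""
    return float(image)


toy_pairs = [
    ("a red cube", 0.9, 0.2, 0),   # model agrees with the human label
    ("a blue dog", 0.1, 0.7, 1),   # model agrees
    ("two cats", 0.8, 0.3, 1),     # model disagrees
    ("a green car", 0.4, 0.6, 0),  # model disagrees
]
print(pairwise_accuracy(toy_pairs, toy_score))
```

In practice `score` would wrap whichever reward model you are evaluating, and comparing the result across two or more reward models on the same pairs is one way to avoid depending on a single evaluator.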

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	HPS v2: 83.3% (on HPD v2)	—	—	HPD v2	Table 4 reports HPS v2 83.3% on HPD v2	Section 5.2.1 (Table 4)
Accuracy	MPS: 74.2% (on MHP)	—	—	MHP	Table 4 reports MPS 74.2% on MHP	Section 5.2.1 (Table 4)

What To Try In 7 Days

Run a small human preference study (100–500 pairs) on your core prompts to measure current alignment

Try test-time alignment: use prompt optimization and attention control to improve prompt-following without retraining

Fine-tune the model with a simple DPO objective on the collected pairs, then compare reward scores and human checks against the untuned baseline
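To make the DPO step more concrete, here is a rough sketch of a per-pair loss in the style commonly used when adapting DPO to diffusion models, where log-likelihoods are replaced by denoising errors measured against a frozen reference model. This is an illustration under stated assumptions, not the survey's prescribed recipe: the function names are hypothetical, noise predictions are plain lists of floats, and the single `beta` parameter is assumed to absorb any timestep weighting.

```python
import math
from typing import Sequence


def squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of squared differences between a noise prediction and the true noise."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def diffusion_dpo_pair_loss(
    noise_w: Sequence[float], pred_w_policy: Sequence[float], pred_w_ref: Sequence[float],
    noise_l: Sequence[float], pred_l_policy: Sequence[float], pred_l_ref: Sequence[float],
    beta: float = 1.0,
) -> float:
    """Loss for one (preferred 'w', rejected 'l') pair at one sampled timestep."""
    # Negative values mean the policy denoises this sample better than the reference.
    diff_w = squared_error(pred_w_policy, noise_w) - squared_error(pred_w_ref, noise_w)
    diff_l = squared_error(pred_l_policy, noise_l) - squared_error(pred_l_ref, noise_l)
    # The loss falls when the policy improves more on the preferred sample
    # than on the rejected one, relative to the reference model.
    logit = -beta * (diff_w - diff_l)
    return softplus(-logit)  # equals -log(sigmoid(logit))


# When policy and reference agree, the loss is log(2).
same = diffusion_dpo_pair_loss([1.0, 0.0], [0.5, 0.0], [0.5, 0.0],
                               [0.0, 1.0], [0.0, 0.5], [0.0, 0.5])
print(same)
```

A real training loop would compute the predictions with a deep-learning framework, average this loss over a batch, and backpropagate only through the policy; the reference model stays frozen.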

Agent Features

Frameworks

RLHF, DPO

Architectures

score-based diffusion, latent diffusion

Optimization Features

Infra Optimization

Avoid storing full sampling trajectories to reduce memory use in RL fine-tuning

Model Optimization

Distillation for faster models (e.g., SD3-Turbo), LoRA

System Optimization

Reuse of trajectories via importance sampling, few-step model alignment strategies

Training Optimization

Reward-weighted regression, KL regularization, gradient checkpointing

Inference Optimization

Test-time prompt optimization, attention control, initial noise optimization
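One of the simplest ways to act on the initial noise at test time is a best-of-N search over seeds: generate several candidates from different starting noise and keep the one a reward model scores highest. The sketch below shows only this simple variant, not a gradient-based noise optimizer. The `generate` and `reward` callables are hypothetical placeholders for your own pipeline and reward model.

```python
import random
from typing import Callable, Tuple


def best_of_n_seed(
    generate: Callable[[str, int], object],
    reward: Callable[[str, object], float],
    prompt: str,
    n_candidates: int = 8,
    base_seed: int = 0,
) -> Tuple[int, object, float]:
    """Return (seed, sample, reward) for the highest-scoring of n_candidates seeds."""
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    rng = random.Random(base_seed)  # makes the set of candidate seeds reproducible
    best = None
    for _ in range(n_candidates):
        seed = rng.randrange(2**31)
        sample = generate(prompt, seed)  # the seed determines the initial noise
        value = reward(prompt, sample)
        if best is None or value > best[2]:
            best = (seed, sample, value)
    return best


def toy_generate(prompt: str, seed: int) -> int:
    """Stand-in generator: the 'image' is just a number derived from the seed."""
    return seed % 100


def toy_reward(prompt: str, sample: object) -> float:
    """Stand-in reward model for the toy generator."""
    return float(sample)


seed, sample, value = best_of_n_seed(toy_generate, toy_reward, "a red cube", n_candidates=5)
print(seed, sample, value)
```

Cost grows linearly with `n_candidates`, and selecting on a single reward model inherits that model's blind spots, so spot-check the selected outputs with human review.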

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/xie-lab-ml/awesome-alignment-of-diffusion-models

Data URLs

HPD v1 / HPD v2 (referenced datasets), Pick-a-Pic v1, ImageRewardDB, MHP, RichHF-18K, VisionPrefer, DiffusionDB

Risks & Boundaries

Limitations

The survey summarizes many methods but does not provide new empirical benchmarks

Reward models and datasets are benchmark-specific and may not generalize

When Not To Use

If you lack any preference data or budget for even small human studies

When ultra-low-latency generation is required and test-time steering is infeasible

Failure Modes

Reward hacking, where the model maximizes a proxy metric while real quality degrades

Diversity collapse after reward-driven fine-tuning
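A lightweight way to watch for diversity collapse is to compare the spread of image embeddings before and after fine-tuning for a fixed prompt set. The following is a minimal sketch under assumptions: embeddings are plain lists of floats from any encoder you choose, spread is measured as mean pairwise cosine distance, and the 30% drop threshold is an arbitrary illustrative value rather than one taken from the survey.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; 0 for identical directions, 2 for opposite ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero-length embedding")
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(embeddings: Sequence[Sequence[float]]) -> float:
    """Average cosine distance over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


def diversity_dropped(
    before: Sequence[Sequence[float]],
    after: Sequence[Sequence[float]],
    max_relative_drop: float = 0.3,  # assumed threshold, tune for your data
) -> bool:
    """True if spread after fine-tuning fell by more than max_relative_drop."""
    return mean_pairwise_distance(after) < (1.0 - max_relative_drop) * mean_pairwise_distance(before)


before = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # well spread out
after = [[1.0, 0.0], [1.0, 0.01], [1.0, -0.01]]   # nearly identical outputs
print(diversity_dropped(before, after))
```

Tracking this number alongside the reward score helps separate genuine improvement from reward hacking: a rising reward with sharply falling diversity is a reason to inspect samples by hand.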

Core Entities

Models

DDPM, DDIM, Latent Diffusion Model (LDM), Stable Diffusion 3 (SD3), SD3-Turbo, DALL-E 3

Metrics

Accuracy, CLIP score, HPS v2, PickScore, VP-Score, FID, Inception Score

Datasets

HPD v1, HPD v2, Pick-a-Pic v1, ImageRewardDB, MHP, RichHF-18K, Picsart Image-Social, VisionPrefer, DiffusionDB

Benchmarks

GENEval, GenAI-Bench, HEIM, VPEval, GEND-Eval (compositional tests)

Context Entities

Models

ReFL, DRaFT, TDPO-R, PRDP, SDPO, LaSRO

Metrics

Elo correlation, NSS/KLD/AUC-Judd (saliency), VQAScore (compositional correctness)

Datasets

DreamBench++ (AI-annotated benchmarks), VisionReward (multi-dimensional)
