Survey: aligning diffusion models to human preferences — methods, benchmarks, and open problems

September 11, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The survey synthesizes many recent methods and datasets and cites empirical comparisons, so its guidance is actionable. However, results vary by benchmark and reward model, so apply it with caution and back it with human checks.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 40%

Authors

Buhua Liu, Shitong Shao, Bao Li, Lichen Bai, Zhiqiang Xu, Haoyi Xiong, James Kwok, Sumi Helal, Zeke Xie

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Aligning diffusion models cuts customer friction and reduces safety risks; aligned models produce outputs that match user intent and lower moderation costs.

Summary TLDR

This is a practical survey of how to align diffusion generative models with human intent. It reviews core alignment methods (RLHF, Direct Preference Optimization/DPO, and test-time techniques), catalogs major preference datasets and reward models, compares trade-offs (compute, stability, scalability), and lists open problems such as data scarcity, reward over-optimization, and benchmark bias. The paper highlights that DPO variants and test-time methods are being rapidly adopted for images and other modalities, while RLHF remains powerful but costly and brittle.
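
For readers who want the formulas behind the RLHF and DPO families named above, the standard textbook forms are sketched below. The notation is an assumption chosen for illustration, not the survey's own: $c$ is a prompt, $x$ a generated sample, $r$ a reward model, $p_{\mathrm{ref}}$ the frozen pretrained model, and $\beta$ the strength of the KL penalty.

```latex
% KL-regularized reward maximization, the usual RLHF fine-tuning objective
\max_{\theta}\;
  \mathbb{E}_{c \sim \mathcal{D},\; x \sim p_{\theta}(\cdot \mid c)}
  \big[\, r(x, c) \,\big]
  \;-\; \beta\, D_{\mathrm{KL}}\!\big( p_{\theta}(\cdot \mid c) \,\big\|\, p_{\mathrm{ref}}(\cdot \mid c) \big)

% Bradley-Terry model of a human preferring x^{w} over x^{l} for prompt c
P\big( x^{w} \succ x^{l} \mid c \big)
  = \sigma\big( r(x^{w}, c) - r(x^{l}, c) \big)
```

RLHF fits $r$ to preference pairs with the second expression and then optimizes the first. DPO-style methods use the same two expressions to train directly on the pairs without a separate reward model.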

Problem Statement

Diffusion models generate high-quality content but often miss user intent or produce undesirable outputs. Standard training optimizes likelihood, not human preference, so models can produce technically plausible but misaligned images. Aligning diffusion models requires new data formats, reward signals, and algorithms because generation is iterative, high-dimensional, and multimodal.

Main Contribution

Comprehensive review of alignment methods applied to diffusion models: RLHF, DPO, and test-time approaches

Catalog and comparison of major human-preference datasets and reward models for T2I

Key Findings

Alignment research is heavily concentrated on language models; diffusion model alignment is a small fraction.

Numbers: LLMs: 89.4% of studies; diffusion models: 10.6% (Google Scholar, Jan 15, 2026)

Practical Use: Expect fewer off-the-shelf alignment datasets and tools for diffusion models; plan extra effort to build or adapt preference data when working on image/video alignment.

Evidence Ref: Introduction (Fig. 2b)

Most reward models achieve under 80% accuracy at predicting pairwise human preferences on standard benchmarks.

Numbers: Reward-model accuracies are typically below 80% across benchmarks (Table 4)

Practical Use: Reward models are imperfect evaluators; validate with human checks and avoid over-reliance on a single reward model during tuning.

Evidence Ref: Section 5.2.1 (Table 4)
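
To make the accuracy figure above concrete, here is a minimal sketch of how pairwise preference accuracy is commonly computed: the share of human-labeled pairs in which the reward model scores the preferred image higher. The function name, the tie-handling rule, and the example scores are assumptions for illustration; none of the numbers come from the survey.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(scored_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the human-preferred image gets the higher score.

    Each pair is (score_of_preferred, score_of_rejected).
    Ties are counted as misses (an assumption; benchmarks may differ).
    """
    if not scored_pairs:
        raise ValueError("need at least one scored pair")
    hits = sum(1 for preferred, rejected in scored_pairs if preferred > rejected)
    return hits / len(scored_pairs)


# Hypothetical reward-model scores, for illustration only.
example_pairs = [(0.91, 0.40), (0.35, 0.62), (0.77, 0.75), (0.50, 0.50)]
accuracy = pairwise_accuracy(example_pairs)
print(f"pairwise accuracy: {accuracy:.2f}")
```

Running the same function with scores from two or more reward models on your own labeled pairs is one way to act on the advice not to rely on a single reward model.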

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | HPS v2: 83.3% (on HPD v2) | - | - | HPD v2 | Table 4 reports HPS v2 83.3% on HPD v2 | Section 5.2.1 (Table 4)
Accuracy | MPS: 74.2% (on MHP) | - | - | MHP | Table 4 reports MPS 74.2% on MHP | Section 5.2.1 (Table 4)

What To Try In 7 Days

Run a small human preference study (100–500 pairs) on your core prompts to measure current alignment

Try test-time alignment: use prompt optimization and attention control to improve prompt-following without retraining

Fine-tune with a simple DPO objective on the collected pairs, then compare reward scores and confirm with human checks
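
The DPO step above can be sketched as a per-pair loss. This is a minimal illustration of the generic DPO objective, not the survey's or any library's implementation: the function name `dpo_loss`, its arguments, and the default `beta=0.1` are assumptions. Diffusion variants typically substitute negative denoising errors at a sampled timestep for the intractable log-likelihoods; here the inputs are simply treated as log-likelihood surrogates.

```python
import math


def dpo_loss(
    logp_win: float,
    logp_lose: float,
    ref_logp_win: float,
    ref_logp_lose: float,
    beta: float = 0.1,  # hypothetical default; tune for your setup
) -> float:
    """Generic DPO loss for one preference pair.

    logp_* are log-likelihood surrogates under the model being tuned,
    ref_logp_* are the same quantities under the frozen reference model.
    """
    # How much more the tuned model favors the winner than the reference does.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Identical to the reference model: loss equals log(2).
baseline = dpo_loss(0.0, 0.0, 0.0, 0.0)
# Tuned model raised the winner and lowered the loser: loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(baseline, improved)
```

The loss falls only when the tuned model separates the preferred and rejected samples more than the reference does, which is what keeps the update anchored to the pretrained model.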

Agent Features

Frameworks
RLHF, DPO
Architectures
score-based diffusion, latent diffusion

Optimization Features

Infra Optimization
avoid full trajectory storage to reduce memory in RL fine-tuning
Model Optimization
Distillation for faster models (e.g., SD3-Turbo), LoRA
System Optimization
reuse of trajectories via importance sampling, few-step model alignment strategies
Training Optimization
Reward-weighted regression, KL regularization, gradient checkpointing
Inference Optimization
Test-time prompt optimization, attention control, initial noise optimization
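
Of the training options listed above, reward-weighted regression is the simplest to sketch. One common form weights each generated sample by exp(reward / temperature) and minimizes the weighted training loss. The code below is a minimal sketch under that assumption; the function names, the temperature default, and the example numbers are hypothetical, and the per-sample losses stand in for whatever denoising loss your trainer computes.

```python
import math
from typing import List, Sequence


def rwr_weights(rewards: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Normalized exponential weights, one per sample (softmax of reward / temperature)."""
    if not rewards:
        raise ValueError("need at least one reward")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(rewards)  # subtract the max for numerical stability
    unnormalized = [math.exp((r - peak) / temperature) for r in rewards]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_loss(per_sample_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Reward-weighted average of per-sample training losses."""
    if len(per_sample_losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    return sum(w * loss for w, loss in zip(weights, per_sample_losses))


# Hypothetical rewards and losses for three generated samples.
weights = rwr_weights([0.2, 0.9, 0.5], temperature=0.5)
loss = weighted_loss([1.3, 0.8, 1.0], weights)
print(weights, loss)
```

A lower temperature concentrates the weight on the highest-reward samples, which speeds up reward gains but also raises the risk of over-optimizing an imperfect reward model.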

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

HPD v1 / HPD v2 (referenced datasets), Pick-a-Pic v1, ImageRewardDB, MHP, RichHF-18K, VisionPrefer, DiffusionDB

Risks & Boundaries

Limitations

Survey summarizes many methods but does not provide new empirical benchmarks

Reward models and datasets are benchmark-specific and may not generalize

When Not To Use

If you lack any preference data or budget for even small human studies

When ultra-low-latency generation is required and test-time steering is infeasible

Failure Modes

Reward hacking, where the model maximizes a proxy metric while real quality degrades

Diversity collapse after reward-driven fine-tuning
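
One way to watch for the diversity collapse listed above is to track the mean pairwise distance between embeddings of images generated for the same prompt, before and after fine-tuning. The sketch below assumes you already have embedding vectors from some image encoder; the function names and the toy 2-D vectors are hypothetical and only illustrate the measurement.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def mean_pairwise_distance(embeddings: Sequence[Sequence[float]]) -> float:
    """Average cosine distance over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Toy embeddings: spread out before tuning, nearly identical after.
before = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
after = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]]
diversity_before = mean_pairwise_distance(before)
diversity_after = mean_pairwise_distance(after)
print(diversity_before, diversity_after)
```

A sharp drop in this number alongside a rising reward score is a signal to check samples by eye, since it can indicate both diversity collapse and reward hacking.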

Core Entities

Models

DDPM, DDIM, Latent Diffusion Model (LDM), Stable Diffusion 3 (SD3), SD3-Turbo, DALL-E 3

Metrics

Accuracy, CLIP score, HPS v2, PickScore, VP-Score, FID, Inception Score

Datasets

HPD v1, HPD v2, Pick-a-Pic v1, ImageRewardDB, MHP, RichHF-18K, Picsart Image-Social, VisionPrefer, DiffusionDB

Benchmarks

GENEval, GenAI-Bench, HEIM, VPEval, GEND-Eval (compositional tests)

Context Entities

Models

ReFL, DRaFT, TDPO-R, PRDP, SDPO, LaSRO

Metrics

Elo correlation, NSS/KLD/AUC-Judd (saliency), VQAScore (compositional correctness)

Datasets

DreamBench++ (AI-annotated benchmarks), VisionReward (multi-dimensional)