Survey: aligning diffusion models to human preferences — methods, benchmarks, and open problems

September 11, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

2

Authors

Buhua Liu, Shitong Shao, Bao Li, Lichen Bai, Zhiqiang Xu, Haoyi Xiong, James Kwok, Sumi Helal, Zeke Xie

Links

Abstract / PDF

Why It Matters For Business

Aligning diffusion models cuts customer friction and reduces safety risks; aligned models produce outputs that match user intent and lower moderation costs.

Summary TLDR

This is a practical survey of how to align diffusion generative models with human intent. It reviews core alignment methods (RLHF, Direct Preference Optimization (DPO), and test-time techniques), catalogs major preference datasets and reward models, compares trade-offs (compute, stability, scalability), and lists open problems such as data scarcity, reward over-optimization, and benchmark bias. The paper highlights that DPO variants and test-time methods are being rapidly adopted for images and other modalities, while RLHF remains powerful but costly and brittle.

Problem Statement

Diffusion models generate high-quality content but often miss user intent or produce undesirable outputs. Standard training optimizes likelihood, not human preference, so models can produce technically plausible but misaligned images. Aligning diffusion models requires new data formats, reward signals, and algorithms because generation is iterative, high-dimensional, and multimodal.

Main Contribution

Comprehensive review of alignment methods applied to diffusion models: RLHF, DPO, and test-time approaches

Catalog and comparison of major human-preference datasets and reward models for text-to-image (T2I) generation

Cross-domain review: video, audio, motion, 3D, molecule generation alignment

Clear summary of open challenges and concrete future directions (pluralistic preferences, data-centric learning, self-alignment)

Key Findings

Alignment research is heavily concentrated on language models; diffusion model alignment is a small fraction.

Numbers: LLMs: 89.4% of studies; diffusion models: 10.6% (Google Scholar, Jan 15, 2026)

Most reward models achieve under 80% pairwise preference prediction on standard benchmarks.

Numbers: Reward-model accuracies typically <80% across benchmarks (Table 4)

Training paradigms trade off compute, stability, and scalability.

Numbers: The comparison table lists RLHF (high compute, low scalability), DPO (moderate compute), and test-time methods (low to moderate compute)

Reward over-optimization and brittleness are common failure modes in fine-tuning.

Numbers: Multiple works report reward hacking and alignment backfire (citations in Sections 4.1, 4.5)
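The pairwise preference accuracy cited in the findings above can be computed with a few lines of code. The following is a minimal sketch, not the survey's evaluation script: the `(prompt, preferred, rejected)` tuple layout and the `reward` callable are hypothetical, and counting ties as half credit is one convention among several.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical schema: (prompt, human-preferred item, rejected item).
Pair = Tuple[str, str, str]


def pairwise_accuracy(
    pairs: Sequence[Pair],
    reward: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the reward model ranks the preferred item higher."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = 0.0
    for prompt, preferred, rejected in pairs:
        r_pref = reward(prompt, preferred)
        r_rej = reward(prompt, rejected)
        if r_pref > r_rej:
            correct += 1.0
        elif r_pref == r_rej:
            correct += 0.5  # assumption: a tie earns half credit
    return correct / len(pairs)
```

A score near 0.5 means the reward model is no better than chance on that dataset, which is the baseline to keep in mind when reading the sub-80% figures.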

Results

Accuracy

Value: HPS v2: 83.3% (on HPD v2)

Accuracy

Value: MPS: 74.2% (on MHP)

Community distribution of research

Value: LLMs: 89.4% vs Diffusion: 10.6% (paper counts)

Who Should Care

What To Try In 7 Days

Run a small human preference study (100–500 pairs) on your core prompts to measure current alignment

Evaluate test-time alignment: try prompt optimization and attention control to improve prompt-following without retraining

Fine-tune a model with a simple DPO objective on the collected pairs, then compare reward scores and confirm the result with human checks
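With only 100–500 pairs, the measured win rate from the preference study carries noticeable sampling error, so it helps to report an interval rather than a single number. The sketch below uses the Wilson score interval for a binomial proportion; treating each pair as an independent trial is an assumption, and the function name is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("require n > 0 and 0 <= wins <= n")
    p = wins / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width
```

For example, 60 wins out of 100 pairs gives an interval of roughly 0.50 to 0.69, so a 60% win rate at that sample size is only weak evidence of a real preference. The same win rate over 500 pairs yields a visibly tighter interval.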

Agent Features

Frameworks

  • RLHF
  • DPO
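For readers unfamiliar with DPO, the standard objective on a single preference pair can be sketched as follows. This is the generic log-probability form, not code from the survey; the argument names and the default `beta` are illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    logp_w: float,      # policy log-prob of the preferred (winning) sample
    logp_l: float,      # policy log-prob of the rejected (losing) sample
    ref_logp_w: float,  # frozen reference model log-prob of the winner
    ref_logp_l: float,  # frozen reference model log-prob of the loser
    beta: float = 0.1,  # assumption: strength of the implicit KL constraint
) -> float:
    """-log(sigmoid(beta * (winner log-ratio - loser log-ratio)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return softplus(-margin)  # identical to -log(sigmoid(margin))
```

When the policy equals the reference the loss is log 2, and it falls as the policy shifts probability toward the preferred sample relative to the reference. Diffusion-specific variants typically replace the log-probabilities with quantities derived from per-step denoising errors, because exact image likelihoods are not tractable; that substitution is not shown here.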

Architectures

  • score-based diffusion
  • latent diffusion

Optimization Features

Infra Optimization

  • avoid full trajectory storage to reduce memory in RL fine-tuning

Model Optimization

  • Distillation for faster models (e.g., SD3-Turbo)
  • LoRA

System Optimization

  • reuse of trajectories via importance sampling
  • few-step model alignment strategies
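Reusing trajectories sampled from an older policy requires correcting for the mismatch between the old and current policies. One common way to do this is a PPO-style clipped importance ratio, sketched below for a single denoising step. The summary does not say which correction the surveyed methods use, so the clipping rule and the `clip_eps` default are assumptions.

```python
import math


def clipped_is_objective(
    logp_new: float,   # log-prob of the stored action under the current policy
    logp_old: float,   # log-prob under the policy that sampled the trajectory
    advantage: float,  # reward-derived advantage for this trajectory
    clip_eps: float = 0.2,
) -> float:
    """PPO-style surrogate objective (to be maximized) for one reused step."""
    ratio = math.exp(logp_new - logp_old)  # importance weight
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped * advantage)
```

The clip bounds how far a single batch of stale trajectories can move the policy, which is one reason this form is popular when samples are expensive to generate.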

Training Optimization

  • Reward-weighted regression
  • KL regularization
  • gradient checkpointing
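Reward-weighted regression and KL regularization from the list above combine naturally into one training loss. The sketch below shows a softmax-weighted variant; the normalization, the `temperature` and `kl_coef` defaults, and the assumption that per-sample losses and KL estimates are already computed are all illustrative choices rather than details from the survey.

```python
import math
from typing import List, Sequence


def rwr_weights(rewards: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax weights over rewards: higher-reward samples count for more."""
    if not rewards:
        raise ValueError("need at least one reward")
    top = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def regularized_rwr_loss(
    per_sample_losses: Sequence[float],  # e.g. denoising loss per generated sample
    rewards: Sequence[float],
    kl_estimates: Sequence[float],       # per-sample KL estimate vs. the reference model
    temperature: float = 1.0,
    kl_coef: float = 0.01,
) -> float:
    """Reward-weighted loss plus a KL penalty that discourages drift."""
    weights = rwr_weights(rewards, temperature)
    weighted = sum(w * loss for w, loss in zip(weights, per_sample_losses))
    kl_penalty = kl_coef * sum(kl_estimates) / len(kl_estimates)
    return weighted + kl_penalty
```

Lowering `temperature` concentrates the weight on the top-scoring samples, which speeds up reward gains but makes over-optimization more likely; raising `kl_coef` pulls in the opposite direction.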

Inference Optimization

  • Test-time prompt optimization
  • attention control
  • initial noise optimization
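The simplest form of test-time steering over the initial noise is a best-of-N search over random seeds, which needs no gradients and no retraining. The sketch below swaps this search in for the gradient-based noise optimization some methods use; `generate` and `reward` are hypothetical callables standing in for a sampler and a reward model.

```python
import random
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def best_of_n_seeds(
    generate: Callable[[int], T],  # hypothetical: seed -> generated sample
    reward: Callable[[T], float],  # hypothetical: sample -> scalar score
    n: int = 8,
    rng_seed: int = 0,
) -> Tuple[int, T, float]:
    """Try n random seeds and return (seed, sample, score) for the best one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(rng_seed)
    seed = rng.randrange(2**31)
    sample = generate(seed)
    best = (seed, sample, reward(sample))
    for _ in range(n - 1):
        seed = rng.randrange(2**31)
        sample = generate(seed)
        score = reward(sample)
        if score > best[2]:
            best = (seed, sample, score)
    return best
```

Cost scales linearly with `n`, so this fits the low-to-moderate compute profile of test-time methods only for small `n`. Selecting on a reward model also inherits that model's blind spots.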

Reproducibility

Data Urls

  • HPD v1 / HPD v2 (referenced datasets)
  • Pick-a-Pic v1
  • ImageRewardDB
  • MHP
  • RichHF-18K
  • VisionPrefer
  • DiffusionDB

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Survey summarizes many methods but does not provide new empirical benchmarks
  • Reward models and datasets are benchmark-specific and may not generalize
  • Practical deployment guidance is constrained by compute and annotation costs

When Not To Use

  • If you lack any preference data or budget for even small human studies
  • When ultra-low-latency generation is required and test-time steering is infeasible
  • When model updates are impossible and reward-model evaluation would be circular

Failure Modes

  • Reward hacking where model maximizes proxy metric but degrades real quality
  • Diversity collapse after reward-driven fine-tuning
  • Poisoning of reward datasets (BadReward) causing unsafe outputs
  • Safety alignment backfire where suppressed concepts re-emerge after fine-tuning
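Diversity collapse in particular is cheap to monitor. One rough check, sketched below, compares the mean pairwise distance between embeddings of samples generated for the same prompt before and after fine-tuning. The choice of embedding model, the Euclidean metric, and any alert threshold are assumptions left to the reader.

```python
import math
from typing import Sequence


def mean_pairwise_distance(embeddings: Sequence[Sequence[float]]) -> float:
    """Average Euclidean distance over all unordered pairs of embeddings."""
    n = len(embeddings)
    if n < 2:
        raise ValueError("need at least two embeddings")
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += math.dist(embeddings[i], embeddings[j])
    return total / (n * (n - 1) / 2)


def diversity_ratio(
    before: Sequence[Sequence[float]],
    after: Sequence[Sequence[float]],
) -> float:
    """Ratio well below 1.0 suggests samples became more alike after fine-tuning."""
    baseline = mean_pairwise_distance(before)
    if baseline == 0.0:
        raise ValueError("baseline samples are identical; ratio is undefined")
    return mean_pairwise_distance(after) / baseline
```

Tracking this ratio alongside the reward score helps distinguish genuine quality gains from a model that has converged on a few high-scoring outputs.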

Core Entities

Models

  • DDPM
  • DDIM
  • Latent Diffusion Model (LDM)
  • Stable Diffusion 3 (SD3)
  • SD3-Turbo
  • DALL-E 3

Metrics

  • Accuracy
  • CLIP score
  • HPS v2
  • PickScore
  • VP-Score
  • FID
  • Inception Score

Datasets

  • HPD v1
  • HPD v2
  • Pick-a-Pic v1
  • ImageRewardDB
  • MHP
  • RichHF-18K
  • Picsart Image-Social
  • VisionPrefer
  • DiffusionDB

Benchmarks

  • GenEval
  • GenAI-Bench
  • HEIM
  • VPEval
  • GEND-Eval (compositional tests)

Context Entities

Models

  • ReFL
  • DRaFT
  • TDPO-R
  • PRDP
  • SDPO
  • LaSRO

Metrics

  • Elo correlation
  • NSS/KLD/AUC-Judd (saliency)
  • VQAScore (compositional correctness)

Datasets

  • DreamBench++ (AI-annotated benchmarks)
  • VisionReward (multi-dimensional)