Overview
The survey synthesizes many recent methods and datasets and cites empirical comparisons, which makes its guidance actionable. Empirical results vary by benchmark and reward model, however, so apply the recommendations with caution and verify them with human checks.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
Aligning diffusion models cuts customer friction and reduces safety risks; aligned models produce outputs that match user intent and lower moderation costs.
Summary TLDR
This is a practical survey of how to align diffusion generative models with human intent. It reviews the core alignment methods (RLHF, Direct Preference Optimization (DPO), and test-time techniques), catalogs the major preference datasets and reward models, compares trade-offs in compute, stability, and scalability, and lists open problems such as data scarcity, reward over-optimization, and benchmark bias. The paper highlights that DPO variants and test-time methods are being adopted rapidly for images and other modalities, while RLHF remains powerful but costly and brittle.
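As a rough illustration of the DPO idea named above, here is a minimal sketch of the generic DPO loss for a single preference pair. It assumes log-likelihoods are available for the preferred and rejected samples under both the model being tuned and a frozen reference model. Diffusion variants typically substitute a tractable surrogate for these log-likelihoods, which this sketch does not cover. The function name `dpo_loss`, its parameters, and the default `beta` are illustrative assumptions, not taken from the survey.

```python
import math


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Generic DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: log-likelihood of the preferred / rejected sample
    under the model being tuned (hypothetical inputs).
    ref_logp_w / ref_logp_l: the same quantities under a frozen reference.
    beta: strength of the implicit penalty for drifting from the reference
    (0.1 is an assumed default, not a recommendation).
    """
    # How much more the tuned model favors the preferred sample than the
    # reference does, scaled by beta.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up log-likelihoods: the tuned model has moved toward the preferred
# sample, so the loss falls below the log(2) value of an unchanged model.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss equals log(2) when the tuned model and the reference agree, and it shrinks as the tuned model shifts probability toward the preferred sample relative to the reference.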
Problem Statement
Diffusion models generate high-quality content but often miss user intent or produce undesirable outputs. Standard training optimizes likelihood, not human preference, so models can produce technically plausible but misaligned images. Aligning diffusion models requires new data formats, reward signals, and algorithms because generation is iterative, high-dimensional, and multimodal.
Main Contribution
Comprehensive review of alignment methods applied to diffusion models: RLHF, DPO, and test-time approaches
Catalog and comparison of major human-preference datasets and reward models for T2I
Key Findings
Alignment research is heavily concentrated on language models; diffusion model alignment is a small fraction.
Most reward models achieve under 80% pairwise preference prediction accuracy on standard benchmarks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | HPS v2: 83.3% (on HPD v2) | — | — | HPD v2 | Table 4 reports HPS v2 83.3% on HPD v2 | Section 5.2.1 (Table 4) |
| Accuracy | MPS: 74.2% (on MHP) | — | — | MHP | Table 4 reports MPS 74.2% on MHP | Section 5.2.1 (Table 4) |
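The accuracy figures in the table are pairwise preference prediction scores. A minimal sketch of how such a number could be computed on your own labeled pairs follows. The function name, the input layout, and the tie-handling rule are assumptions for illustration; the benchmarks cited above may treat ties differently.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(scored_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the human-preferred item gets the higher score.

    Each element is (score_of_preferred, score_of_rejected), as produced by
    whatever reward model is being evaluated. Ties count as half correct,
    which is an assumption of this sketch.
    """
    if not scored_pairs:
        raise ValueError("need at least one scored pair")
    correct = 0.0
    for preferred, rejected in scored_pairs:
        if preferred > rejected:
            correct += 1.0
        elif preferred == rejected:
            correct += 0.5
    return correct / len(scored_pairs)


# Hypothetical reward-model scores for four human-labeled pairs:
# two correct, one wrong, one tie.
example = [(0.9, 0.2), (0.4, 0.7), (0.6, 0.6), (0.8, 0.1)]
print(pairwise_accuracy(example))  # 0.625
```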
What To Try In 7 Days
Run a small human preference study (100–500 pairs) on your core prompts to measure current alignment
Evaluate test-time alignment: try prompt optimization and attention control to improve prompt-following without retraining
Fine-tune a model with a simple DPO objective on the collected pairs, then compare reward scores and confirm the result with human checks
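With only 100 to 500 pairs, a measured win rate carries noticeable sampling uncertainty. The sketch below reports a Wilson score interval alongside the win rate so that a small improvement is not mistaken for a real one. The function name and the example counts are hypothetical, and the 1.96 default corresponds to an approximate 95% interval.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate of `wins` out of `total` pairs.

    z = 1.96 gives an approximate 95% interval.
    """
    if total <= 0 or not 0 <= wins <= total:
        raise ValueError("require total > 0 and 0 <= wins <= total")
    p = wins / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Hypothetical study: raters preferred the aligned model in 120 of 200 pairs.
low, high = wilson_interval(120, 200)
print(f"win rate 0.60, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

If the lower bound stays above 0.5, the preference for the aligned model is unlikely to be a sampling artifact at this study size.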
Risks & Boundaries
Limitations
Survey summarizes many methods but does not provide new empirical benchmarks
Reward models and datasets are benchmark-specific and may not generalize
When Not To Use
If you have no preference data and no budget for even a small human study
When ultra-low-latency generation is required and test-time steering is infeasible
Failure Modes
Reward hacking, where the model maximizes a proxy metric while real quality degrades
Diversity collapse after reward-driven fine-tuning
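One simple way to watch for diversity collapse is to compare the spread of outputs for the same prompt before and after fine-tuning. The sketch below uses mean pairwise cosine distance over embedding vectors. It assumes the embeddings come from some image encoder of your choice; the function name and the toy vectors are hypothetical, and this metric is one possible proxy rather than the survey's own measure.

```python
import math
from itertools import combinations
from typing import Sequence


def mean_pairwise_cosine_distance(embeddings: Sequence[Sequence[float]]) -> float:
    """Average cosine distance over all pairs of embedding vectors.

    Lower values mean the samples are more alike. A sharp drop after
    reward-driven fine-tuning is a warning sign of diversity collapse.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two embeddings")

    def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        if norm_a == 0.0 or norm_b == 0.0:
            raise ValueError("zero-length embedding")
        return 1.0 - dot / (norm_a * norm_b)

    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Toy 2-D vectors standing in for image embeddings of one prompt's samples.
before = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # spread out
after = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]    # nearly identical
print(mean_pairwise_cosine_distance(before) > mean_pairwise_cosine_distance(after))  # True
```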

