A better visual tokenizer lets language models match or beat diffusion models on ImageNet and video tasks
A better visual tokenizer can let language-model pipelines generate higher-quality images and videos in fewer inference steps. It also provides a compressed token format that speeds up downstream generation and saves bandwidth.
Key finding
On ImageNet 512×512 class-conditional generation, a masked language model (MLM) with the MAGVIT-v2 tokenizer achieved an FID of 1.91 with guidance, versus 2.65 for the diffusion baseline VDM++ (lower FID is better).
Numbers: FID 1.91 vs 2.65 (512×512); about 28% relative reduction in FID
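The 28% figure can be checked directly from the two FID values. The snippet below is a minimal sketch of that arithmetic; the function name `relative_improvement` is an illustrative assumption, not something from the source, and it treats the metric as lower-is-better, as FID is.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Fractional reduction of a lower-is-better metric such as FID."""
    return (baseline - new) / baseline


# FID values quoted above for ImageNet 512x512 class-conditional generation.
fid_vdmpp = 2.65      # diffusion baseline (VDM++)
fid_magvit_v2 = 1.91  # MLM + MAGVIT-v2, with guidance

improvement = relative_improvement(fid_vdmpp, fid_magvit_v2)
print(f"Relative FID reduction: {improvement:.1%}")
```

The result is just under 28%, so the quoted figure is a rounding of (2.65 - 1.91) / 2.65.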

