A better visual tokenizer lets language models match or beat diffusion models on ImageNet and video tasks

October 9, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

21

Authors

Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang

Why It Matters For Business

A better visual tokenizer lets language-model pipelines produce higher-quality images and videos with fewer inference steps. It also offers a compressed token format that speeds up downstream generation and saves bandwidth.

Summary TLDR

The paper introduces MAGVIT-v2, a video-aware visual tokenizer that combines a new lookup-free quantization (LFQ) scheme with architectural changes. With this tokenizer, masked language models (MLMs) achieve state-of-the-art image and video generation on benchmark datasets (ImageNet, Kinetics/UCF), deliver compression quality that human raters judge competitive with modern video codecs, and improve token-based action recognition. The main message: tokenizer design (vocabulary format and architecture) is a key lever for making language models match or outperform diffusion models on standard visual tasks.

Problem Statement

Language models for images and videos lag behind diffusion models. The paper argues that the bottleneck is the visual tokenizer, i.e. the discrete representation. It proposes MAGVIT-v2, which pairs a lookup-free quantizer with a causal, video-aware architecture to produce compact, expressive tokens that let LMs reach or exceed diffusion baselines under comparable data, model size, and compute.

Main Contribution

MAGVIT-v2: a joint image-video tokenizer with causal 3D convs and architectural tweaks.

Lookup-Free Quantization (LFQ): removes embedding lookups to grow vocabulary (e.g., 2^18 tokens) without large embedding tables.

Evidence that an LM + MAGVIT-v2 matches or beats diffusion on ImageNet and improves video generation, compression, and action-recognition tasks.
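To make the LFQ idea concrete, here is a minimal NumPy sketch, assuming each latent dimension is quantized independently by its sign and the token id is read off as a binary number. The function name `lfq_quantize` and the random latents are illustrative assumptions, not the authors' code.

```python
import numpy as np

def lfq_quantize(z):
    """Quantize latents of shape (..., num_bits) without an embedding lookup.

    Assumption: each dimension is binarized by its sign, and the token id
    is the integer whose binary digits are those signs.
    """
    bits = (z > 0).astype(np.int64)                 # 1 where the latent is positive
    codes = np.where(z > 0, 1.0, -1.0)              # quantized vector in {-1, +1}^num_bits
    weights = 2 ** np.arange(z.shape[-1], dtype=np.int64)  # 1, 2, 4, ...
    token_ids = (bits * weights).sum(axis=-1)       # integer id, no table needed
    return codes, token_ids

rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 18))    # 4 hypothetical latent vectors, 18 bits each
codes, token_ids = lfq_quantize(latents)
vocab_size = 2 ** latents.shape[-1]   # 18 bits give 2^18 = 262144 possible tokens
```

In this sketch the vocabulary grows by adding bits rather than rows to an embedding table: `lfq_quantize(np.array([1.0, -1.0, 1.0]))` maps to token id 5 (binary 101) without storing any codebook.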

Key Findings

On ImageNet 512×512 class-conditional generation, MLM + MAGVIT-v2 achieved FID 1.91 with guidance, versus FID 2.65 for the diffusion baseline VDM++.

Numbers: FID 1.91 vs 2.65 (512×512); 28% relative improvement

Video frame-prediction on Kinetics-600: MLM+MAGVIT-v2 reduced FVD from 9.9 (MAGVIT) to 5.2.

Numbers: FVD 5.2 vs 9.9 on K600

Human raters prefer MAGVIT-v2 reconstructions over those of MAGVIT and HEVC, and rate them comparable to VVC at similar bit rates. LPIPS (a perceptual metric, lower is better) also favors MAGVIT-v2: 0.104 vs 0.153 for VVC.

Numbers: LPIPS 0.104 (MAGVIT-v2) vs 0.153 (VVC) at 0.0384 bpp

Action recognition benefits: using MAGVIT-v2 tokens as inputs raises Kinetics-400 accuracy from 72.29% (MAGVIT) to 75.34%.

Numbers: K400 accuracy with token inputs 75.34% vs 72.29%

Vocabulary scaling behavior differs: with LFQ, reconstruction and generation both improve as the vocabulary grows; with standard VQ, LM generation degrades for very large vocabularies.

Numbers: LFQ + large vocab shows monotonic generation improvement (Fig. 1)

Results

Image generation FID (512×512, guided)

Value: 1.91 (MAGVIT-v2)

Baseline: VDM++ 2.65 (diffusion)

Video generation FVD

Value: 5.2 (MAGVIT-v2 on K600 frame-prediction)

Baseline: MAGVIT 9.9

Perceptual distortion (LPIPS) at 0.0384 bpp

Value: 0.104 (MAGVIT-v2)

Baseline: VVC 0.153; HEVC 0.199

Action recognition accuracy (Kinetics-400, tokens as input)

Value: 75.34% (MAGVIT-v2 on K400)

Baseline: MAGVIT 72.29%
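The relative improvements implied by these figures can be checked with a few lines of arithmetic. The sketch below only uses the values reported above; the helper name `relative_reduction` is an illustrative choice.

```python
def relative_reduction(baseline, value):
    """Fractional reduction of a lower-is-better metric relative to the baseline."""
    return (baseline - value) / baseline

fid_gain = relative_reduction(2.65, 1.91)      # ImageNet 512x512 FID vs VDM++
fvd_gain = relative_reduction(9.9, 5.2)        # K600 frame-prediction FVD vs MAGVIT
lpips_gain = relative_reduction(0.153, 0.104)  # LPIPS vs VVC at 0.0384 bpp
acc_gain_points = 75.34 - 72.29                # K400 accuracy, percentage points

print(f"FID:   {fid_gain:.1%} lower")           # FID:   27.9% lower
print(f"FVD:   {fvd_gain:.1%} lower")           # FVD:   47.5% lower
print(f"LPIPS: {lpips_gain:.1%} lower")         # LPIPS: 32.0% lower
print(f"K400:  +{acc_gain_points:.2f} points")  # K400:  +3.05 points
```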

Who Should Care

What To Try In 7 Days

Swap in a tokenization step similar to LFQ for your visual pipeline and compare generation quality (FID/LPIPS) on a small held-out set.

Use token factorization to avoid huge embedding tables and measure memory vs accuracy trade-offs.

Run a small subjective preference test for token-based compression vs standard codec at your target bitrates.
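For the subjective preference test, one simple way to judge whether raters really prefer one method is an exact two-sided sign test on the pairwise votes. The sketch below assumes ties have already been discarded; the vote counts are hypothetical placeholders, not results from the paper.

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact binomial test of 'no preference' (p = 0.5).

    Assumes ties were dropped before counting.
    """
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical tallies: raters picking token-based vs codec reconstructions.
wins_tokens, wins_codec = 34, 16
p_value = sign_test_p_value(wins_tokens, wins_codec)
print(f"p-value for no preference: {p_value:.4f}")
```

A small p-value suggests the preference is unlikely to be chance; with only a handful of clips or raters, expect the test to be inconclusive.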

Optimization Features

Token Efficiency

  • Shared image/video vocabulary and compact discrete tokens enable token-based compression and faster downstream generation

Model Optimization

  • Lookup-free quantizer removes embedding lookups to enable very large vocabularies
  • Token factorization reduces softmax/embedding memory cost
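To see why token factorization saves memory, consider a rough sketch that assumes an 18-bit vocabulary split into two 9-bit halves, each predicted with its own small table. The embedding width of 1024 and the helper names are hypothetical choices for illustration.

```python
def split_token(token_id, bits_per_part=9):
    """Split one large token id into (high, low) sub-tokens."""
    mask = (1 << bits_per_part) - 1
    return token_id >> bits_per_part, token_id & mask

def merge_token(high, low, bits_per_part=9):
    """Inverse of split_token."""
    return (high << bits_per_part) | low

embed_dim = 1024                        # hypothetical LM embedding width
full_params = (2 ** 18) * embed_dim     # one table over the full vocabulary
factored_params = 2 * (2 ** 9) * embed_dim  # two small tables, one per half

high, low = split_token(200_000)
assert merge_token(high, low) == 200_000    # the split loses no information
print(f"full: {full_params:,} params, factored: {factored_params:,} params")
```

Under these assumptions the factored tables are 256 times smaller, at the cost of predicting two sub-tokens per position instead of one.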

System Optimization

  • Causal 3D convolutions allow tokenizing single images and videos with the same model
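The key property of a causal temporal convolution is that each output frame depends only on the current and earlier frames, so a single image behaves like the first frame of a video. A minimal sketch of the time axis only, assuming zero padding of `kernel_size - 1` steps before the first frame (the function name and padding choice are illustrative assumptions):

```python
import numpy as np

def causal_temporal_conv(frames, kernel):
    """Convolve over time so that output t only sees inputs at times <= t.

    frames: array of shape (T, ...) with time on the first axis.
    kernel: 1D array; kernel[-1] multiplies the current frame.
    """
    k = len(kernel)
    pad = np.zeros((k - 1,) + frames.shape[1:])
    padded = np.concatenate([pad, frames], axis=0)   # pad the past only, never the future
    return np.stack([
        np.tensordot(kernel, padded[t:t + k], axes=1)
        for t in range(frames.shape[0])
    ])

kernel = np.array([0.25, 0.25, 0.5])
video = np.arange(1.0, 6.0)                # 5 frames, one scalar feature per frame
video_out = causal_temporal_conv(video, kernel)
image_out = causal_temporal_conv(video[:1], kernel)   # a single image is a 1-frame video

# The first video frame is encoded exactly like a standalone image.
print(np.allclose(video_out[0], image_out[0]))   # True
```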

Training Optimization

  • Entropy penalty and annealing to encourage codebook usage
  • LeCAM regularization for GAN stability
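One common way to write an entropy penalty of this kind is per-sample entropy (pushed down, so each token choice is confident) minus batch-level entropy (pushed up, so all codes get used). The sketch below assumes that form, assumes independent bits with sigmoid probabilities, and uses a hypothetical linear annealing schedule; none of these details are confirmed by the summary above.

```python
import numpy as np

def binary_entropy(p, eps=1e-8):
    """Entropy (in nats) of Bernoulli variables with probability p."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def entropy_penalty(z, temperature=1.0):
    """Per-sample entropy minus batch-level entropy, for z of shape (batch, bits)."""
    p = 1.0 / (1.0 + np.exp(-z / temperature))          # soft probability each bit is 1
    per_sample = binary_entropy(p).sum(axis=-1).mean()  # low when bits are confident
    batch_level = binary_entropy(p.mean(axis=0)).sum()  # high when bits are balanced
    return per_sample - batch_level

def annealed_weight(step, total_steps, start=0.1, end=0.01):
    """Hypothetical linear decay of the penalty weight over training."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)

balanced = entropy_penalty(np.array([[10.0, -10.0], [-10.0, 10.0]]))   # both bits in use
collapsed = entropy_penalty(np.array([[10.0, 10.0], [10.0, 10.0]]))    # one code only
print(balanced < collapsed)   # True: balanced code usage gets the lower penalty
```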

Inference Optimization

  • MLM decoding uses far fewer sampling steps than some diffusion models (e.g., 64 vs 250)
  • Discrete tokens allow direct feed into LM pipelines without decoding to raw pixels first

Reproducibility

Data Urls

  • ImageNet
  • Kinetics-600
  • Kinetics-400
  • UCF-101
  • MCL-JCV
  • SSv2

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • High compute and TPU usage for training and decoding; CPU efficiency not addressed.
  • LFQ relies on a binary-independence assumption; other quantizers may behave differently.
  • Comparisons limited to standard academic datasets; results on large proprietary web-scale data not shown.

When Not To Use

  • When you must run on CPU-only systems where neural codecs are too slow.
  • When you need guaranteed bit-exact decoding with existing codec toolchains.
  • When the training budget is too small to train large token vocabularies and MLM decoders.

Failure Modes

  • Very large vocabularies with standard VQ can hurt MLM generation (vocab-size sensitivity).
  • LFQ's binary independence may lose some representational nuances on complex textures.
  • Decoder artifacts or temporal flicker can appear if the tokenizer or decoder underfits motion.

Core Entities

Models

  • MAGVIT-v2
  • MAGVIT
  • LFQ (Lookup-Free Quantization)
  • VQ-VAE
  • Masked LM (MLM)
  • Autoregressive LM (AR-LM)

Metrics

  • FID
  • FVD
  • LPIPS
  • PSNR
  • MS-SSIM
  • Inception Score (IS)
  • Accuracy

Datasets

  • ImageNet
  • Kinetics-600
  • Kinetics-400
  • UCF-101
  • MCL-JCV
  • SSv2

Benchmarks

  • ImageNet class-conditional generation
  • Kinetics-600 frame prediction
  • UCF-101 class-conditional generation
  • MCL-JCV compression study
  • Action recognition (K400, K600, SSv2)