[22’ ICLR-WS] Video Diffusion Models

Date: 2024.04.03 Updated: 2024.04.03

카테고리: Generation

0. Abstract

Video generation을 위한 diffusion model을 제안하였다.
Factorized space-time U-Net을 사용하였기에 image data 및 video data로부터 joint training이 가능하고, 이를 통해 최적화 과정이 빨라지고 안정화된다.
Reconstruction-guided sampling이라는 spatial & temporal video extension 방법을 제안하였다.

1. Video Diffusion Models

1.1. Factorized space-time U-Net

저자들은 아주 간단하게 U-Net을 확장시켜 video generation이 가능하도록 했다. Frame 수를 새로운 first axis로 추가하고, 해당 축에 대하여 temporal attention block을 추가했다. 이러한 방법은 temporal attention과 spatial attention을 따로 분리하는 방법으로 해석할 수 있기 때문에 factorized space-time attention이라고 한다. Video transformer에서는 이것이 계산적으로 효율적인 것이 이미 알려져 있다. 또한 간단히 temporal attention block을 적용하지 않으면 여러 이미지에 대한 image generation으로 해석할 수 있고, 따라서 image data에 대해서도 joint training이 가능하다. 이러한 joint training은 sample quality 향상에 아주 큰 도움이 된다. 직관적으로도 image data에 대한 훈련은 spatial attention block의 성능 개선에 도움이 될 것으로 보인다.

1.2. Reconstruction-guided Sampling

Background

Data $\mathbf{x}$에 대하여 본 논문은 아래와 같은 predictor-corrector sampler를 통해 sampling을 진행한다. 먼저 predictor인 ancestral sampler step을 보면, reverse process $\mathbf{z}_ t \rightarrow \mathbf{z}_ s$ 과정인 다음 식을 통하여 $\mathbf{z}_ s$를 얻는다. 이는 Improved DDPM의 식과 동일하다.

\[\mathbf{z}_ s = \tilde{\mu}_ {s \vert t} (\mathbf{z}_ t, \hat{x}_ \theta (\mathbf{z}_ t)) + \sqrt{(\tilde{\sigma}_ {s \vert t} ^ 2)^ {1 - \gamma}(\sigma_ {t \vert s} ^ 2) ^ \gamma} \epsilon\]

이후 corrector인 Langevin MCMC를 통해 다음과 같이 $\mathbf{z}_ s$를 업데이트한다.

\[\mathbf{z}_ s \leftarrow \mathbf{z}_ s - \frac{1}{2} \sigma_ s \epsilon_ \theta (\mathbf{z}_ s) + \sqrt{\delta} \sigma_ s \epsilon^ \prime\]

Replacement vs Reconstruction Guidance

긴 video를 한 번에 만드는 것은 매우 어렵다. 따라서 본 논문에서는 한 번에 16프레임만 생성하고 autoregressive하게 이후 frame을 만드는 것을 제안한다. 즉, 먼저 첫 16프레임 $\mathbf{x} ^ a \sim p_\theta (\mathbf{x})$를 생성하고, 이후 이어지는 영상 $\mathbf{x} ^ b \sim p_\theta (\mathbf{x} ^ b \vert \mathbf{x} ^ a)$를 생성한다. 혹은 낮은 프레임 수로 $\mathbf{x} ^ a$를 생성하고, 이후 그 사이를 높은 프레임 수로 $\mathbf{x} ^ b$를 생성해 채워넣을 수도 있다. 그러나 이러한 방법은 conditional model $p_\theta (\mathbf{x} ^ b \vert \mathbf{x} ^ a)$가 필요하다. 논문에서는 이를 피하고, 대신 unconditional model로부터 근사적으로 $\mathbf{x} ^ b$를 생성하는 방법을 제안한다.

SBM(Score-based Model)에서 제안되었던 방법은 replacement method라고 부른다. 먼저 jointly trained diffusion model $p_\theta (\mathbf{x} = \left[ \mathbf{x} ^ a, \mathbf{x} ^ b \right])$를 훈련한다. 이를 통해 sampling step $p_\theta (\mathbf{z}_ s \vert \mathbf{z}_ t)$에서 일단 $\mathbf{z}_ s = \left[ \mathbf{z}_ s ^ a, \mathbf{z}_ s ^ b \right]$를 얻는다. 그리고 우리가 알고 있는 $\mathbf{x} ^ a$를 이용하여 $\hat{\mathbf{z}} _ s ^ a$를 생성하고, 이를 원래 $\mathbf{z}_ s ^ a$와 교체한다. (이것이 replacement method라고 이름 붙는 이유이다.) 이를 다시 diffusion model $\hat{\mathbf{x}}_ \theta (\left[ \hat{\mathbf{z}} _ s ^ a, \mathbf{z}_ s ^ b \right])$에 넣고 다음 reverse process를 진행한다.

그러나 이러한 방법으로 생성된 $\mathbf{x}^ b$는 그 자체로는 괜찮을지 모르나 $\mathbf{x} ^ a$와의 coherence가 떨어진다. 마치 RePaint에서 semantic consistency가 떨어지는 것과 같은 원리이다. 저자들은 이 문제를 다음과 같이 해석한다. $\mathbf{z}_ s ^ b$는 denoising step을 거치면서 $\hat{x}_ \theta ^ b (\mathbf{z}_ t) \approx \mathbb{E}_ q \left[\mathbf{x}^ b \vert \mathbf{z}_ t \right]$에 의해 업데이트되는데, 실제로 필요한 것은 $\mathbb{E}_ q \left[\mathbf{x}^ b \vert \mathbf{z}_ t, \mathbf{x} ^ a \right]$라는 것이다. 즉, $\mathbf{x} ^ a$의 영향을 고려해야 한다. 두 expectation value의 차이를 score function의 측면에서 보면 다음과 같이 나타낼 수 있다고 한다.

\[\mathbb{E}_ q \left[\mathbf{x}^ b \vert \mathbf{z}_ t, \mathbf{x} ^ a \right] = \mathbb{E}_ q \left[\mathbf{x}^ b \vert \mathbf{z}_ t \right] + \frac{\sigma_ t ^ 2}{\alpha _t} \nabla _ {\mathbf{z} _ t ^ b} \log q(\mathbf{x} ^ a \vert \mathbf{z}_ t)\]

따라서 replacement method는 여기서 두 번째 항을 무시하였기에 문제가 생겼다는 것이다. 그러나 문제는 $q(\mathbf{x} ^ a \vert \mathbf{z}_ t)$를 closed form으로 얻을 수 없다는 것이고, 대신 다음과 같이 Gaussian으로 근사한다.

\[q(\mathbf{x} ^ a \vert \mathbf{z}_ t) \approx \mathcal{N}(\hat{\mathbf{x}}_ \theta ^ a (\mathbf{z}_ t), \frac{\sigma_ t ^ 2}{\alpha _t ^ 2} \mathbf{I})\]

이 Gaussian 근사를 통해 아래와 같이 adjusted denoising model $\tilde{\mathbf{x}}_ \theta ^ b$를 얻을 수 있다. 이때 $w_r$은 weighting factor를 추가한 것이다. 이러한 방법을 reconstruction-guided sampling 혹은 reconstruction guidance라고 하고, 다른 guidance와 마찬가지로 $w_r \gt 1$인 큰 값일수록 sample quality가 향상된다고 한다.

\[\tilde{\mathbf{x}}_ \theta ^ b (\mathbf{z}_ t) = \hat{\mathbf{x}}_ \theta ^ b (\mathbf{z}_ t) - \frac{w_r \alpha_t}{2} \nabla_ {\mathbf{z}_ t ^ b} \Vert \mathbf{x} ^ a - \hat{\mathbf{x}}_ \theta ^ a (\mathbf{z}_ t) \Vert ^ 2 _ 2\]

이러한 sampling은 spatial interpolation (super-resolution)에도 적용할 수 있다. 이제 $\mathbf{x} ^ a$는 이전에 생성한 video가 아니라 low-resolution video이다. 디테일이지만 $\mathbf{x} ^ a$와 $\hat{\mathbf{x}}_ \theta ^ a (\mathbf{z}_ t)$의 차이를 계산하기 위해 model output은 downsampling 된다.

\[\tilde{\mathbf{x}}_ \theta (\mathbf{z}_ t) = \hat{\mathbf{x}}_ \theta (\mathbf{z}_ t) - \frac{w_r \alpha_t}{2} \nabla_ {\mathbf{z}_ t } \Vert \mathbf{x} ^ a - \hat{\mathbf{x}}_ \theta ^ a (\mathbf{z}_ t) \Vert ^ 2 _ 2\]

이러한 temporal & spatial interpolation 방법을 연속적으로 적용하여 더 길고, 더 선명한 video 생성이 가능하다. 예를 들어 아래 그림은 먼저 16x64x64 sample을 생성한 후, 9x128x128로 upsampling하면서 동시에 autoregressive하게 샘플을 생성해 최종적으로 4배의 프레임을 가지면서 더 선명한 64x128x128 sample을 생성한 결과를 보여준다.

2. Experiments

2.1. Unconditional Video Modeling

SOTA와의 비교에서 Video Diffusion은 놀라울 정도로 좋은 성능을 보여준다.

2.2. Video Prediction

Conditional generation task에서의 성능을 확인한다. 이때 reconstruction guidance를 사용하였다. Video Diffusion 중에서도 Langevin sampler를 사용한 경우의 성능이 더 좋았다. Step 수가 ancestral sampler와 Langevin sampler가 2배 차이나는데, 이는 Langevin sampler가 실제로는 ancestral sampler + Langevin MCMC를 사용하기 때문이다.

2.2. Text-conditioned Video Generation

Joint training on video and image modeling

Factorized space-time U-Net을 사용하였기 때문에 image data와 video data를 모두 활용한 joint training이 가능하다. 즉, image data training 때에는 temporal attention block masking 후 진행하면 된다. 이를 통해 모델의 spatial knowledge를 키워주는 것이다. 아래 결과를 보면 image data에 대한 joint training이 video data에 대한 성능을 향상시키는 것을 알 수 있다.

Effect of classifier-free guidance

아래 결과들은 CFG의 diversity-fidelity tradeoff를 보여준다.

Autoregressive video extension for longer sequences

여기서는 replacement method와 reconstruction guidance를 비교하였다. Reconstruction guidance가 더 좋은 성능을 보여주었다. 실제 사진을 보면 replacement method는 autoregressive하게 생성한 video 사이가 부자연스럽게 연결되어 있는 것을 볼 수 있다. 즉, temporal coherence가 부족하다.

💡 Summary

Diffusion model을 활용하여 video generation을 수행하는 Video Diffusion Models를 제안하였으며, 이는 diffusion 계열 video generation의 첫 시도들 중 하나이다.
가장 큰 contribution은 reconstruction guidance로, 이를 통해 spatial & temporal video extension을 가능하게 하였다.
Factorized space-time U-Net을 사용하여 image data 및 video data로부터 joint training이 가능하고, 이를 통해 최적화 과정이 빨라지고 안정화된다.
Opinion: Video generation 연구에 있어서 매우 중요한 논문이라고 생각한다. 여기서 Sora가 나오는 데까지 2년밖에 걸리지 않았다는 것도 놀랍다.

I'm rubatoyeong