[21’ ICLR] DDPM: Denoising Diffusion Probabilistic Models

Date: 2024.03.19 Updated: 2024.03.21

카테고리: Generation

태그: Diffusion

💡 이 글은 『2024 YAI 봄전반기 생성모델팀』으로 진행되었습니다.

0. Abstract

Nonequilibrium thermodynamics에서 영감을 받아 latent variable model의 일종인 diffusion probabilistic model을 통해 이미지를 생성하였다.
Denoising score matching, Langevin dynamics와 diffusion probabilistic model을 결합시켜 weighted variational bound를 최적화하는 과정을 통해 모델을 훈련했다.
논문에서 제안한 모델은 progressive lossy decompression scheme (즉, 노이즈가 있는 이미지에서 이미지를 조금씩 복구하는 방식)으로 이미지를 생성하며, 이는 generalization of autoregrssive decoding (즉, DDPM은 연속적인 autoregressive decoding을 수행한 것)으로 해석될 수 있다.
DDPM은 생성모델로 SOTA를 달성했고, 다른 생성모델과 비교했을 때 높은 품질의 이미지를 생성할 수 있다.

1. Introduction

본 논문의 contribution은 다음과 같다.

Diffusion model이 다른 생성모델들보다 높은 품질의 샘플을 생성해낼 수 있음을 보였다.
Diffusion model에 특정한 parametrization을 시키면, 학습 시에는 multiple noise level에서의 denoising score matching과 동일하고, 샘플링 시에는 annealed Langevin dynamics와 동일함을 보였다(Equivalence).
Sample quality는 뛰어나지만, log-likelihood는 다른 likelihood-based model들보다 떨어진다. 이때 lossless codelength가 인지할 수 없을 정도의 이미지 디테일을 설명하는 데 사용됨을 확인하였다. 이를 lossy compression으로 설명했다.

2. Background

2.1. Forward & Reverse Process

Diffusion model은 다음과 같이 정의된 latent variable model이다. 이때 latent variable $\mathbf{x}_ 1, \mathbf{x}_ 2, \cdots, \mathbf{x}_ T$는 데이터 $\mathbf{x}_ 0 \sim q(\mathbf{x}_ 0)$와 같은 차원을 가지고 있다.

\[p_\theta (\mathbf{x}_ 0) := \int p_\theta (\mathbf{x}_ {0:T}) d\mathbf{x}_ {1:T}\]

이때 joint distribution $p_\theta (\mathbf{x}_ {0:T})$는 reverse process라고 하며, Gaussian transition을 따르는 Markov chain으로 다음과 같이 정의된다. 시작은 $p(\mathbf{x}_ T) = \mathcal{N}(\mathbf{x}_ T ; 0, \mathbf{I})$이다.

\[p_\theta (\mathbf{x}_ {0:T}) := p(\mathbf{x}_ T) \prod _ {t=1} ^T p_\theta(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t) \\ p_\theta (\mathbf{x}_ {t-1} \vert \mathbf{x}_ t) := \mathcal{N}(\mathbf{x}_ {t-1} ; \mu_\theta (\mathbf{x}_ t, t), \Sigma_\theta (\mathbf{x}_ t, t))\]

Diffusion model이 다른 latent variable model과 다른 점은 approximate posterior $q(\mathbf{x}_ {1:T} \vert \mathbf{x}_ 0)$, 즉 forward process 혹은 diffusion process가 다음과 같이 variance schedule $\beta_1, \beta_2, \cdots, \beta_T$에 따라 Gaussian noise를 조금씩 더하는 Markov chain으로 고정되어 있다는 것이다. 참고로 variance $\beta_t$는 reparametrization을 통해 학습시킬 수도 있고 hyperparameter로 설정할 수도 있는데, 본 논문에서는 hyperparameter로 고정하였다.

\[q(\mathbf{x}_ {1:T} \vert \mathbf{x}_ 0) := \prod _ {t=1} ^T q(\mathbf{x}_ t \vert \mathbf{x}_ {t-1}) \\ q(\mathbf{x}_ t \vert \mathbf{x}_ {t-1}) := \mathcal{N}(\mathbf{x}_ t ; \sqrt{1-\beta_t} \mathbf{x}_ {t-1}, \beta_t \mathbf{I})\]

그리고 forward process의 경우 $\mathbf{x}_ 0$로부터 임의의 $t$에 대해 $\mathbf{x}_ t$를 한번에 샘플링할 수 있다는 특성이 있다. 즉, $\mathbf{x}_ 0 \rightarrow \mathbf{x}_ 1 \rightarrow \cdots \rightarrow \mathbf{x}_ t$를 거치는 것이 아니라, $\mathbf{x}_ 0 \rightarrow \mathbf{x}_ t$를 한번에 샘플링할 수 있다는 것이다. 이는 결국 $q(\mathbf{x}_ t \vert \mathbf{x}_ {t-1})$을 여러 번 중첩시켜 얻어낸 식이라고 생각할 수 있다.

\[q(\mathbf{x}_ t \vert \mathbf{x}_ 0) = \mathcal{N}(\mathbf{x}_ t ; \sqrt{\bar{\alpha}_ t} \mathbf{x}_ 0, (1-\bar{\alpha}_ t) \mathbf{I}) \\ \alpha_t := 1-\beta_t, \quad \bar{\alpha}_ t := \prod _ {s=1} ^t \alpha_s\]

2.2. Training

Diffusion model의 학습은 다른 latent variable model과 같이, log-likelihood 자체의 계산이 intractable하기 때문에 variational bound를 최적화하는 방식으로 이루어진다. negative log likelihood에 대한 variational bound는 다음과 같이 정의된다.

\[\begin{aligned} \mathbb{E} \left[ - \log p_\theta(\mathbf{x}_ 0) \right] &\leq \mathbb{E}_ q \left[ - \log \frac{p_\theta(\mathbf{x}_ {0:T})}{q(\mathbf{x}_ {1:T} \vert \mathbf{x}_ 0)} \right] \\ &= \mathbb{E}_ q \left[ - \log p_\theta(\mathbf{x}_ {T}) - \sum_ {t \geq 1} \log \frac{p_\theta(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t)}{q(\mathbf{x}_ t \vert \mathbf{x}_ {t-1})} \right] \\ &:= L \end{aligned}\]

길고 긴 계산(Appendix에 유도 식이 수록되어 있음.)으로 이 식을 정리하면 다음과 같다.

\[\mathbb{E}_ q \left[ D_{KL} \left( q(\mathbf{x}_ T \vert \mathbf{x}_ 0) \Vert p(\mathbf{x}_ T) \right) + \sum_ {t \gt 1} D_{KL} \left( q(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t, \mathbf{x}_ 0) \Vert p_\theta(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t) \right) - \log p_\theta(\mathbf{x}_ 0 \vert \mathbf{x}_ 1) \right]\]

여기서 각각 $L_T, L_{t-1}, L_0$를 다음과 같이 정의한다.

\[\begin{aligned} L_T &:= D_{KL} \left( q(\mathbf{x}_ T \vert \mathbf{x}_ 0) \Vert p(\mathbf{x}_ T) \right) \\ L_{t-1} &:= D_{KL} \left( q(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t, \mathbf{x}_ 0) \Vert p_\theta(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t) \right) \\ L_0 &:= - \log p_\theta(\mathbf{x}_ 0 \vert \mathbf{x}_ 1) \end{aligned}\]

위 방법을 사용하면 high variance를 가지는 Monte Carlo estimates 대신, low variance를 가지는 Rao-Blackwellized fashion으로 계산된다고 한다. 이를 설명하기 위해서는 Forward process posterior인 $q(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t, \mathbf{x}_ 0)$를 알아야 한다. 이는 다음과 같이 $\mathbf{x}_ 0$가 condition으로 있으면 tractable한 Gaussian으로 나타낼 수 있다.

\[q(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t, \mathbf{x}_ 0) = \mathcal{N}(\mathbf{x}_ {t-1} ; \tilde{\mu}_ t (\mathbf{x}_ t, \mathbf{x}_ 0), \tilde{\beta}_ t \mathbf{I})\]

이때 $\tilde{\mu}_ t (\mathbf{x}_ t, \mathbf{x}_ 0), \tilde{\beta}_ t$는 다음과 같이 정의된다.

\[\tilde{\mu}_ t (\mathbf{x}_ t, \mathbf{x}_ 0) := \frac{\sqrt{\bar{\alpha}_ {t-1} \beta_t}}{1 - \bar{\alpha}_ t} \mathbf{x}_ 0 + \frac{\sqrt{\alpha_t} (1- \bar{\alpha}_ {t-1})}{1 - \bar{\alpha}_ t} \mathbf{x}_ t \quad \tilde{\beta}_ t := \frac{1 - \bar{\alpha}_ {t-1} }{1 - \bar{\alpha}_ t} \beta_t\]

따라서 $L_{t-1}$에서 $q(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t, \mathbf{x}_ 0)$와 $p_\theta(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t)$의 KL divergence를 계산하는 것은 Gaussian 간의 KL divergence 계산이기 때문에 Monte Carlo 대신 closed form으로 계산할 수 있다는 것이다. 다른 모델들에 비해 학습이 안정적으로 이루어지겠다고 생각할 수 있다.

3. Diffusion models and denoising autoencoders

3.1. Forward process and $L_T$

\[L_T = D_{KL} \left( q(\mathbf{x}_ T \vert \mathbf{x}_ 0) \Vert p(\mathbf{x}_ T) \right)\]

$L_T$는 VAE로 따지면 일종의 regularization term이다. Posterior로부터 sampling한 $\mathbf{x}_ T$와 prior로부터 sampling한 $\mathbf{x}_ T$의 분포를 유사하게 만들어 $\mathbf{x}_ 0$가 없어도 그럴듯한 샘플링이 되도록 하는 것이다. Prior는 Gaussian이 될 것이므로 위 $L_T$는 forward process로 Gaussian이 잘 생성되도록 하는 term이라고 볼 수 있다. 그런데 posterior $q(\mathbf{x}_ T \vert \mathbf{x}_ 0)$는 다음과 같았다.

\[q(\mathbf{x}_ T \vert \mathbf{x}_ 0) = \mathcal{N}(\mathbf{x}_ T ; \sqrt{\bar{\alpha}_ T} \mathbf{x}_ 0, (1-\bar{\alpha}_ T) \mathbf{I})\]

그런데 본 논문에서는 $\beta_t$를 편의상 모두 고정하기로 하였으므로, $L_T$에는 학습할 수 있는 파라미터가 없다. 따라서 본 논문에서는 $L_T$를 상수로 보고 학습할 때는 무시하였다.

3.2. Reverse process and $L_{1:T-1}$

\[L_{t-1} = D_{KL} \left( q(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t, \mathbf{x}_ 0) \Vert p_\theta(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t) \right)\]

직관적으로 위 식은 reverse process와 forward process 간의 분포가 유사하도록 만드는 식임을 알 수 있다. 우리는 $p_\theta(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t)$를 Gaussian, 즉 $\mathcal{N}(\mathbf{x}_ {t-1}; \mu_\theta (\mathbf{x}_ t, t), \Sigma_\theta (\mathbf{x}_ t, t))$로 정의하였다. 여기서는 정확히 $\mu_\theta (\mathbf{x}_ t, t)$, $\Sigma_\theta (\mathbf{x}_ t, t)$를 어떻게 설정할지에 대해 논의한다.

$\Sigma_\theta (\mathbf{x}_ t, t)$

첫째, $\Sigma_\theta (\mathbf{x}_ t, t) = \sigma_t^2 \mathbf{I}$로 설정하였다. 이때 $\sigma_t$는 학습되지 않는, time dependent constant이다. $\sigma_t$를 설정하는 방법은 자유인데, 여기에서는 두 가지 방법을 제시한다. 먼저 $\mathbf{x}_ 0 \sim \mathcal{N}(0,1)$로부터 학습할 때에는 $\sigma_t^2 = \beta_t$로 두는 것이 가장 최적임이 알려져 있다. 그리고 만약 $\mathbf{x}_ 0$이 정확히 한 점인 경우(deterministic)에는, $\sigma_t^2 = \tilde{\beta}_ t = \frac{1 - \bar{\alpha}_ {t-1} }{1 - \bar{\alpha}_ t} \beta_t$로 두는 것이 가장 최적임이 알려져 있다. 물론 $\mathbf{x}_ 0$가 실제로는 이런 아름다운 분포들에서 샘플링되는 것은 아니지만, 저자들 나름대로 좋은 $\sigma_t$를 설정했다는 정당화 과정이라고 생각하면 될 것 같다. 실험적으로는 둘 다 비슷한 결과를 얻었다고 한다.

$\mu_\theta (\mathbf{x}_ t, t)$

둘째, $\mu_\theta (\mathbf{x}_ t, t)$는 결론부터 말하자면 neural network인데, $L_{t-1}$을 정리하면서 $\mu_\theta$를 parametrization할 것이다.

다시 $q(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t, \mathbf{x}_ 0)$, $p_\theta(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t)$를 생각해보면 다음과 같다.

\[\begin{aligned} q(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t, \mathbf{x}_ 0) &= \mathcal{N}(\mathbf{x}_ {t-1} ; \tilde{\mu}_ t (\mathbf{x}_ t, \mathbf{x}_ 0), \tilde{\beta}_ t \mathbf{I}) \\ p_\theta(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t) &= \mathcal{N}(\mathbf{x}_ {t-1} ; \mu_\theta (\mathbf{x}_ t, t), \sigma_t^2 \mathbf{I}) \end{aligned}\]

$L_{t-1}$은 이 두 Gaussian 간의 KL divergence이므로, 다음과 같이 쓸 수 있다. 이때 $C$는 $\theta$와 관련없는 나머지 항들이다.

\[L_{t-1} = \mathbb{E}_ q \left[ \frac{1}{2 \sigma_t^2} \Vert \tilde{\mu}_ t (\mathbf{x}_ t, \mathbf{x}_ 0) - \mu_\theta (\mathbf{x}_ t, t) \Vert ^2 \right] + C\]

해석하자면 $\mu_\theta$는 forward process posterior의 평균 $\tilde{\mu}_ t$와 가까워지도록 학습된다. 이때, $\mathbf{x}_ t$를 reparametrization trick을 이용하면 다음과 같이 나타낼 수 있다.

\[\mathbf{x}_ t (\mathbf{x}_ 0, \epsilon) = \sqrt{\bar{\alpha}_ t} \mathbf{x}_ 0 + \sqrt{1 - \bar{\alpha}_ t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}) \\ \because q(\mathbf{x}_ t \vert \mathbf{x}_ 0) = \mathcal{N}(\mathbf{x}_ t ; \sqrt{\bar{\alpha}_ t} \mathbf{x}_ 0, (1-\bar{\alpha}_ t) \mathbf{I}) \\\]

따라서, $\mathbf{x}_ 0$를 $\mathbf{x}_ t, \epsilon$으로 다음과 같이 나타낼 수 있다.

\[\mathbf{x}_ 0 (\mathbf{x}_ t, \epsilon) = \frac{1}{\sqrt{\bar{\alpha}_ t}} \left( \mathbf{x}_ t (\mathbf{x}_ 0, \epsilon) - \sqrt{1 - \bar{\alpha}_ t} \epsilon \right)\]

이 값을 $\tilde{\mu}_ t (\mathbf{x}_ t, \mathbf{x}_ 0)$에 대입하여 정리하고, $L_{t-1}$에 대입하면 다음과 같다.

\[L_{t-1} = \mathbb{E}_ {\mathbf{x}_ 0, \epsilon} \left[ \frac{1}{2 \sigma_t^2} \left\Vert \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_ t (\mathbf{x}_ 0, \epsilon) - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_ t}} \epsilon \right) - \mu_\theta (\mathbf{x}_ t(\mathbf{x}_ 0, \epsilon), t) \right\Vert ^2 \right] + C\]

$\mathbf{x}_ t$는 input으로 넣을 수 있는 값이므로, 다음과 같이 parametrization할 수 있다.

\[\mu_\theta (\mathbf{x}_ t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_ t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_ t}} \epsilon_\theta (\mathbf{x}_ t, t) \right)\]

즉, $\mu_\theta$를 위와 같이 나타내면 $\mathbf{x}_ t$의 noise $\epsilon$을 추정하는 $\epsilon_\theta$를 훈련시키는 것과 동일하게 된다. 이때 $\mathbf{x}_ {t-1} \sim p_\theta(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t)$를 샘플링하는 식을 다음과 같이 나타낼 수 있다.

\[\begin{aligned} \mathbf{x}_ {t-1} &= \mu_\theta (\mathbf{x}_ t, t) + \sigma_t \mathbf{z}, \quad \mathbf{z} \sim \mathcal{N}(0, \mathbf{I}) \\ &= \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_ t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_ t}} \epsilon_\theta (\mathbf{x}_ t, t) \right) + \sigma_t \mathbf{z} \end{aligned}\]

이 식에서 $\epsilon_\theta$가 data density의 gradient를 학습한다고 생각하면 Langevin dynamics의 식과 비슷한 형태임을 알 수 있다. 그리고 지금까지의 결과를 $L_{t-1}$에 대입하면 다음과 같다.

\[L_{t-1} = \mathbb{E}_ {\mathbf{x}_ 0, \epsilon} \left[ \frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1- \bar{\alpha}_ t)} \Vert \epsilon - \epsilon_\theta (\sqrt{\bar{\alpha}_ t} \mathbf{x}_ 0 + \sqrt{1 - \bar{\alpha}_ t} \epsilon, t) \Vert ^2 \right] + C\]

$\epsilon_\theta (\sqrt{\bar{\alpha}_ t} \mathbf{x}_ 0 + \sqrt{1 - \bar{\alpha}_ t} \epsilon, t)$는 $\epsilon_\theta (\mathbf{x}_ t, t)$와 동일한 것으로, reparametrization trick을 사용했음을 명시하는 표현이다. 이 식을 잘 보면 $t$에 따른 multiple noise level에서의 denoising score matching을 학습하는 식으로 볼 수 있음을 알 수 있다.

3.3. Data scaling, reverse process decoder, and $L_0$

\[L_0 = - \log p_\theta(\mathbf{x}_ 0 \vert \mathbf{x}_ 1)\]

VAE 기준으로, 일종의 reconstruction loss이다. 저자들은 이미지 데이터 $\lbrace 0, 1, \cdots, 255 \rbrace$를 scaling하여 $[-1, 1]$로 만들었다. 이때 $p_\theta(\mathbf{x}_ 0 \vert \mathbf{x}_ 1)$의 마지막 과정은 다시 이미지의 log-likelihood를 discrete하게 얻어야 하므로, 기술적으로 다음과 같은 discrete decoder를 사용하였다. 서술상으로는 어려워 보이지만 실제로는 픽셀 $x_0^i$ 값의 앞뒤로 $1/255$만큼을 적분하여 각 픽셀의 log-likelihood를 구하겠다는 것이다. 저자들은 이것 대신 conditional autoregressive model과 같은 discrete decoder를 사용할 수도 있다고 한다. 여기서 $D$는 data dimension이다.

\[p_\theta(\mathbf{x}_ 0 \vert \mathbf{x}_ 1) = \prod _ {i=1} ^{D} \int_ {\delta_{-} (x_0^i)} ^ {\delta_{+} (x_0^i)} \mathcal{N} (x; \mu_\theta^i (\mathbf{x}_ 1, 1), \sigma_1^2) dx \\ \delta_{+} (x) = \begin{cases} \infty & \text{if } x = 1 \\ x + \frac{1}{255} & x \lt 1 \end{cases}, \quad \delta_{-} (x) = \begin{cases} -\infty & \text{if } x = -1 \\ x - \frac{1}{255} & x \gt -1 \end{cases}\]

3.4. Simplified training objective

이론적으로는 지금까지 정리한 variational bound를 loss로 사용해도 되지만, 실제로는 다음과 같은 simplified training objective를 사용하는 경우 sample quality가 더 좋았다고 한다.

\[L_{\text{simple}}(\theta) := \mathbb{E}_ {t, \mathbf{x}_ 0, \epsilon} \left[ \Vert \epsilon - \epsilon_\theta (\sqrt{\bar{\alpha}_ t} \mathbf{x}_ 0 + \sqrt{1 - \bar{\alpha}_ t} \epsilon, t) \Vert ^2 \right]\]

위 식은 $L_{t-1}$과 닮아 있다. 아래 식을 보면, simplified training objective는 $L_{t-1}$에서 weight을 제거하여 간단하게 한 버전이라는 것을 알 수 있다. 이는 NCSN에서의 denoising score matching과 같은 식이다.

논문에서는 $L_{\text{simple}}$를 다음과 같이 해석한다. 오히려 standard variational bound $\sum L_{t-1}$에 비하면 weight이 달라졌으므로 일종의 weighted variational bound가 된 것으로 해석할 수 있고, 이는 원래보다 작은 $t$에 대해 down-weight을 주는 효과가 있다고 한다. 즉, 오히려 $L_{\text{simple}}$을 사용하면 noise가 작은 $t$에서 denoising을 학습하는 것보다 noise가 큰 $t$에서 denoising을 학습하는 것을 더 중요하게 학습할 수 있게 된다는 것이다.

실제 학습 및 샘플링 알고리즘은 다음과 같다.

4. Experiment

저자들은 $T=1000$을 사용했고, $\beta_1 = 10^{-4}, \beta_T = 0.02$가 되도록 설정하였으며 그 사이는 linear하게 증가하는 것으로 하였다. 이 hyperparameter는 $L_T$를 충분히 작게 하는 선에서 최대한 작은 값으로 두었다. 만약 $L_T$가 충분히 작지 않으면, standard Gaussian에서 generation이 잘 되지 않을 것이고, 그렇다고 $\beta$를 너무 크게 하면 denoising이 어려울 것이다. 따라서 이를 적당히 leverage하는 값으로 정한 것이다.

Reverse process는 U-Net backbone으로 모델링했고, NCSN과 같이 time parameter를 넣어주기 위해 Transformer sinusoidal PE를 사용하였다. $16 \times 16$ feature map resolution 수준에서 self-attention도 사용하였다.

4.1. Sample quality

CIFAR-10에서의 quantitative result를 보면, IS(Inception Score)는 대부분의 모델보다 높았고 FID(Frechet Inception Distance)는 unconditional model에서 SOTA를 달성했다. Training set과 test set에서의 NLL도 비슷하여 overfitting이 일어나지 않았음을 볼 수 있다. 그러나 NLL(Negative Log Likelihood)는 다른 모델들보다 높았다. 그 이유를 저자들은 4.3절에서 정당화한다. 아래는 qualitative result이다.

4.2. Reverse process parameterization and training objective ablation

저자들은 reverse process parameterization과 training objective에 대한 실험을 진행하였다. 즉, $\mu_\theta$를 $\epsilon_\theta$로 바꾼 것이 효과가 있는지, 그리고 $L_{t-1}$을 $L_{\text{simple}}$로 바꾼 것이 효과가 있는지 실험한 것이다. 그리고 추가로 $\Sigma_\theta(\mathbf{x}_ t, t)$를 본 논문에서는 $\sigma_t^2 \mathbf{I}$로 학습되지 않는 time dependent constant, 즉 fixed isotropic $\Sigma$로 두었는데, 이를 학습되는 파라미터로 바꾼 것이 효과가 있는지도 실험하였다.

결과로 $\mu_\theta$를 사용한 경우 simplified training objective를 사용한 경우 학습이 되지 않았고, 일반적인 variational bound인 $L$을 사용해야 했다. 그리고 $\Sigma$를 learnable parameter로 둘 경우에는 오히려 학습이 불안정했다. 따라서 본 논문에서 제시한 parameterization과 training objective가 가장 효과적임을 알 수 있다.

4.3. Progressive coding

Progressive lossy compression

DDPM의 sample quality는 좋은데도 불구하고 NLL, 즉 lossless codelength가 다른 모델들보다 높은 이유를 rate-distortion theory로 설명한다. 잘 모르는 정보이론 부분이라 이해한 부분만 서술해본다. 먼저 다음과 같은 $\hat{\mathbf{x}}_ 0$를 정의한다.

\[\hat{\mathbf{x}}_ 0 = (\mathbf{x}_ t - \sqrt{1 - \bar{\alpha}_ t} \epsilon_\theta(\mathbf{x}_ t)) / \sqrt{\bar{\alpha}_ t}\]

위 식은 다음 식을 이용한 것이다.

\[q(\mathbf{x}_ t \vert \mathbf{x}_ 0) = \mathcal{N}(\mathbf{x}_ t ; \sqrt{\bar{\alpha}_ t} \mathbf{x}_ 0, (1-\bar{\alpha}_ t) \mathbf{I})\]

즉 $\hat{\mathbf{x}}_ 0$는 $\mathbf{x}_ t$까지의 정보만 있을 때 $\mathbf{x}_ 0$를 추정한 것이다. 이때 $\mathbf{x}_ 0$를 얼마나 잘 추정했는지, 즉 정보의 왜곡이 얼마정도인지를 distortion으로 정의하는데, distortion은 RMSE, 즉 $\sqrt{\Vert \mathbf{x}_ 0 - \hat{\mathbf{x}}_ 0 \Vert ^2 / D}$로 정의하였다. 따라서 reverse process가 진행될수록 distortion이 감소할 것이다. 그리고 lossless codelength를 rate라는 개념으로 표현하는데, 이는 일종의 정보량으로, $\mathbf{x}_ t$에서 $t$가 작아질수록 실제로 $\mathbf{x}_ 0$에 대해 알게 되는 정보가 늘어나므로 점점 reverse process가 진행될수록 rate가 증가할 것이다. 이를 아래와 같이 그래프로 나타내었다.

그리고 이를 종합하여 rate-distortion plot으로 나타내면 가장 오른쪽 그래프와 같다. 이 그래프로부터 알 수 있는 것은 low-rate region에서 이미 distortion이 빠르게 감소했는데, 나머지 부분에도 rate가 할당되어 있다는 것이다. 즉, lossless codelength의 많은 부분이 사람이 인지하지 못할 정도의 distortion을 설명하는 데 사용된다는 것이다. 다른 말로 하면 샘플의 퀄리티가 높음에도 NLL(negative log likelihood)이 높은 이유는 사람이 인지하지 못할 정도의 차이(imperceptible distortion) 때문이기 때문에 큰 문제가 없다는 것이다. 저자들은 이를 보고 DDPM 모델은 lossy compressor가 될 수밖에 없는 inductive bias가 있다고 말한다.

Progressive generation

여기서는 위에서 정의한 $\hat{\mathbf{x}_ 0}$를 시각화하였다. Large scale image features가 먼저 나타나고 details가 나중에 더해지는 것을 볼 수 있다.

그리고 stochastic prediction $\mathbf{x}_ 0 \sim p_\theta(\mathbf{x}_ 0 \vert \mathbf{x}_ t)$를 시각화하였다. $t$가 클 때는 large scale feature만 동일하고, 더 많은 다양성이 있음을 알 수 있다.

Connection to autoregressive decoding

Variational bound 식을 잘 변형하면 다음과 같이 쓸 수 있다고 한다.

\[L = D_{KL} (q(\mathbf{x}_ T) \Vert p(\mathbf{x}_ T)) + \mathbb{E}_ q \left[\sum_ {t \geq 1} D_{KL}(q(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t) \Vert p_\theta(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t)) \right] + H(\mathbf{x}_ 0)\]

이때 $D_{KL}(q(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t) \Vert p_\theta(\mathbf{x}_ {t-1} \vert \mathbf{x}_ t))$ 식을 보면 $p_\theta$는 $t+1, \cdots, T$의 정보로부터 $\mathbf{x}_ t$를 autoregressive하게 예측하는 것과 같다. 즉, reverse process는 autoregressive decoding으로 해석할 수 있다.

4.4. Interpolation

Diffusion model이 latent variable model이므로, latent space에서의 interpolation이 가능해야 한다. 먼저 $\mathbf{x}_ 0, \mathbf{x}_ 0 ^ \prime \sim q(\mathbf{x}_ 0)$를 얻고, encoder를 통해 $\mathbf{x}_ t, \mathbf{x}_ t ^ \prime \sim q(\mathbf{x}_ t \vert \mathbf{x}_ 0)$을 얻는다. 이를 interpolation하여 $\bar{\mathbf{x}}_ t = (1-\lambda) \mathbf{x}_ t + \lambda \mathbf{x}_ t ^ \prime$를 얻는다. 이를 다시 decoder에 넣어 $\bar{\mathbf{x}}_ 0 \sim p(\mathbf{x}_ 0 \vert \bar{\mathbf{x}}_ t)$를 얻는다. 이를 시각화하면 다음과 같다.

위 이미지는 $t=500$일 때의 설정이고, Appendix를 보면 다양한 $t$에서의 interpolation 결과를 볼 수 있다. 따라서 fine granularity, coarse granularity에서의 interpolation이 모두 가능하다.

💡 Summary

Diffusion의 기본이 되는 중요한 논문이다.
기존의 diffusion model을 재해석하여 DDPM으로 높은 sample quality 성능을 보였다.
수식을 통해 diffusion model이 score-based model과 동일하며, autoregressive decoding과도 연결됨을 보였다.
높은 NLL 문제를 progressive coding으로 정당화하였다.

📃 Reference

[21’ ICLR] DDPM: Denoising Diffusion Probabilistic Models

I'm rubatoyeong