[CS236n] 8. Score Based Models (1)

Date: 2024.03.18 Updated: 2024.03.18

카테고리: CS236n

태그: Diffusion Generation

💡 이 글은 『2024 YAI 봄전반기 생성모델팀』으로 진행되었으며, CS236n Fall 2023를 따라 정리했습니다.

1. Score-based Models

1.1. Introduction

Deep energy-based models(EBMs)에서 다음과 같이 score function을 정의했다.

\[s_{\theta}(x) = \nabla_x \log p_{\theta}(x) = \nabla_x f_{\theta}(x)\]

그리고 식의 전개를 통해, score matching loss를 다음과 같이 계산했다.

\[\mathcal{L}_{\text{SM}}(\theta) = \mathbb{E}_ {x \sim p_{data}} \left[ \frac{1}{2} \Vert \nabla_x \log p_{\theta}(x) \Vert ^ 2 _ 2 + tr \left( \nabla_x ^2 \log p_{\theta}(x) \right) \right]\]

이러한 방법은 사실 EBM에만 한정되는 것이 아니다. 그 외에도 probability를 계산할 수 있는 autoregrssive model, flow-based model 등에도 적용할 수 있다.

그렇다면 이것들을 모두 포함하는, general model은 무엇인가? 이를 score-based model이라고 하고, score-based model의 특징은 gradient의 vector field를 모델링한다는 것이다. 즉, score function $s_{\theta}(x) \approx \nabla_x \log p_{\theta}(x)$를 목표로 한다.

그러나 우리에게는 $p_{\theta}(x)$에서 샘플링한 데이터 $n$개만 주어져 있다. 이로부터 어떻게 $\nabla_x \log p_{\theta}(x)$를 추정할 수 있을까? (여기서 i.i.d.는 independent and identically distributed의 약자로, 생성모델에서의 데이터셋의 기본 가정이다.)

1.2. Score Matching

지금까지의 상황을 정리해보자.

Given: i.i.d. samples $\lbrace x_1, x_2, \cdots, x_n \rbrace \sim p_{data}(x)$
Task: $\nabla_x \log p_{\theta}(x)$ 추정
Score Model: $s_{\theta}(x): \mathbb{R}^ d \rightarrow \mathbb{R}^ d$
Goal: $s_{\theta}(x) \approx \nabla_x \log p_{\theta}(x)$

따라서 목표를 달성하기 위해서는 vector field $s_{\theta}(x)$와 $\nabla_x \log p_{\theta}(x)$를 비교할 수 있는 방법이 필요하다. 가장 쉽게 생각해볼 수 있는 방법은 Average Euclidean Distance를 측정하는 것이다.

이는 다음 식을 최소화하겠다는 것이고, 아래 식은 EBM에서 본 Fisher divergence와 동일하다.

\[\frac{1}{2} \mathbb{E}_ {x \sim p_{data}} \left[ \Vert \nabla_x \log p_{\theta}(x) - s_{\theta}(x) \Vert ^ 2 _ 2 \right]\]

그리고 $\nabla_x \log p_{\theta}(x)$를 계산할 수 없으므로, score matching을 통해 다음과 같이 나타냈다.

\[\mathbb{E}_ {x \sim p_{data}} \left[ \frac{1}{2} \Vert s_{\theta}(x) \Vert ^ 2 _ 2 + tr \left( \nabla_x s_{\theta}(x) \right) \right]\]

그러나 score matching에서 $tr \left( \nabla_x s_{\theta}(x) \right)$ 항은 굉장히 계산이 오래 걸린다. 예를 들어 다음과 같은 score function을 생각해보자.

그리고 다음 3개의 $tr \left( \nabla_x s_{\theta}(x) \right)$를 계산해야 한다.

예를 들어, $\frac {\partial s_{\theta, 1}(x)}{\partial x_1}$을 계산하기 위해서는 다음과 같은 backpropagation을 거친다.

이를 $O(d)$번 (d: dimension of $x$) 진행하여 $tr \left( \nabla_x s_{\theta}(x) \right)$를 계산한 뒤에는, 이 값을 다시 loss에 넣고 backpropagation을 진행해야 한다. 이러한 계산은 굉장히 비효율적이다. 이를 dimension이 늘어난 고차원 데이터에 대해서는 진행하기 어렵다고 해서 “not scalable”하다고 한다. 이를 해결하기 위하여 denoising score matching과 sliced score sampling이 제안되었다.

2. Denoising Score Matching

2.1. Perturbed distribution

다음과 같은 perturbed distribution $q_\sigma (\tilde{x})$를 정의하자.

\[q_\sigma (\tilde{x} \vert x) = \mathcal{N}(\tilde{x} \vert x, \sigma^2 I) \quad q_\sigma (\tilde{x}) = \int p(x) q_\sigma (\tilde{x} \vert x) dx\]

나중에 결론을 통해 알게 되겠지만, $\nabla_x \log p(x)$를 추정하는 것보다 $\nabla_{\tilde{x}} \log q_\sigma (\tilde{x})$를 추정하는 것이 더 쉽다. 그렇다면 $\nabla_{\tilde{x}} \log q_\sigma (\tilde{x})$를 추정하고, noise level $\sigma$가 충분히 작다면 $q_\sigma (\tilde{x}) \approx p(x)$이므로 좋은 추정이 된다.

따라서 우리의 목표는 다음 식을 최소화하는 것이다.

\[\frac{1}{2} \mathbb{E}_ {\tilde{x} \sim q_\sigma} \left[ \Vert \nabla_{\tilde{x}} \log q_\sigma (\tilde{x}) - s_{\theta}(\tilde{x}) \Vert ^ 2 _ 2 \right]\]

위 식은 전개 및 수학적인 식 정리를 통해 다음과 같이 나타낼 수 있다. (과정은 생략하였으나, 강의자료에는 수록되어 있으므로 궁금하신 분들은 CS236n Lecture 13 강의자료를 참고해주시기 바랍니다.)

\[\frac{1}{2} \mathbb{E}_ {x \sim p_{data}(x), \tilde{x} \sim q_\sigma (\tilde{x} \vert x)} \left[ \Vert s_ \theta (\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma (\tilde{x} \vert x) \Vert ^ 2 _ 2 \right] + \text{const.}\]

그런데 $q_\sigma (\tilde{x} \vert x) = \mathcal{N}(\tilde{x} \vert x, \sigma^2 I)$, 즉 simple gaussian distribution이므로, $\nabla_{\tilde{x}}$를 취하면 다음과 같다. 절대 구하지 못할 것 같았던 $\nabla_{x} \log p(x)$를 여러 변환 과정을 통해 다음과 같이 간단한 식으로 나타낸 것이다.

\[\nabla_{\tilde{x}} \log q_\sigma (\tilde{x} \vert x) = - \frac{\tilde{x} - x}{\sigma^2}\]

이러한 방법의 장점은 고차원의 데이터에 대해서도 최적화가 효과적이라는 것이다. 그러나 단점은 noise-free score function $s_{\theta}(x)$를 추정하는 것이 아니라, noisy score function $s_{\theta}(\tilde{x})$를 추정한다는 것이다.

2.2. Denoising Score Matching

실제 과정은 다음과 같다.

Datapoints $\lbrace x_1, x_2, \cdots, x_n \rbrace \sim p_{data}(x)$를 샘플링한다.
Perturbed datapoints $\lbrace \tilde{x}_ 1, \tilde{x}_ 2, \cdots, \tilde{x}_ n \rbrace \sim q_\sigma (\tilde{x})$를 샘플링한다. 이때 $\tilde{x}_ i \sim q_\sigma (\tilde{x} \vert x_i)$이다.
Denoising score matching loss를 다음과 같이 추정한다.

\[\frac{1}{2n} \sum_ {i=1} ^ {n} \left[ \Vert s_ \theta (\tilde{x}_ i) - \nabla _ {\tilde{x}} \log q_\sigma (\tilde{x}_ i \vert x_i) \Vert ^ 2 _ 2 \right]\]

만약 perturbation이 Gaussian이라면 다음과 같이 나타낼 수 있다.

\[\frac{1}{2n} \sum_ {i=1} ^ {n} \left[ \left\Vert s_ \theta (\tilde{x}_ i) + \frac{\tilde{x}_ i - x_ i}{\sigma^2} \right\Vert ^ 2 _ 2 \right]\]

이때 충분히 작은 $\sigma$를 선택해야 원래 분포를 잘 추정하는 식이 될 것이다.

2.3. Tweedie’s Formula

이제 이 식의 의미를 해석해보자. Gaussian perturbation에 대하여 다음과 같이 score matching loss를 나타낼 수 있다.

\[\begin{aligned} & \frac{1}{2} \mathbb{E}_ {q_\sigma (\tilde{x})} \left[ \Vert \nabla_{\tilde{x}} \log q_\sigma (\tilde{x}) - s_{\theta}(\tilde{x}) \Vert ^ 2 _ 2 \right] \\ =& \frac{1}{2} \mathbb{E}_ {p(x)} \mathbb{E}_ {q_\sigma (\tilde{x} \vert x)} \left[ \left\Vert \frac{x - \tilde{x}}{\sigma^2} - s_{\theta}(\tilde{x}) \right\Vert ^ 2 _ 2 \right] \\ \end{aligned}\]

예를 들어 왼쪽 사진을 $x$, 오른쪽 사진을 $\tilde{x} = x + \text{noise}$라고 하면 $s_{\theta}(\tilde{x})$는 $x$에 더해진 noise를 예측(즉, $\tilde{x} - x$를 예측)하는 것이 된다. 이러한 관점에서 denoising score matching은 denoising 과정으로 해석할 수 있다.

Tweedie’s formula는 최적의 denoising strategy를 찾는 것은 gradient, 즉 $\nabla _ {\tilde{x}} \log q_\sigma (\tilde{x})$를 따라가는 것이라는 의미의 식으로 다음과 같이 나타낼 수 있다.

\[\text{Denoising}: \tilde{x} + \sigma ^ 2 \nabla _ {\tilde{x}} \log q_\sigma (\tilde{x})\]

Denoising score matching은 이 식과 일맥상통한다. 먼저 Posterior를 다음과 같이 정의할 수 있고,

\[p(x \vert \tilde{x}) \propto p(x) q_\sigma (\tilde{x} \vert x)\]

$q_\sigma (\tilde{x})$는 다음과 같이 정의되었으므로,

\[q_\sigma (\tilde{x}) = \int p(x) q_\sigma (\tilde{x} \vert x) dx\]

$x \sim p(x \vert \tilde{x})$ 샘플링은 일종의 $\tilde{x} \sim q_\sigma (\tilde{x})$ 샘플링과 같고, 따라서 Tweedie’s formula를 다음과 같이 나타낼 수 있다.

\[\begin{aligned} \mathbb{E}_ {x \sim p(x \vert \tilde{x})} \left[ x \right] &= \tilde{x} + \sigma ^ 2 \nabla _ {\tilde{x}} \log q_\sigma (\tilde{x}) \\ &\approx \tilde{x} + \sigma ^ 2 s_{\theta}(\tilde{x}) \end{aligned}\]

다시 한번 $s_{\theta}(\tilde{x})$는 $x$에 더해진 noise를 예측하는 최적의 denoising strategy라는 것을 알 수 있다.

3. Sliced Score Matching

3.1. Projection onto random directions

Sliced score matching에서의 컨셉은 $\nabla_x \log p_{data}(x)$와 $s_{\theta}(x)$를 직접 비교하는 것은 어려우니, 이를 랜덤한 1차원으로 projection하여 비교하자는 것이다.

따라서 랜덤한 벡터 $v$를 설정하고, 그 벡터의 방향으로 projection한 값들을 비교하는 sliced Fisher divergence를 사용한다.

\[\frac{1}{2} \mathbb{E}_ {v \sim p_v} \mathbb{E}_ {x \sim p_{data}} \left[ \left( v^T \nabla_x \log p_{data}(x) - v^T s_{\theta}(x) \right) ^ 2 \right]\]

이를 전개하여 정리하면 다음과 같이 된다.

\[\mathbb{E}_ {v \sim p_v} \mathbb{E}_ {x \sim p_{data}} \left[ v^T \nabla_ x s_\theta(x) v + \frac{1}{2} \left( v^T s_\theta(x) \right) ^ 2 \right]\]

여기서 잠시 기존의 score matching loss와 비교해보자.

\[\mathbb{E}_ {x \sim p_{data}} \left[ \frac{1}{2} \Vert s_{\theta}(x) \Vert ^ 2 _ 2 + tr \left( \nabla_x s_{\theta}(x) \right) \right]\]

여기서는 $tr \left( \nabla_x s_{\theta}(x) \right)$ 이 문제였는데(즉, not scalable했다), 그렇다면 sliced score matching에서는 $v^T \nabla_ x s_\theta(x) v$가 계산에 있어 문제가 있지 않을까? 이를 확인해보자.

3.2. Jacobian-vector products

먼저 $v^T \nabla_ x s_\theta(x) v = v^T \nabla_ x (v^T s_\theta(x))$이다. 그리고 score matching 때와 같은 예시 $s_\theta(x)$를 가져오자.

먼저 $v^T s_\theta(x)$는 다음과 같이 계산하여 one dimensional projection이 가능하고,

Backpropagation도 한 번에 가능하다.

따라서 score matching과는 달리 sliced score sampling은 scalable하다. 대신, 이러한 과정 때문에 denoising score matching보다는 약간 느리다.

3.3. Sliced Score Matching

실제 과정은 다음과 같다.

Datapoints $\lbrace x_1, x_2, \cdots, x_n \rbrace \sim p_{data}(x)$를 샘플링한다.
Random directions $\lbrace v_1, v_2, \cdots, v_n \rbrace \sim p_v$를 샘플링한다.
Sliced score matching loss를 다음과 같이 추정한다.

\[\frac{1}{n} \sum_ {i=1} ^ {n} \left[ v_i^T \nabla_ x s_\theta(x_i) v_i + \frac{1}{2} \left( v_i^T s_\theta(x_i) \right) ^ 2 \right]\]

보통 projection distribution $p_v$는 Gaussian이나 Racemacher를 사용하고, 성능을 향상시키기 위해서는 datapoint 당 projection sampling 수를 늘리면 된다.

실제 결과를 보면, SM(score matching)에 비해 SSM(sliced score matching)은 data dimension이 증가해도 속도가 크게 변하지 않는다는 것을 알 수 있다. 성능은 SM에 비해 크게 뒤쳐지지 않는다.

4. Langevin Dynamics

Denoising score matching과 sliced score matching은 score-based model의 score function을 추정하는 방법이다. 이러한 방법으로 score function을 추정하였다면, 이제 실제 샘플링을 어떻게 할 것인가에 대해 생각해보자.

4.1. Langevin dynamics sampling

가장 쉬운 방법은 랜덤하게 datapoint를 만들고, 그 point가 score function을 따라 high likelihood가 있을 만한 곳으로 움직이게 하는 것이다. 식으로 나타내면 다음과 같다.

\[\tilde{x}_ {t+1} \leftarrow \tilde{x}_ t + \frac{\epsilon}{2} s_{\theta}(\tilde{x}_ t)\]

그러나 이러한 방법은 생성의 다양성이 떨어지기에 random noise $z_t$를 추가하여 다음과 같은 Langevin MCMC를 사용한다. 이때 $z_t \sim \mathcal{N}(0, I)$이다.

\[\tilde{x}_ {t+1} \leftarrow \tilde{x}_ t + \frac{\epsilon}{2} s_{\theta}(\tilde{x}_ t) + \sqrt{\epsilon} z_t\]

여기서 $\epsilon \rightarrow 0$, $T \rightarrow \infty$로 가면, $\tilde{x} _T \sim p(x)$가 된다. 굉장히 합리적인 것처럼 보이지만, 실제 결과는 다음과 같다.

실제로는 Langevin dynamics sampling이 작동하지 않는다. 그 이유는 무엇일까?

4.2. Pitfalls

그 이유는 manifold hypothesis로 설명할 수 있다. Manifold hypothesis는 high-dimensional data는 대부분 low-dimensional manifold에 분포되어 있어 남은 공간은 비어있다는 가설이다.

예를 들어 CIFAR-10 데이터셋의 경우 원래 $32 \times 32 \times 3 = 3072$차원이지만, PCA(Principal Component Analysis)를 통해 $2165$차원으로 축소하여도 거의 변화가 없다.

따라서 data density가 0인 공간에서는 $\nabla _x \log p(x)$를 계산하는 것이 무의미해진다. 아래 그림에서도 data density가 있는 곳 근처에서는 score function이 잘 작동하지만, 그렇지 않은 곳에서는 정확하지 않은 것을 알 수 있다. 즉, Langevin MCMC는 low density region에서는 잘 작동하지 않는다.

두 번째 문제는 Langevin dynamics은 data modes 간의 mixing이 잘 되지 않는다는 것이다. 예를 들어 다음과 같은 두 distribution이 있다고 하자. 이때 $\mathcal{A} \cap \mathcal{B} = \varnothing$이다.

\[p_{data}(x) = \begin{cases} \pi p_1(x) & x \in \mathcal{A} \\ (1 - \pi) p_2(x) & x \in \mathcal{B} \end{cases}\]

따라서 $\nabla_x \log p_{data}(x) $는 다음과 같이 나타낼 수 있다.

\[\nabla_x \log p_{data}(x) = \begin{cases} \nabla_x \log p_1(x) & x \in \mathcal{A} \\ \nabla_x \log p_2(x) & x \in \mathcal{B} \end{cases}\]

따라서 학습한 score function은 각 모드의 weighting $\pi$를 전혀 고려하지 않는다. 즉, Langevin sampling은 $\pi$를 반영하지 못한다. 그 결과로 실제 i.i.d. sample에 비해 Langevin dynamics sample은 data modes 간의 mixing이 잘 되지 않는다.

이러한 문제를 해결하고 나면 다음과 같이 좋은 생성 결과를 얻을 수 있다(출처). 해결 방법에 대해서는 다음 장에서 다루겠다.

5. Summary

[CS236n] 8. Score Based Models (1)에서는 score-based model에 대해 다루었다. score-based model은 gradient의 vector field를 모델링한다는 것이 특징이다. 이를 통해 score function을 추정하는 방법으로 denoising score matching과 sliced score matching을 다루었다. 그리고 Langevin dynamics sampling을 통해 실제 샘플링을 어떻게 할 것인가에 대해 다루었다. 그러나 Langevin dynamics sampling은 manifold hypothesis와 data modes 간의 mixing 문제로 인해 작동하지 않는다는 것을 알 수 있었다. 이러한 문제를 해결하기 위해 다음 장에서는 annealed Langevin dynamics, 더 나아가서 SDE(Stochastic Differential Equation)에 대해 다룬다.

I'm rubatoyeong