[CS236n] 7. Energy Based Models

Date: 2024.03.17 Updated: 2024.03.18

카테고리: CS236n

💡 이 글은 『2024 YAI 봄전반기 생성모델팀』으로 진행되었으며, CS236n Fall 2023를 따라 정리했습니다.

1. Introduction

지금까지 공부했던 모델들 중 autoregressive models, latent variable models, normalizing flow model은 모델의 구조가 제한적이었고, GAN은 학습이 어려웠다. 반면, EBM(energy-based models)은 모델의 구조가 자유롭고, 학습이 안정적이다. 또한, 샘플의 퀄리티가 좋고, 여러 응용이 가능하다. 이번 강의에서는 EBM에 대해 알아본다.

먼저 확률분포 $p_\theta(x)$를 다시 한번 살펴보자. 확률분포는 두 가지 조건을 만족해야 한다.

non-negative: $p_\theta(x) \geq 0$
sum-to-one: $\sum _ x p_\theta(x) = 1$

그렇다면 함수 $g_\theta(x)$를 이용해 확률분포를 만들기 위해서는 $g_\theta(x) \geq 0$이도록 함수를 만든 뒤, 다음과 같이 $p_\theta(x)$를 정의하면 된다.

\[p_\theta(x) = \frac{1}{Z(\theta)} g_\theta(x) = \frac{1}{\int g_\theta(x)dx} g_\theta(x) = \frac{1}{\text{Volume}(g_\theta)} g_\theta(x)\]

만약 $g_\theta(x)$를 다음과 같이 설정하면, 우리는 volume을 analytic하게 계산할 수 있다. (즉, 식으로 나타낼 수 있다.)

Gaussian $g_ {\mu, \sigma} (x) = e^ {-\frac{(x - \mu)^2}{2\sigma^2}}$ $\rightarrow$ $Z(\theta) = \sqrt{2\pi\sigma^2}$
Exponential $g_{\lambda}(x) = \lambda e^ {-\lambda x}$ $\rightarrow$ $Z(\theta) = \frac{1}{\lambda}$
Exponential Family $g_{\theta}(x) = h(x) \exp \lbrace \theta \cdot T(x) \rbrace $ $\rightarrow$ $Z(\theta) = \exp \lbrace A(\theta) \rbrace$, 이때 $A(\theta) = \log \int h(x) \exp \lbrace \theta \cdot T(x) \rbrace dx$이다.

Exponential family가 중요한 이유는 Gaussian, Poisson, Exponential, Bernoulli 등 다양한 분포들이 Exponential family에 속하기 때문이다. 그러나 이러한 분포들은 실제 데이터의 분포를 묘사하기에는 너무나 제한적이다. 우리는 energy-based model이 더 자유로운 분포를 만들기를 원하고, 그렇다면 $Z(\theta)$의 analytic solution을 포기해야 한다.

2. Energy Based Models

2.1. Definition

EBM의 확률분포 $p_\theta(x)$는 다음과 같이 정의된다.

\[p_\theta(x) = \frac{1}{Z(\theta)} \exp \left( f_\theta(x) \right)\]

이때 $Z(\theta)$는 volume, normalization constant, 혹은 partition function라고 하며 다음과 같이 정의된다.

\[Z(\theta) = \int \exp \left( f_\theta(x) \right) dx\]

$f_\theta(x) ^ 2$와 같은 식도 가능했을 텐데, 왜 하필이면 $\exp(f_\theta(x))$를 사용하는가? 여기에는 3가지 이유가 있다.

$\exp$ 함수는 확률의 큰 변화(large variations)를 포착할 수 있다.
Exponential family를 보면, 세상의 많은 분포는 이러한 형태이다.
통계물리학에서 다루는 energy function으로 사용할 수 있다. 위와 같이 정의하면 $-f_\theta(x)$는 energy로 해석할 수 있고, 낮은 에너지를 가지는 $x$가 높은 확률(혹은 likelihood)을 가지게 된다.

그러나 이러한 EBM에는 크게 3가지 문제가 있다.

$p _ \theta(x)$에서 sampling을 진행하기 어렵다.
likelihood $p _ \theta(x)$를 계산하고 최적화하기 어렵다.
$Z(\theta)$를 계산할 때(numerical, not analytical), 그 계산량은 $x$의 차원에 exponential하게 증가한다.

그러나 일부 task에서는 $Z(\theta)$를 계산하지 않아도 된다. 샘플 $x, x^{\prime}$에 대해 likelihood ratio를 계산하면 다음과 같다.

\[\frac{p_\theta(x)}{p_\theta(x^{\prime})} = \exp \left( f_\theta(x) - f_\theta(x^{\prime}) \right)\]

따라서, 두 샘플의 likelihood를 비교하여 무엇이 더 높은 확률을 가지는지 비교할 수 있고, anomaly detection이나 denoising 과정에 사용할 수 있다. 강의에서는 아래와 같은 예시를 소개하고 있다.

2.2. Applications

이제 EBM을 사용한 여러 모델들을 살펴보며 EBM이 어떻게 이용되는지 확인하자.

Ising Model

Ising model은 원래 통계역학에서 상전이 문제를 해결하기 위해 도입된 모델이다. 여기서는 denoising을 위하여 사용한다. True image $y \in \lbrace 0, 1 \rbrace ^ {3 \times 3}$과 corrupted image $x \in \lbrace 0, 1 \rbrace ^ {3 \times 3}$가 있다고 하자. 우리는 $x$로부터 $y$를 복구하기를 원한다. 즉, $p(y \vert x)$, 혹은 $p(y, x)$를 최대화하고자 한다. 이때 $p(y, x)$를 다음과 같이 모델링한다.

\[p(y, x) = \frac{1}{Z} \exp \left(\sum_ i \psi_ i (x_i, y_i) + \sum_ {(i,j) \in E} \psi_ {ij} (y_i, y_j) \right)\]

두 항은 각각 다음과 같은 것을 모델링한 것이다.

$\psi_ i (x_i, y_i)$: $x_i$에 의해 $y_i$가 영향을 받는다는 것을 의미한다.
$\psi_ {ij} (y_i, y_j)$: 인접한 픽셀들 사이는 유사한 값을 가져야 한다는 것을 의미한다.

이때 $p(y,x)$를 최대화하여 원하는 복구된 이미지 $y$를 얻을 수 있다.

Product of Experts

이미 학습된 여러 모델 $q_{\theta_ 1}(x)$, $r_{\theta_ 2}(x)$, $t_{\theta_ 3}(x)$가 있다고 하자. 이때 이들을 product of experts로 결합하여 다음과 같이 모델링할 수 있다.

\[p_ {\theta_1, \theta_2, \theta_3}(x) = \frac{1}{Z(\theta_1, \theta_2, \theta_3)} q_{\theta_1}(x) r_{\theta_2}(x) t_{\theta_3}(x)\]

일반적인 mixture model은 여러 모델들을 OR 연산으로 계산하는 것과 달리 이 모델은 AND 연산으로 계산한다. 이를 이용해 여러 조건에 맞는 이미지를 생성해낼 수 있다. (Du et al., 2020)

Restricted Boltzmann Machine (RBM)

RBM은 latent variable을 도입한 EBM이라고 생각하면 된다. Pixel values와 같은 visible variable을 $x \in \lbrace 0, 1 \rbrace ^ n$이라고 하고, latent variable을 $z \in \lbrace 0, 1 \rbrace ^ m$이라고 하자. 그리고 다음과 같이 joint distribution을 모델링한다.

\[\begin{aligned} p_{W, b, c}(x, z) &= \frac{1}{Z} \exp \left( x^T W z + bx + cz \right) \\ &= \frac{1}{Z} \exp \left( \sum _ {i, j} x_i z_j w_ {ij} + \sum _ i x_i b_ i + \sum _ j z_j c_ j \right) \end{aligned}\]

“Restricted”인 이유는 visible variable과 latent variable 내부에는 연결이 없기 때문이다. 즉 $x_i x_j$, $z_i z_j$의 term은 없다. 그리고 이것을 쌓은 stacked RBM, 혹은 DBM(Deep Boltzmann Machine)은 최초의 deep generative model 중 하나이다.

2009년에 발표된 Hinton의 논문에서 DBM을 이용해 데이터를 생성한 모습이다. 2009년이라고는 믿기지 않을 정도로 인상적이다.

3. Training with Sampling

3.1. Contrastive Divergence

학습의 목표는 다시 likelihood $p_\theta(x)$를 최대화하는 것이다. 다시 말하면 $\frac {\exp \left( f_\theta(x_{\text{train}}) \right)}{Z(\theta)}$를 최대화하는 것이다. 그러나 $Z(\theta)$를 계산하기 어렵기 때문에, contrastive divergence라는 아이디어를 사용한다.

핵심은 $p_\theta(x)$ 대신 계산하기 쉬운 $f_\theta(x)$를 최대화하고 싶은데, $x_{\text{train}}$만을 사용해서 $f_\theta(x_{\text{train}})$만 최대화한다고 $x_{\text{train}}$의 실제 likelihood가 올라가는 것이 아니라는 것이다. 왜? $f_\theta(x_{\text{train}})$은 un-normalized log-probability이기 때문이다. 즉 $f_\theta(x_{\text{train}})$의 값을 올릴 때마다 $Z(\theta)$의 값도 커지기 때문에 그 likelihood가 정확하지 않다. 그래서 contrastive divergence에서는 wrong points, 즉 $p_\theta$에서 샘플링한 $x_{\text{sample}}$를 사용하여 $f_\theta(x_{\text{sample}})$를 최소화하는 것도 동시에 진행한다. 즉 다음과 같은 step을 진행하는 것이다. 이를 통해 $Z(\theta)$의 값을 일정하게 유지할 수 있다.

\[\nabla_ \theta \left( f_\theta(x_{\text{train}}) - f_\theta(x_{\text{sample}}) \right)\]

이를 수학적으로도 보일 수 있다. 목표 $\max_ \theta \left[ f_\theta(x_{\text{train}}) - \log Z(\theta) \right]$를 다음과 같이 풀어쓸 수 있다.

\[\begin{aligned} \nabla _ \theta \log \frac {\exp \left( f_ \theta(x_{\text{train}}) \right)}{Z(\theta)} &= \nabla _ \theta \left( f_ \theta(x_{\text{train}}) - \log Z(\theta) \right) \\ &= \nabla _ \theta f_ \theta(x_{\text{train}}) - \frac {\nabla _ \theta Z(\theta)}{Z(\theta)} \\ &= \nabla _ \theta f_ \theta(x_{\text{train}}) - \frac {1}{Z(\theta)} \int \nabla _ \theta \exp \left( f_ \theta(x) \right) dx \\ &= \nabla _ \theta f_ \theta(x_{\text{train}}) - \frac {1}{Z(\theta)} \int \exp \left( f_ \theta(x) \right) \nabla _ \theta f_ \theta(x) dx \\ &= \nabla _ \theta f_ \theta(x_{\text{train}}) - \int \frac {\exp \left( f_ \theta(x) \right)}{Z(\theta)} \nabla _ \theta f_ \theta(x) dx \\ &= \nabla _ \theta f_ \theta(x_{\text{train}}) - \mathbb{E}_ {x_{\text{sample}}} \left[ \nabla _ \theta f_ \theta(x_{\text{sample}}) \right] \\ &\approx \nabla _ \theta f_ \theta(x_{\text{train}}) - \nabla _ \theta f_ \theta(x_{\text{sample}}) \\ \end{aligned}\]

이제 문제는 $x_{\text{sample}}$를 어떻게 샘플링할 것인가이다.

3.2. Metropolis-Hastings MCMC

Metropolis-Hastings MCMC는 다음과 같은 방법으로 $x_{\text{sample}}$를 샘플링한다.

Sampling이 쉬운 분포 $\pi(x)$로부터 $x^0$을 샘플링한다.
아래 과정을 $t = 0, 1, 2, \cdots, T-1$까지 반복한다.
1. $x^{\prime} = x^t + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$
2. $f_\theta(x^{\prime}) \geq f_\theta(x^t)$이면 $x^{t+1} = x^{\prime}$로 설정한다.
3. $f_\theta(x^{\prime}) \lt f_\theta(x^t)$이면 $\exp \left( f_\theta(x^{\prime}) - f_\theta(x^t) \right)$의 확률로 $x^{t+1} = x^{\prime}$로 설정한다. 그렇지 않으면 $x^{t+1} = x^t$로 설정한다.

이러한 MCMC(Markov Chain Monte Carlo) 방법을 사용하면 $T \rightarrow \infty$에서는 $x^T$가 $p_\theta(x)$로부터 샘플링한 것과 같아진다. 그러나, 이 방법은 반복 수가 많이 필요하고 $x$의 차원이 증가할수록 수렴 속도가 exponential하게 느려진다는 문제가 있다.

3.3. Unadjusted Langevin MCMC

Langevin MCMC는 다음과 같은 방법으로 $x_{\text{sample}}$를 샘플링한다.

Sampling이 쉬운 분포 $\pi(x)$로부터 $x^0$을 샘플링한다.
아래 과정을 $t = 0, 1, 2, \cdots, T-1$까지 반복한다.
1. $z^t \sim \mathcal{N}(0, I)$
2. $x^{t+1} = x^t + \epsilon \nabla _ x \log p_\theta(x^t) \vert _ {x = x^t} + \sqrt{2\epsilon} z^t$, $\epsilon$는 learning rate이다.

여기서도 $T \rightarrow \infty$, $\epsilon \rightarrow 0$에서는 $x^T$가 $p_\theta(x)$로부터 샘플링한 것과 같아진다. 이때 $\nabla _ x \log p_\theta(x)$를 score function이라고 한다. EBM에서는 $\nabla _ x \log p_\theta(x)$를 계산하는 것이 쉽다. 아래와 같이 $Z(\theta)$를 계산하지 않아도 되기 때문이다. 그러나 이 방법도 $x$의 차원이 증가할수록 수렴 속도가 느려진다는 문제가 있다.

\[\begin{aligned} \nabla _ x \log p_\theta(x) &= \nabla _ x f_\theta(x) - \nabla _ x \log Z(\theta) \\ &= \nabla _ x f_\theta(x) \end{aligned}\]

아래는 Langevin sampling을 이용해 생성된 이미지의 예시이다. 학습이 진행될수록 noise에서 data distribution으로 이동하는 것을 확인할 수 있다.

4. Training without Sampling

EBM을 학습하는 또 다른 방법은 sampling을 사용하지 않는 방법이다. EBM에서 sampling은 굉장히 “비싼” 방법이기 때문에, 이를 피하여 학습하는 방법들이 개발되었다. 여기서는 그 예시로 score matching, noise contrastive estimation(NCE), adversarial training을 살펴본다.

4.1. Score Matching

Langevin MCMC를 보면서 score function을 정의한 바 있다. 여기서는 먼저 (Stein) score function의 의미를 다시 한번 생각해보자.

\[s_\theta(x) := \nabla _ x \log p_\theta(x) = \nabla _ x f_\theta(x)\]

먼저 score function을 다음과 같이 시각화할 수 있다.

즉, score function은 data distribution에서 high density를 가지는 부분을 알려주는 함수이다. (사실 이것은 다변수미적분에서 gradient의 정의와 같으므로 당연한 사실이다.) EBM에서의 핵심은 $s_\theta(x)$가 $Z(\theta)$를 계산하지 않아도 되기 때문에 쉽게 계산할 수 있다는 것이다.

Score matching에서는 다음과 같은 아이디어를 사용한다. $p_{data}(x)$와 $p_\theta(x)$ 사이의 거리를 최소화하는 최적화 방법은 $Z(\theta)$의 계산이 필요하므로 어렵다. 대신, 데이터가 가야 할 방향을 알려주는 $\nabla_ x \log p_{data}(x)$와 $\nabla _ x \log p_\theta(x)$의 차이를 최소화하자는 것이다. 이는 Fisher divergence로 나타낼 수 있는데, 두 분포 간의 Fisher divergence는 다음과 같이 정의된다.

\[D_F (p, q) := \frac {1}{2} \mathbb{E}_ {x \sim p} \left[ \Vert \nabla _ x \log p(x) - \nabla _ x \log q(x) \Vert ^ 2 _ 2 \right]\]

이를 EBM에 적용하면 다음과 같다.

\[\begin{aligned} D_F (p_{data}, p_\theta) &= \frac {1}{2} \mathbb{E}_ {x \sim p_{data}} \left[ \Vert \nabla _ x \log p_{data}(x) - \nabla _ x \log p_\theta(x) \Vert ^ 2 _ 2 \right] \\ \end{aligned}\]

부분적분을 이용하면 다음과 같이 정리할 수 있다. 증명이 궁금하다면 여기를 참고하자.

\[\begin{aligned} D_F (p_{data}, p_\theta) &= \mathbb{E}_ {x \sim p_{data}} \left[\frac {1}{2} \Vert \nabla_x \log p_\theta (x) \Vert ^ 2 _ 2 + tr \left( \nabla ^ 2 _ x \log p_\theta(x) \right) \right] + \text{const.} \\ \end{aligned}\]

따라서 score matching loss를 다음과 같이 $x_i \sim p_{data}(x)$로부터 샘플링하여 추정할 수 있다.

\[\begin{aligned} \mathcal{L}_{\text{SM}}(\theta) &= \frac {1}{n} \sum _ {i=1} ^ n \left[\frac {1}{2} \Vert \nabla_x \log p_\theta (x_i) \Vert ^ 2 _ 2 + tr \left( \nabla ^ 2 _ x \log p_\theta(x_i) \right) \right] \\ &= \frac {1}{n} \sum _ {i=1} ^ n \left[\frac {1}{2} \Vert \nabla _ x f_\theta(x_i) \Vert ^ 2 _ 2 + tr \left( \nabla ^ 2 _ x f_\theta(x_i) \right) \right] \\ \end{aligned}\]

여기서 하나의 문제는 large model에 대해 Hessian($\nabla ^ 2$)을 계산하기 어렵고, 심지어 trace를 계산하는 것도 그 계산량이 매우 크다는 것이다. 이를 개선하기 위해 Denoising score matching (Vincent 2010)과 sliced score matching (Song et al. 2019) 등이 제안되었는데, 이는 score-based model 편에서 다루겠다.

4.2. Noise Contrastive Estimation (NCE)

Noise Contrastive Estimation

Noise contrastive estimation은 noise distribution과 data distribution을 대조(contrast)해서 EBM을 학습하자는 것이다. 여기서 noise distribution $p_n (x)$는 tractable하고, 샘플링이 쉬운 분포로 설정한다. 그리고 data와 noise를 구별하는 discriminator $D_\theta(x) \in \left[ 0, 1 \right]$를 학습시킨다. 학습 목표는 GAN의 discriminator와 같다.

\[\max _ \theta \mathbb{E} _ {x \sim p_{data}} \left[ \log D_\theta(x) \right] + \mathbb{E} _ {x \sim p_n} \left[ \log (1 - D_\theta(x)) \right]\]

GAN 때와 같이, optimal discriminator $D_ {\theta ^*}(x) $는 다음과 같다.

\[D_ {\theta ^*}(x) = \frac {p_{data}(x)}{p_{data}(x) + p_n(x)}\]

우리는 이 식으로부터 역으로 생각하여, discriminator를 다음과 같이 parametrize할 것이다.

\[D_\theta(x) = \frac {p_\theta (x)}{p_\theta (x) + p_n(x)}\]

따라서, discriminator가 잘 학습된다면 내부적으로는 $p_\theta(x) \approx p_{data}(x)$가 된다. 이를 다시 최적의 $p_{\theta ^ *}(x)$에 대해 풀어 쓰면 다음과 같다.

\[p_{\theta ^ *}(x) = \frac {p_n(x) D _ {\theta ^ *}(x)}{1 - D _ {\theta ^ *}(x)} = p_{data}(x)\]

이를 해석하자면 noise distribution $p_n(x)$이 classifier에 의해 $p_{data}(x)$에 가깝게 “교정”되는 것이다. 여기서 noise distribution 대신 generative model의 분포를 사용하여 해당 generative model의 성능을 향상시키는 데에도 사용할 수 있다 (Boosted Generative Models, Grover et al., 2018).

이러한 NCE(Noise Contrastive Estimation)를 EBM에 적용하려면 한 가지 문제를 해결해야 하는데, 아까와 동일하게 $Z(\theta)$를 계산하기 어렵다는 것이다. 해결책은 의외로 간단한데, $Z$도 discriminator로 학습시키는 parameter로 간주하는 것이다. 최종적으로 discriminator의 식은 다음과 같다.

\[D_{\theta, Z}(x) = \frac {\frac {\exp \left( f_\theta(x) \right)}{Z} }{\frac {\exp \left( f_\theta(x) \right)}{Z} + p_n(x)} = \frac {\exp \left( f_\theta(x) \right)}{\exp \left( f_\theta(x) \right) + Zp_n(x)}\]

따라서 학습 목표는 다음과 같고,

\[\max _ {\theta, Z} \mathbb{E} _ {x \sim p_{data}} \left[ \log D_{\theta, Z}(x) \right] + \mathbb{E} _ {x \sim p_n} \left[ \log (1 - D_{\theta, Z}(x)) \right]\]

discriminator의 식을 대입하면 각각 항은 다음과 같다.

\[\begin{aligned} \log D_{\theta, Z}(x) &= f_\theta(x) - \log \left( \exp \left( f_\theta(x) \right) + Zp_n(x) \right) \\ \log (1 - D_{\theta, Z}(x)) &= \log \left(Z p_n(x) \right) - \log \left( \exp \left( f_\theta(x) \right) + Zp_n(x) \right) \end{aligned}\]

여기서 계산의 안정성을 위해 log-sum-exp trick을 사용하는데, 이론적으로는 큰 의미가 없다. 단지 다음과 동일하다고 생각하면 된다.

\[\text{logsumexp}(a, b) = \log \left( \exp(a) + \exp(b) \right)\]

이를 적용하면 다음과 같이 된다.

\[\begin{aligned} \log \left( \exp \left( f_\theta(x) \right) + Zp_n(x) \right) &= \log \left( \exp \left( f_\theta(x) \right) + \exp \left(\log Z + \log p_n(x) \right) \right) \\ &= \text{logsumexp} \left( f_\theta(x), \log Z + \log p_n(x) \right) \\ \end{aligned}\]

따라서 NCE loss는 다음과 같이 계산할 수 있다.

$x_1, x_2, \cdots, x_n \sim p_{data}(x)$를 샘플링한다.
$y_1, y_2, \cdots, y_n \sim p_n(y)$를 샘플링한다.
NCE loss를 구한다.

\[\begin{aligned} \mathcal{L}_{\text{NCE}}(\theta, Z) &= \frac {1}{n} \sum _ {i=1} ^ n \left[ f_\theta(x_i) - \text{logsumexp} \left( f_\theta(x_i), \log Z + \log p_n(x_i) \right) \right] \\ &+ \frac {1}{n} \sum _ {i=1} ^ n \left[ \log Z + \log p_n(y_i) - \text{logsumexp} \left( f_\theta(y_i), \log Z + \log p_n(y_i) \right) \right] \\ \end{aligned}\]

NCE와 GAN은 유사한 점이 많다. 그러나 NCE는 discriminator가 noise distribution과 data distribution을 구별하는 것이 아니라, noise distribution을 data distribution에 가깝게 교정하는 것이다.

Flow Contrastive Estimation

이러한 컨셉을 발전시킨 Flow contrastive estimation (Gao et al. 2020)에 대해서도 소개한다. 아이디어는 noise distribution $p_n(x)$를 flow-based model로 설정하여, noise distribution을 학습하는 것이다. 이를 통해 noise distribution을 더 정교하게 모델링하고, 성능 향상을 꾀할 수 있다. 다른 말로 하면, $p_n(x)$를 최대한 $p_{data}(x)$에 가깝게 하여 discriminator의 classification task를 더욱 어렵게 만들겠다는 것이다. 따라서 noise distribution은 다음과 같이 flow model $p_{n, \phi}(x)$가 되고, discriminator는 다음과 같이 나타낼 수 있다.

\[D_{\theta, Z, \phi}(x) = \frac {\exp \left( f_\theta(x) \right)}{\exp \left( f_\theta(x) \right) + Zp_{n, \phi}(x)}\]

Flow-based model은 $D_{JS} (p_{data}, p_{n, \phi})$를 최소화하는 방향으로 학습된다. 따라서 학습 목표는 다음과 같다.

\[\min _ {\phi} \max _ {\theta, Z} \mathbb{E} _ {x \sim p_{data}} \left[ \log D_{\theta, Z, \phi}(x) \right] + \mathbb{E} _ {x \sim p_{n, \phi}} \left[ \log (1 - D_{\theta, Z, \phi}(x)) \right]\]

결과는 다음과 같다.

4.3. Adversarial Training

Adversarial training의 목표는 VAE에서 ELBO를 구했을 때처럼 log-likelihood의 upper bound를 구하면서, variational distribution $q_\phi(x)$를 도입하여 얻을 수 있다.

\[\begin{aligned} \mathbb{E}_ {x \sim p_{data}} \left[ \log p_\theta(x) \right] &= \mathbb{E}_ {x \sim p_{data}} \left[ f_\theta(x) \right] - \log Z(\theta) \\ &\leq \mathbb{E}_ {x \sim p_{data}} \left[ f_\theta(x) \right] - \log \int e ^ {f_\theta(x)} dx \\ &= \mathbb{E}_ {x \sim p_{data}} \left[ f_\theta(x) \right] - \log \int q_\phi(x) \frac {e ^ {f_\theta(x)}}{q_\phi(x)} dx \\ &\leq \mathbb{E}_ {x \sim p_{data}} \left[ f_\theta(x) \right] - \int q_\phi(x) \log \frac {e ^ {f_\theta(x)}}{q_\phi(x)} dx \\ &= \mathbb{E}_ {x \sim p_{data}} \left[ f_\theta(x) \right] - \mathbb{E}_ {x \sim q_\phi} \left[ f_\theta(x) \right] - H(q_\phi(x)) \\ \end{aligned}\]

$\theta$의 목적은 log-likelihood를 최대화하는 것이고, $\phi$의 목적은 upper bound가 최대한 log-likelihood에 가까워지는 것이므로 학습의 목표는 다음 minimax game이다.

\[\max _ \theta \min _ \phi \mathbb{E}_ {x \sim p_{data}} \left[ f_\theta(x) \right] - \mathbb{E}_ {x \sim q_\phi} \left[ f_\theta(x) \right] - H(q_\phi(x))\]

5. Summary

[CS236n] 7. Energy Based Models에서는 EBM에 대해 정리하였다. EBM은 flexible model이지만, intractable partition function을 가지고 있어 이를 해결하는 것이 큰 문제였다. 첫 번째 해결 방법은 contrastive divergence로 유도한, MCMC를 이용한 sampling이고, 두 번째 해결 방법은 sampling을 하지 않는 방법으로, score matching, noise contrastive estimation, adversarial training 등이 있다. 여기에서 등장한 score function의 개념은 다음 장에서 정리할 score-based model과 아주 밀접한 관련이 있다.

I'm rubatoyeong