[22’ PMLR] GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Date: 2024.04.02 Updated: 2024.04.03

카테고리: Generation

0. Abstract

Diffusion model을 활용하여 text-conditional image synthesis를 수행하는 GLIDE(Guided Language to Image Diffusion for Generation and Editing)를 제안한다.
CLIP guidance와 classifier-free guidance라는 두 가지 guidance strategies를 제안하고, CFG가 photorealism 및 caption similiarity에서 더 좋은 human evaluation을 받았다.
GLIDE fine tuning을 통해 image inpainting, text-driven image editing 등도 가능하다.

1. Introduction

GLIDE는 diffusion model을 활용하여 T2I(Text-to-Image) synthesis를 수행하는 모델들의 조상격에 있는 모델이다. 저자들은 diffusion model을 잘 사용하여 아래와 같은 task를 구현할 수 있음을 보였다.

Text-conditional Image Synthesis
Text-conditional Image Inpainting
- Text-guided inpainting task
- series of text-guided inpainting tasks
- Sketches to realistic images w/ SDEdit

2. Text-conditional Image Synthesis

2.1. CLIP Guidance

CLIP model의 image encoder를 $f(x)$, caption encoder를 $g(c)$라고 하자. CLIP은 true pair $(x, c)$에 대해 $f(x) \cdot g(c)$를 최대화하고, false pair에 대해서는 최소화하는 방향으로 학습된다. 저자들은 이러한 vanilla CLIP 대신 noised CLIP을 훈련시킨다. 즉 $x$ 대신 diffusion model에서 나오는 noised image $x_t$를 사용하여 image encoder $f(x_t, t)$를 훈련시킨다. 이렇게 훈련하는 이유는 CLIP을 이용하여 모든 diffusion step에 CLIP guidance를 적용하기 위함이다.

CG(Classifier Guidance)에서는 다음과 같이 class label $y$를 이용하여 classifier guidance를 적용했다.

\[\hat{\mu}_ \theta (x_t \vert y) = \mu_ \theta (x_t \vert y) + s \cdot \Sigma_ \theta (x_t \vert y) \nabla _ {x_t} \log p_\phi (y \vert x_t)\]

이와 동일한 방식으로 caption $c$를 이용하여 CLIP guidance를 적용하는 것도 가능하다. 이 식을 계산하기 위해 noised CLIP을 학습시키는 것이다.

\[\hat{\mu}_ \theta (x_t \vert c) = \mu_ \theta (x_t \vert c) + s \cdot \Sigma_ \theta (x_t \vert c) \nabla _ {x_t} (f(x_t) \cdot g(c))\]

2.2. Classifier-free Guidance

마찬가지로 text condition을 이용한 CFG(Classifier-free Guidance)도 가능하다. CG보다 CFG가 좋았던 것처럼 본 논문에서도 CFG가 photorealism 및 caption similiarity에서 더 좋은 human evaluation을 받게 된다. Caption $c$는 tokenize되어 Transformer를 거쳐 token embedding이 되고, 이는 diffusion model에서 class embedding 및 attention layer에 전달된다. 이를 통해 다음과 같이 CFG를 적용할 수 있다.

\[\hat{\epsilon} _ \theta (x_t \vert c) = \epsilon _ \theta (x_t) + s \cdot (\epsilon _ \theta (x_t \vert c) - \epsilon _ \theta (x_t))\]

3. Results

3.1. Comparision with SOTA

결과는 너무 좋다. 주목할 점은 다른 모델과는 달리 zero-shot으로도 좋은 성능을 보인다는 점, 그리고 특정 화풍을 따라하거나 여러 컨셉을 결합하여 이해하는 등 복잡한 상황도 잘 이해하여 이미지를 생성한다는 점이다.

3.2. Comparision between Guidance Methods

Diversity-fidelity trade-off를 보면 CFG가 CLIP guidance보다 더 좋은 결과를 보인다. 다만 Figure 6(c)를 보면 CLIP score 측면에서 보면 CLIP guidance가 더 좋은 결과를 보이는 것처럼 보인다. 저자들은 이것이 CLIP guidance가 evaluation CLIP model에 adversarial examples를 생성하기 때문이라고 하였다. 이해는 간다. CLIP guidance가 CLIP score가 높은 것은 당연한 일이다. 오히려 y축을 기준으로 읽어서 동일한 FID에서 CLIP guidance의 CLIP score가 더 높다고 해석하는 것이 옳지 않나 하는 생각이 든다.

Photorealism 및 caption similiarity에서 human evaluation 결과 역시 CFG가 더 좋은 결과를 보였다. 저자들은 CLIP score가 좋은 metric이 아니라고 주장한다.

3.3. Limitations

이러한 모델임에도 불구하고 한계도 있다. Text를 unusual하게 주는 경우 제대로 그리지 못한다는 것이다. 8개의 다리가 있는 고양이나, 삼각형 바퀴가 달린 자동차 등이 이에 해당한다. 그리고 당연히 diffusion model의 치명적인 단점으로 GAN과 같은 기존 방법보다 현저히 느리다.

💡 Summary

GLIDE(Guided Language to Image Diffusion for Generation and Editing)는 diffusion model을 활용하여 text-conditional image synthesis를 수행하는 모델의 시초 중 하나이다.
CLIP을 이용한 CG(Classifier Guidance)보다 Transformer embedding을 이용한 CFG(Classifier-free Guidance)가 더 좋은 결과를 보인다.
Unusual prompt에 대해 제대로 그리지 못하고, 샘플링이 느리다는 단점이 있다.
Opinion: CLIP embedding을 CFG 방법으로 사용하는 것도 비교해주었으면 좋았을 것 같다. 그래야 더 정확한 비교가 가능하지 않은가? 어쨌든 T2I 분야에서 굉장히 중요한 논문임은 분명하다.

📃 Reference

[22’ PMLR] GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

I'm rubatoyeong