[24’ CVPR] PixelLLM: Pixel Aligned Language Models

Date: 2024.06.25 Updated: 2024.06.25

카테고리: Multimodal

🔍 Abstract

Google Internship 도중 주도적으로 진행한 연구가 CVPR 2024에 억셉된다니 놀라운 일이다. PixelLLM은 LLM의 localization을 더 강화하기 위한 연구로 볼 수 있다. LLM이 내뱉는 단어 토큰마다 reference pixel을 같이 나타내어 각 단어마다 pixel-aligned image captioning이 가능하다는 것이 이 모델의 특징이다. 그 외에도, downstream task인 referring localization, location-conditioned captioning, dense object captioning도 가능하다. 처음에 이 Figure를 보고 기존 LLM에 어떤 Visual Attention Map을 계산하여 시각화하는 새로운 방법론이 아닌가 기대했지만, 그렇지는 않다. 기존에 이러한 pixel-aligned image captioning을 수행했던 데이터셋이 구글 측에 있었고, 이를 LLM에 활용한 것임을 미리 밝혀둔다. 사실 Visualization이 정말 사람을 홀리게 하는 매력이 있다는 것이 가장 큰 장점일 것이다. 기회가 된다면 한번 보기를 추천한다: PixelLLM.

1. Introduction

ECCV 2020에서 Google Research의 Localized Narratives (LN) 데이터셋이 발표되었다. 이는 사람이 프레젠테이션을 할 때 포인터의 움직임과 설명을 모아둔 데이터셋으로, 이를 통하여 Pixel Aligned Image Captioning을 수행할 수 있었다. 굉장히 독특한 데이터셋이라고 볼 수 있다. 이는 기존의 bounding box 위주의 데이터셋에 비해 훨씬 풍부한 정보를 제공한다는 이점이 있으나, 적용에 한계가 있어 지금까지 많이 사용되지는 않았다. PixelLLM은 이러한 데이터셋을 LLM에 적용하여 학습시켜 재미있는 결과를 얻어낸다.

2. Architecture

2.1. Overall Architecture

Architecture를 정리해보면 다음과 같다.

Image Encoder $\mathcal{V}$: SAM Encoder ViT-H (Frozen) + EVA02 Pre-trained ViT-H (Trainable) 두 개를 사용하고, Prompt feature extractor에 보내기 전 channel-wise concatenation을 수행한다. 이러한 방식으로 SAM의 정보를 보존하면서도 새로운 semantic feature를 학습하도록 했다. 이를 통해 image feature $f = \mathcal{V} (I) \in \mathbb{R}^ {N \times C}$를 얻는다.
Prompt Encoder $\mathcal{P}$: SAM의 Prompt Encoder를 그대로 사용하였다. 이때 Prompt는 bounding box $b \in \mathbb{R}^ 4$로 제한하고, 따로 없다면 전체 이미지를 사용한다. 이를 통해 location prompt feature $\mathcal{P} (b)$를 얻는다.
Prompt Feature Extractor $\mathcal{E}$: location-specific visual feature $f_ l = \mathcal{E} (f, \mathcal{P} (b)) \in \mathbb{R}^ {N \times C}$를 얻는다. 이를 얻는 방법은 Q-Former와 유사한 방식으로 two-way transformer architecture를 사용한다. 이러한 방법은 RoIAlign과 방법론적으로 유사하나, RoIAlign보다 우수한 성능을 보인다.
Text Output: Autoregressive하게 $w_ i = \mathcal{L} (f_ l, w_ {1:i-1})$를 얻는다. 이때 $\mathcal{L}$은 LLM이다.
Dense Location Output: LLM의 last layer를 제거한 나머지 부분을 $\mathcal{L} ^ {-}$라고 하자. 이때 기존 text output은 $w_ i = \arg \max (v \cdot \mathcal{L} ^ {-} (f_ l, w_ {1:i-1}))$로 표현할 수 있고, 동일한 방법으로 pixel location $p_ i = \text{MLP} (\mathcal{L} ^ {-} (f_ l, w_ {1:i-1}))$을 얻도록 한다.

이때 학습의 output이 text와 dense location 두 가지이므로 각각에 대한 loss function을 더하여 학습을 진행하였다.

2.2. Downstream Tasks

저자들은 downstream vision task로 (1) referring localization and segmentation, (2) location-conditioned captioning, (3) dense object captioning을 제안한다. 하나씩 살펴보자.

Referring Localization and Segmentation

이 중 첫 번째 task는 bounding box를 예측하는 것으로, 이제 $p_ i$를 예측하는 것이 아닌 $\hat {b} \in \mathbb{R}^ 4$를 예측하는 것으로 목표가 바뀐다. 이때 마지막 식은 다음과 같이 변형된다.

\[\hat {b} = \text{MLP} (\mathcal{L} ^ {-} (f_ l, [t, <\text{EOS}>])\]

이때 <EOS> token은 bounding box를 예측하는 데 사용될 embedding token이다. 한편 Segmentation은 SAM의 구조를 적극적으로 활용한다. 이미 SAM Encoder로 얻은 $f$를 가지고 있고, bounding box를 예측했다면 $b$도 있을 테니 SAM Prompt Encoder로 $\mathcal{P} (b)$도 얻을 수 있다. 따라서 SAM mask decoder를 가져오기만 하면 Segmentation Map을 생성할 수 있다. 어떻게 보면 강력한 Foundational Model인 SAM의 강점을 잘 adaptation한 것이라고 볼 수 있다.

Location-Conditioned Captioning and Dense Object Captioning

Location-conditioned captioning은 user가 직접 bounding box를 그려서 input으로 넣는 경우를 상정한 것이고, dense object captioning은 CenterNet과 같은 기존 proposal network를 사용하여 생성한 다양한 bounding box를 input으로 넣는 경우를 상정한 것이다. 두 경우 모두 아래 식을 따라 caption sentence $s^ b$를 생성한다.

\[s_ i ^ b = \mathcal{L} (f_ l, s_ {1:i-1} ^ b)\]

3. Experiments

전체 결과는 위와 같이 SOTA를 달성하였다. 개인적으로 referring segmentation 결과에 의문을 품었는데, LISA에 비해 크게 달라진 부분도 없는데 어떻게 성능을 높일 수 있었을지 궁금했다. 고민해본 결과 가장 큰 차이점은 SAM의 prompt encoder와 decoder의 사용 여부라고 생각한다. LISA의 경우 prompt encoder가 Multimodal LLM output 그 자체이고, decoder 또한 따로 fine-tuning하였다. 그러나 PixelLLM은 SAM의 prompt encoder와 decoder를 그대로 사용하였다. 데이터의 수가 제한적인 상황에서 이러한 세팅은 foundational model을 사용하여 얻는 이득을 높였다고 볼 수 있다.

💡 Summary

PixelLLM은 Localized Narratives 데이터셋을 활용하여 LLM을 학습시켜 localization 성능을 강화한 모델이다. 또한, fine-tuning을 통해 downstream task인 referring localization and segmentation, location-conditioned captioning, dense object captioning을 수행할 수 있었다. 인상적이었던 것은 foundation model, 즉 SAM의 prompt encoder와 decoder를 그대로 사용하여 성능을 높일 수 있었다는 부분이다.

📃 Reference

[24’ CVPR] PixelLLM: Pixel Aligned Language Models

I'm rubatoyeong