[23’ CVPR] GRES: Generalized Referring Expression Segmentation

Date: 2024.06.21 Updated: 2024.06.21

카테고리: Vision

🔍 Abstract

본 논문에서는 GRES(Generalized Referring Expression Segmentation)이라는 Task를 제안하고, 이에 해당하는 baseline model인 ReLA를 제안한다. 이 논문은 CVPR 2023 Highlight을 받은 논문이다.

먼저 GRES는 RES(Referring Expression Segmentation)을 확장한 Task로, 임의의 target object 수를 가진 multi-target expression과 empty-target expression에 대해 robust하게 대답해야 하는 Task이다. 기존 RES task는 one-to-one segmentation만 가능하기에 굉장히 불편하고 깊이 있는 이해가 어려웠는데, GRES로 이를 해결한 것이다. 이를 서포트하기 위해 gRefCOCO라는 데이터셋도 같이 제안하였다.

다음으로 ReLA는 이러한 GRES 문제를 풀기 위해 제안한 모델로 Region-Image Cross Attention(RIA)를 통해 Adaptive Region Feature를 추출하고, Region-Language Cross Attention(RLA)를 통해 Language Interaction을 수행한다. 이 모델은 GRES, RES 모두에서 SOTA를 달성했다.

1. GRES

저자들은 gRefCOCO 데이터셋을 구축하는 과정에서 다음 기준으로 데이터셋을 구성했다.

Multi-target samples.
1. Usage of counting expressions. (e.g. “The two people on the far left”)
2. Compound sentence structures without geometrical relation. (e.g. “A and B”, “A except B”, and “A with B or C”)
3. Domain of attributes. (e.g. “the right lady in blue and kid in white”)
4. More complex relationships. (e.g. Figure 3(b))
No-target samples.
1. The expression cannot be totally irrelevant to the image. (e.g. in Figure 3(a), “The kid in blue” (O), “The dog in blue” (X))
2. The annotators could choose a deceptive expression.

2. ReLA

처음에 baseline architecture를 보고 놀랐다. 굉장히 많은 부분이 disentangle되어 있는데, 이 각각이 모두 의미가 있다는 것을 Ablation Study를 통해 증명했다. 각각의 Feature를 쉽게 이해해보자면 다음과 같다.

Region-Image Cross Attention(RIA)
- Adaptive Region Feature를 추출하기 위한 모듈.
- 일반적인 ViT에서 단순히 이미지를 $P^2$ 개의 Patch로 hard-splitting하는 것과 달리, 전체 이미지에서 Adaptive하게 Region Feature를 추출할 수 있도록 하여 Patch의 Flexibility를 높였다. (Hard-split과 비교하여 큰 폭의 성능 향상이 있다. 아래 Table 2 참고.)
Region-Language Cross Attention(RLA)
- Language Interaction을 수행하기 위한 모듈.
Three Outputs
- Region Filter $F_f$: Mask Feature와 곱해져 Regional Segmentation Mask를 생성한다.
- Region Probability Map $x_r$: 각 Region이 Target Object일 확률을 나타낸다. (필요할까 싶겠지만, 이런 Intermediate Feature를 Explicit하게 학습하는 것이 성능에 도움이 된다고 한다.)
- No-target Judgement Score $E$: Target 여부를 0/1로 나타낸다. (명시적으로 존재 여부를 나타내는 것이 성능에 도움이 된다고 한다.)

3. Results

3.1. Importance of Dataset

Importance of Dataset. GRES Dataset으로 훈련해야 충분한 성능을 발휘할 수 있다.

3.2. GRES Results

GRES에서 SOTA를 달성했다. 다른 모델들과 큰 차이가 난다.

그러나, 아직 서로 붙어 있는 object를 잘 구별하지 못하는 모습이다. 아직 Fine-grained detail을 구분하는 능력이 부족하다.

3.3. RES Results

RES에서도 SOTA를 달성했다. 이는 ReLA의 Explicit Modeling이 다른 Task에도 도움이 된다는 것이다.

💡 Summary

GRES는 RES를 확장한 Task로, multi-target expression과 empty-target expression에 대해 robust하게 대답해야 하는 Task이다. 이를 위해 gRefCOCO라는 데이터셋을 구축하였다. ReLA는 이러한 GRES 문제를 풀기 위해 제안한 모델로 Region-Image Cross Attention(RIA)를 통해 Adaptive Region Feature를 추출하고, Region-Language Cross Attention(RLA)를 통해 Language Interaction을 수행한다. 이 모델에서의 Explicit Modeling이 성능 향상의 주 원인이다. 이 모델은 GRES, RES 모두에서 SOTA를 달성했다. 개인적으로 여기서 더 집중할만한 부분은 RIA와 같은 Adaptive Region Feature Extractor인 것 같다. 이렇게 Patch embedding의 Flexibility를 높이는 것은 어디에나 추가할 수 있는 방법론이기에 다른 Task에도 적용해볼 만한 가치가 있다고 생각한다.

📃 Reference

[23’ CVPR] GRES: Generalized Referring Expression Segmentation

I'm rubatoyeong