2024.10.07 [23’ CVPR] ZS-RS: Zero-Shot Referring Image Segmentation with Global-Local Context Features Vision Referring Image Segmentation Training-free
2024.10.07 [24’ ECCV] SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference Vision Semantic Segmentation Training-free
2024.09.10 [24’] xGen-MM (BLIP-3): A Family of Open Large Multimodal Models Multimodal Few-shot Learning Foundation Model
2024.09.10 [23’ ICML] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models Multimodal Foundation Model
2024.09.04 [24’] LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models Multimodal Attention Efficiency