[24' EMNLP] TroL: Traversal of Layers for Large Language and Vision Models 2024.10.14 | Multimodal Adapter
[24' EMNLP] MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model 2024.10.14 | Multimodal Interpretability
[24' ACL] Cross-Modal Projection in Multimodal LLMs Doesn’t Really Project Visual Attributes to Textual Space 2024.10.14 | Multimodal Alignment
[24' ECCV] BLINK: Multimodal Large Language Models Can See but Not Perceive 2024.10.13 | Multimodal Benchmark Perception
[24'] Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations 2024.10.11 | Multimodal Interpretability
[24'] Towards Interpreting Visual Information Processing in Vision-Language Models 2024.10.11 | Multimodal Interpretability
[24'] Quadratic Is Not What You Need For Multimodal Large Language Models 2024.10.11 | Multimodal Efficiency Pruning
[24'] Intriguing Properties of Large Language and Vision Models 2024.10.11 | Multimodal Interpretability
[24'] PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model 2024.10.07 | Multimodal Unified Segmentation
[24'] xGen-MM (BLIP-3): A Family of Open Large Multimodal Models 2024.09.10 | Multimodal Few-shot Learning Foundation Model
[23' ICML] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models 2024.09.10 | Multimodal Foundation Model
[24'] LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models 2024.09.04 | Multimodal Attention Efficiency
[24' ECCV] FlexAttention for Efficient High-Resolution Vision-Language Models 2024.09.04 | Multimodal Attention Efficiency
[24' ECCV] FastV: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models 2024.09.03 | Multimodal Attention Efficiency
[24'] HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments 2024.08.27 | Multimodal Attention Efficiency
[24'] Hallucination of Multimodal Large Language Models: A Survey 2024.08.24 | Multimodal Hallucination Summary
[24' ACL Findings] Finding and Editing Multi-Modal Neurons in Pre-Trained Transformers 2024.08.23 | Multimodal Interpretability
[Summary] Multimodal Contrastive Decoding Variants (2) 2024.08.19 | Multimodal Contrastive Decoding Hallucination Summary
[24' ICLR-WS] A Concept-Based Explainability Framework for Large Multimodal Models 2024.08.19 | Multimodal Interpretability
[24'] HIO: Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization 2024.08.19 | Multimodal Contrastive Decoding Hallucination Interpretability Summary
[Summary] Multimodal Contrastive Decoding Variants (1) 2024.08.17 | Multimodal Contrastive Decoding Hallucination Summary
[24'] IBD: Alleviating Hallucinations in Large Vision-Language Models via Image-Biased Decoding 2024.08.17 | Multimodal Contrastive Decoding Hallucination Interpretability Summary
[24' CVPR] M3ID: Multi-Modal Hallucination Control by Visual Information Grounding 2024.08.17 | Multimodal Contrastive Decoding Hallucination
[24' ICML] Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations 2024.08.17 | Multimodal Evaluation
[24' CVPR] VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding 2024.08.16 | Multimodal Contrastive Decoding Hallucination
[24' ICLR-WS] Skip \n: A Simple Method to Reduce Hallucination in Large Vision-Language Models 2024.08.14 | Multimodal Decoding Hallucination
[24' ICLR] Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning 2024.08.13 | Multimodal Evaluation In-context Learning
[24'] VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap 2024.08.13 | Multimodal Decoding Hallucination
[24' CVPR] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation 2024.07.30 | Multimodal Hallucination Interpretability
[24' CVPR] HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models 2024.07.30 | Multimodal Benchmark Hallucination
[24' ICML] Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models 2024.07.05 | Multimodal Analysis Instruction Tuning Visual Encoder
[24' CVPR] Osprey: Pixel Understanding with Visual Instruction Tuning 2024.07.05 | Multimodal Instruction Tuning Visual Encoder Visual Perception
[24'] ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models 2024.07.05 | Multimodal Visual Encoder
[24'] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs 2024.07.05 | Multimodal Analysis Visual Encoder
[24'] LaSagnA: Language-based Segmentation Assistant for Complex Queries 2024.07.03 | Multimodal Referring Segmentation
[24' CVPR] Compositional Chain-of-Thought Prompting for Large Multimodal Models 2024.07.03 | Multimodal Chain-of-Thought
[24' CVPR] AnyRef: Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception 2024.07.03 | Multimodal Referring Segmentation Visual Grounding
[24' CVPR] PerceptionGPT: Effectively Fusing Visual Perception into LLM 2024.07.02 | Multimodal Analysis Detection Segmentation Visual Perception
[23'] LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models 2024.07.02 | Multimodal Segmentation Visual Grounding
[24'] F-LMM: Grounding Frozen Large Multimodal Models 2024.07.02 | Multimodal Chain-of-Thought Segmentation Visual Grounding
[24' ICLR] KOSMOS-2: Grounding Multimodal Large Language Models to the World 2024.07.01 | Multimodal Visual Grounding
[23' NeurIPS] KOSMOS-1: Language Is Not All You Need: Aligning Perception with Language Models 2024.07.01 | Multimodal Chain-of-Thought Foundation Model In-context Learning
[24' CVPR] GROUNDHOG: Grounding Large Language Models to Holistic Segmentation 2024.07.01 | Multimodal Panoptic Segmentation Segmentation Visual Grounding
[24'] GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest 2024.07.01 | Multimodal Instruction Tuning Visual Grounding
[24' ICLR] Ferret: Refer and Ground Anything Anywhere at Any Granularity 2024.07.01 | Multimodal Detection Visual Grounding
[23'] Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic 2024.06.26 | Multimodal Chain-of-Thought Detection Visual Grounding
[24'] MMStar: Are We on the Right Way for Evaluating Large Vision-Language Models? 2024.06.26 | Multimodal Benchmark Bias
[24' CVPR] VIVL: Towards Better Vision-Inspired Vision-Language Models 2024.06.25 | Multimodal Adapter Prefix Tuning
[24'] Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models 2024.06.25 | Multimodal Chain-of-Thought
[23'] LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model 2024.06.25 | Multimodal Joint Training Prefix Tuning
[24' TMLR] Multimodal Chain-of-Thought Reasoning in Language Models 2024.06.22 | Multimodal Chain-of-Thought
[24' CVPR] PixelLM: Pixel Reasoning with Large Multimodal Model 2024.06.21 | Multimodal Reasoning Segmentation
[24' CVPR] GSVA: Generalized Segmentation via Multimodal Large Language Models 2024.06.21 | Multimodal Referring Segmentation
[24' CVPR] GLaMM: Pixel Grounding Large Multimodal Model 2024.06.21 | Multimodal Segmentation Dataset Visual Grounding
[24' CVPR] LLaFS: When Large Language Models Meet Few-Shot Segmentation 2024.06.11 | Multimodal Few-Shot Segmentation In-context Learning
[24' CVPR] LLaVA-1.5: Improved Baselines with Visual Instruction Tuning 2024.06.10 | Multimodal Adapter Instruction Tuning
[24' CVPR] Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs 2024.06.07 | Multimodal Analysis Visual Encoder
[24' CVPR] Honeybee: Locality-enhanced Projector for Multimodal LLM 2024.06.04 | Multimodal Adapter Analysis
[23'] LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model 2024.06.02 | Multimodal Reasoning Segmentation
[24'] MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training 2024.05.29 | Multimodal Analysis In-context Learning
[23'] LISA: Reasoning Segmentation via Large Language Model 2024.05.29 | Multimodal Reasoning Segmentation
[22' NeurIPS] Flamingo: a Visual Language Model for Few-Shot Learning 2024.05.27 | Multimodal Foundation Model In-context Learning
[21' ICML] VL-T5: Unifying Vision-and-Language Tasks via Text Generation 2024.05.15 | Multimodal Multi-task Learning