Recent advances in Multimodal Large Language Models (MLLMs) have drawn significant interest from the AI research community, particularly around their ability to interpret visual elements in mathematical problems and broader visual contexts. Google AI Research, for instance, introduced ChartPaLI-5B, a method aimed at enhancing vision-language models for multimodal reasoning. Despite this progress, concerns remain about whether MLLMs rely on visual or textual shortcuts rather than genuine comprehension. CLIP, a cornerstone of state-of-the-art (SOTA) MLLMs such as LLaVA (and possibly GPT-4V), has been shown to struggle with compositional generalization, a prerequisite for fine-grained visual understanding. A recent paper presented at NAACL 2024 proposes ComCLIP, a training-free method that improves the compositionality of CLIP-like models, addressing one of the key bottlenecks in the field.
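To make the compositionality gap concrete, here is a minimal sketch that probes a CLIP model with a caption and its subject-object swap. It uses the Hugging Face transformers CLIP API as an illustration; the checkpoint name, the image path, and the caption pair are placeholder assumptions, and this is not the ComCLIP method itself.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP-like dual encoder works the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
captions = [
    "a dog chasing a cat",  # assumed to match the image
    "a cat chasing a dog",  # same words, swapped roles
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarities; softmax over the two captions.
scores = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for caption, score in zip(captions, scores.tolist()):
    print(f"{score:.3f}  {caption}")
```

An encoder that matches images and text in a largely bag-of-words fashion assigns nearly identical scores to both captions; the closer the two scores, the weaker the compositional grounding. This inference-time weakness is the kind of failure a training-free approach like ComCLIP targets.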
CLIP, the backbone of SOTA Multimodal LLMs such as LLaVA (and GPT-4V?), still suffers from compositional generalization. Our recent #NAACL2024 paper introduces a training-free method, ComCLIP, to improve the compositionality of CLIP-like models, towards more fine-grained visual… https://t.co/cmychcrgud