Recent advances in multimodal generative models were highlighted in several presentations and papers at the #ICML2024 conference. JinaAI reported a 165% improvement in the text encoder performance of its CLIP model; a colleague of the presenter detailed the work, emphasizing that contrastive text-image pre-training is effective for cross-modal retrieval while noting its limited text capability. Lumina-mGPT, a family of multimodal autoregressive models designed for diverse vision and language tasks, was also introduced. Other discussions covered adapting large pre-trained text-to-image (T2I) models for immersive scene generation, particularly panorama creation, where the high cost of acquiring multi-view images makes tuning-free generation attractive. Despite these advances, it was noted that existing vision-language models still struggle with reasoning over spatial relationships. Finally, the TexGen project was presented, which focuses on text-guided 3D texture generation through multi-view sampling and resampling techniques.
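To make the contrastive text-image pre-training mentioned above concrete, here is a minimal PyTorch sketch of the symmetric InfoNCE objective used by CLIP-style models: matching image/text embedding pairs in a batch are pulled together while all other pairings serve as negatives. This is a generic illustration, not JinaAI's actual training code; the embedding dimension, batch size, and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) tensors where matching rows
    are positive image/text pairs.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the positives.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Random embeddings stand in for image- and text-encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```

Because the loss only scores cross-modal pairs, the text encoder is never trained on text-to-text matching, which is one way to understand the text-capability limitation the presentation pointed out.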
TexGen: Text-Guided 3D Texture Generation with Multi-view Sampling and Resampling https://t.co/UkgXLGe8I0 https://t.co/IwVjToZSq7
Text-to-Image (T2I) and multimodal large language models (MLLMs) have been adopted in solutions for several computer vision and multimodal learning tasks. However, it has been found that such vision-language models lack the ability to correctly reason over spatial relationships.… https://t.co/Lstw0u0ESV
Immersive scene generation, notably panorama creation, benefits significantly from the adaptation of large pre-trained text-to-image (T2I) models for multi-view image generation. Due to the high cost of acquiring multi-view images, tuning-free generation is preferred. However,… https://t.co/xU9RCgGr2a