Zhipu AI and Tsinghua University have introduced CogVLM2, a new generation of open-source multimodal visual language models designed for enhanced image understanding, video understanding, and temporal grounding. Meanwhile, the vLLM platform has added support for audio and image inputs, with video support and further multimodal models such as Qwen2-VL on the roadmap, expanding the range of use cases for generative AI; the creators of vLLM will also host a dedicated track at this year's Ray Summit. DriveGenVLM is likewise advancing autonomous driving with generated videos and vision language models. Together, these releases mark continued progress in multimodal AI and machine learning.
🎵 Audio and image inputs are now supported in vLLM, with video and others coming as well. Multi-modal LLMs are coming and will open a huge number of use cases for generative AI. At this year's Ray Summit, we're hosting a vLLM track, where the creators of vLLM, key… https://t.co/UPMbjLfBn5
Exciting updates for multi-modality on vLLM!
- Audio LMM
- Multi-image inference
- Tensor parallelism on vision encoders
- Image embeddings as input
- More LMMs supported (11 in total!)
- Upcoming support for Video LMM and Qwen2-VL
Read more here: https://t.co/R2yh9k65Or
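For context, here is a minimal sketch of passing an image to vLLM's offline inference API. The model name, image path, and prompt template are assumptions for illustration; the exact placeholder token (e.g. `<image>`) depends on the model being served.

```python
# Minimal sketch of multimodal inference with vLLM's offline API.
# Assumptions: a LLaVA-1.5-style model with an "<image>" placeholder;
# the model name and image path are illustrative, not from the source tweets.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # any supported vision LMM

image = Image.open("example.jpg").convert("RGB")
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

outputs = llm.generate(
    {
        "prompt": prompt,
        # Image data is attached alongside the text prompt.
        "multi_modal_data": {"image": image},
    },
    SamplingParams(temperature=0.2, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

The same interface underlies the other items in the list above (multi-image inference, image embeddings as input); the exact input keys and supported models are covered in the linked post.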
CogVLM2: Advancing Multimodal Visual Language Models for Enhanced Image, Video Understanding, and Temporal Grounding in Open-Source Applications #DL #AI #ML #DeepLearning #ArtificialIntelligence #MachineLearning #ComputerVision https://t.co/PwqSurO9C9