ByteDance has released BAGEL, a 14-billion-parameter vision-language model with 7 billion active parameters, capable of fast and precise image editing from text inputs alone. Despite its relatively small size, BAGEL delivers performance comparable to much larger models and is fully open-weight. The company has also advanced multimodal AI through training methods that integrate diverse data types, including text, images, video frames, and webpages, enabling models to better connect visual and textual information.

ByteDance is additionally involved in MMaDA, an 8-billion-parameter unified multimodal diffusion model for textual reasoning, visual understanding, and image generation, developed in collaboration with researchers from Princeton University, Peking University, and Tsinghua University. The trend toward smaller, efficient models that handle multimodal data is gaining traction in the AI research community. Other notable developments include LLaDA, an 8-billion-parameter language model that uses diffusion techniques instead of traditional autoregressive methods, and Meta AI's Multi-SpatialMLLM, which improves spatial understanding in multimodal large language models.
This AI Paper Introduces MMaDA: A Unified Multimodal Diffusion Model for Textual Reasoning, Visual Understanding, and Image Generation Researchers from Princeton University, Peking University, Tsinghua University, and ByteDance have introduced MMaDA, a unified multimodal https://t.co/f7IZxkgAoY
Meta AI Introduces Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-modal Large Language Models Researchers from FAIR Meta and the Chinese University of Hong Kong have proposed a framework to enhance MLLMs with robust multi-frame spatial understanding. This https://t.co/WHEVTVoZ8e
This paper shows LLMs trained only on text can understand images and audio just by reading, allowing them to act as encoders with minimal fine-tuning. Methods 🔧: → Input images or audio waveforms are divided into non-overlapping patches. → Each patch is flattened into a vector. https://t.co/E1aHaRUDsA
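The patching step described in the tweet can be sketched in a few lines of NumPy: split the input into non-overlapping tiles and flatten each tile into a vector, ViT-style. This is a minimal illustration of the general technique, not the paper's implementation; the `patchify` helper and its shapes are assumptions.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an H x W x C image into non-overlapping patch x patch tiles,
    flattening each tile into one vector (hypothetical helper)."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "dims must divide by patch size"
    # Reshape into a grid of patches, reorder so each tile is contiguous,
    # then flatten every tile into a single row vector.
    tiles = image.reshape(H // patch, patch, W // patch, patch, C)
    tiles = tiles.transpose(0, 2, 1, 3, 4)       # (rows, cols, patch, patch, C)
    return tiles.reshape(-1, patch * patch * C)  # (num_patches, patch_dim)

# A 224x224 RGB image with 16x16 patches yields 196 vectors of length 768.
img = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(img, 16)
print(tokens.shape)  # (196, 768)
```

The same idea applies to audio: a 1-D waveform is reshaped into non-overlapping windows, each window becoming one input vector for the frozen LLM.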