ByteDance has released BAGEL, a 14-billion-parameter vision-language model with 7 billion active parameters, capable of fast and precise image editing from text inputs alone. Despite its relatively small size, BAGEL delivers performance comparable to much larger models and is fully open-weight. The company has also advanced multimodal AI through training methods that integrate diverse data types, including text, images, video frames, and webpages, enabling models to better connect visual and textual information.

ByteDance is additionally involved in MMaDA, an 8-billion-parameter unified multimodal diffusion model for textual reasoning, visual understanding, and image generation, developed in collaboration with researchers from Princeton University, Peking University, and Tsinghua University. The trend toward smaller, efficient models that handle multimodal data is gaining traction in the AI research community. Other notable developments include LLaDA, an 8-billion-parameter language model that uses diffusion techniques instead of traditional autoregressive methods, and Meta AI's Multi-SpatialMLLM, which improves spatial understanding in multimodal large language models.
This AI Paper Introduces MMaDA: A Unified Multimodal Diffusion Model for Textual Reasoning, Visual Understanding, and Image Generation Researchers from Princeton University, Peking University, Tsinghua University, and ByteDance have introduced MMaDA, a unified multimodal https://t.co/f7IZxkgAoY
Meta AI Introduces Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-modal Large Language Models Researchers from FAIR Meta and the Chinese University of Hong Kong have proposed a framework to enhance MLLMs with robust multi-frame spatial understanding. This https://t.co/WHEVTVoZ8e
This paper shows LLMs trained only on text can understand images and audio just by reading, allowing them to act as encoders with minimal fine-tuning. Methods 🔧: → Input images or audio waveforms are divided into non-overlapping patches. → Each patch is flattened into a vector. https://t.co/E1aHaRUDsA
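The patching step described in the tweet can be sketched in a few lines of NumPy: split the input into non-overlapping tiles and flatten each tile into a vector, ViT-style. This is a minimal illustration of the general technique, not the paper's implementation; the `patchify` helper and its shapes are assumptions.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an H x W x C image into non-overlapping patch x patch tiles,
    flattening each tile into one vector (hypothetical helper)."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "dims must divide by patch size"
    # Reshape into a grid of patches, reorder so each tile is contiguous,
    # then flatten every tile into a single row vector.
    tiles = image.reshape(H // patch, patch, W // patch, patch, C)
    tiles = tiles.transpose(0, 2, 1, 3, 4)       # (rows, cols, patch, patch, C)
    return tiles.reshape(-1, patch * patch * C)  # (num_patches, patch_dim)

# A 224x224 RGB image with 16x16 patches yields 196 vectors of length 768.
img = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(img, 16)
print(tokens.shape)  # (196, 768)
```

The same idea applies to audio: a 1-D waveform is reshaped into non-overlapping windows, each window becoming one input vector for the frozen LLM.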