BAGEL is the open-source Unified Multimodal Model you can fine-tune, distill, and deploy anywhere. It offers functionality comparable to proprietary systems like GPT-4o and Gemini 2.0 in an open form, and unlocks useful image generation through a natively multimodal design https://t.co/B2N6G17RGd
'Tis the year of any-to-any/omni models. BAGEL by @BytedanceTalk is a 7B native multimodal model that understands and generates both image + text. It outperforms leading VLMs like Qwen 2.5-VL 👏 and has an Apache 2.0 license 😱 https://t.co/sjFIBtDCcV
ByteDance released a 37-page report on training a Gemini-like native multimodal model! The most interesting part imo is the "Integrated Transformer" architecture, where the same backbone acts both as a GPT-like autoregressive model and as a DiT diffusion model https://t.co/eWkQxg0E8S
ByteDance's Seed team has introduced BAGEL, a 7-billion-parameter open-source multimodal foundation model that integrates text, image, video, and 3D understanding and generation capabilities into a single unified decoder-only architecture. Released under the Apache 2.0 license, BAGEL outperforms leading vision-language models such as Qwen 2.5-VL and InternVL-2.5. The model employs a Mixture-of-Transformer-Experts (MoT) approach with dual visual encoders and is pretrained on trillions of interleaved multimodal tokens. ByteDance also published a detailed 37-page report outlining BAGEL's "Integrated Transformer" architecture, in which the same backbone functions both as a GPT-like autoregressive model and as a DiT diffusion model. BAGEL's open-source nature allows fine-tuning, distillation, and deployment across a range of applications, providing functionality comparable to proprietary systems like GPT-4o and Gemini 2.0 while enabling image generation through its native multimodal design.
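To make the "Integrated Transformer" idea concrete, here is a minimal PyTorch sketch of a single shared backbone driving both a GPT-like autoregressive text head and a DiT-style diffusion head. This is an illustration under stated assumptions, not BAGEL's actual implementation: the module names, dimensions, timestep conditioning, and plain `TransformerEncoder` backbone are all simplifications of the real design.

```python
# Sketch of an "Integrated Transformer": one shared backbone used both as a
# GPT-like autoregressive model (causal mask, next-token logits) and as a
# DiT-like diffusion model (bidirectional attention over noisy latents,
# conditioned on a timestep). All names and sizes here are illustrative.
import torch
import torch.nn as nn

class IntegratedTransformer(nn.Module):
    def __init__(self, vocab_size=32000, latent_dim=16, d_model=512,
                 n_layers=6, n_heads=8):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)   # text tokens
        self.latent_proj = nn.Linear(latent_dim, d_model)    # noisy image latents
        self.time_mlp = nn.Sequential(                       # diffusion timestep
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # shared weights
        self.lm_head = nn.Linear(d_model, vocab_size)        # next-token logits
        self.noise_head = nn.Linear(d_model, latent_dim)     # predicted noise

    def forward_text(self, tokens):
        # Autoregressive pass: causal mask so each position sees only the past.
        h = self.token_emb(tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.lm_head(self.backbone(h, mask=causal))

    def forward_diffusion(self, noisy_latents, t):
        # Diffusion pass: full attention over latent patches plus a timestep
        # embedding; the head predicts the noise to remove, DiT-style.
        h = self.latent_proj(noisy_latents) + self.time_mlp(
            t.float().unsqueeze(-1)).unsqueeze(1)
        return self.noise_head(self.backbone(h))

model = IntegratedTransformer()
logits = model.forward_text(torch.randint(0, 32000, (2, 10)))          # (2, 10, 32000)
eps = model.forward_diffusion(torch.randn(2, 64, 16), torch.rand(2))   # (2, 64, 16)
```

Per the report, BAGEL goes further than this sketch: its MoT design routes understanding and generation tokens through separate expert parameters while they still attend to each other in one shared sequence; the sketch collapses that into a single set of weights to keep the core routing idea visible.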