The Beijing Academy of Artificial Intelligence (BAAI) has launched Emu3, a state-of-the-art multimodal model trained via next-token prediction. Emu3 generates images by predicting vision tokens and supports various resolutions and styles. It outperforms existing models such as SDXL in image generation, LLaVA-1.6 in image understanding, and OpenSora in video generation. Emu3 has 9 billion parameters and includes a state-of-the-art tokenizer. The model is available with open weights and can be accessed on Hugging Face. A demo is also available.
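To make "generating images by predicting vision tokens" concrete, here is a toy sketch of next-token prediction over a unified vocabulary where text and vision tokens sit in one sequence. Everything here is invented for illustration (the bigram table stands in for the real 9B-parameter model; token names like `v17` are not Emu3's actual codes):

```python
# Toy illustration of unified next-token prediction: text tokens and
# discrete vision tokens share one sequence and are generated greedily.
# The BIGRAM table is a stand-in for a real autoregressive model.

BIGRAM = {
    "<bos>": "a", "a": "cat", "cat": "<img>",
    "<img>": "v17", "v17": "v42", "v42": "v99", "v99": "</img>",
}

def generate(prompt, max_new_tokens=10):
    """Greedy next-token prediction until </img> or the budget runs out."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        nxt = BIGRAM.get(seq[-1])
        if nxt is None:
            break
        seq.append(nxt)
        if nxt == "</img>":
            break
    return seq

print(generate(["<bos>"]))
# → ['<bos>', 'a', 'cat', '<img>', 'v17', 'v42', 'v99', '</img>']
```

In the real model, the vision tokens between `<img>` and `</img>` would be handed to the tokenizer's decoder to reconstruct pixels; here they stay symbolic.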
Emu3, the new kid on the block: VLM 🤝 text-to-image. Emu3 is a VLM that can also generate images, making it a truly end-to-end multimodal model 🖼️✏️ 📋 Code: https://t.co/ErohhuIJzw 🏋️ Weights: https://t.co/XHf1fmMJ7W ▶️ Demo: https://t.co/90nO0Sxbh8 https://t.co/DQgMwvEurl
Emu3 is a next-token-prediction, early-fusion multimodal model. And it's open weights! It outperforms SDXL on image generation, LLaVA-1.6 on image understanding, and OpenSora on video generation. It's 9B parameters and includes a state-of-the-art tokenizer (VQ, of course) https://t.co/phE70HEu8J
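The "VQ" behind the tokenizer is vector quantization: each continuous patch embedding is replaced by the index of its nearest codebook vector, turning an image into a sequence of discrete tokens the language model can predict. A minimal sketch of that lookup step, with a toy 2-D codebook (Emu3's actual codebook size and dimensions are not given here):

```python
# Minimal vector-quantization step behind a VQ tokenizer: map a
# continuous vector to the index of its nearest codebook entry (L2).
# Codebook and patch values are toy examples, not Emu3's.

def quantize(vec, codebook):
    """Return the index of the codebook entry closest to vec."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist2(vec, codebook[i]))

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
patch = (0.9, 0.1)
print(quantize(patch, codebook))  # → 1, the index of (1.0, 0.0)
```

Those integer indices are exactly the "vision tokens" the model predicts autoregressively, and the decoder inverts the mapping back to pixels.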
Emu3🔥 The latest multimodal model released by @BAAIBeijing from the Chinese community, trained via next-token prediction alone. https://t.co/e2ltjCByeF Paper: https://t.co/ovSnsYghR0 ✨ Predicts vision tokens to generate images, supporting various resolutions and styles. ✨…