DeepSeek AI, in collaboration with researchers from the University of Hong Kong and Peking University, has unveiled Janus, a 1.3 billion parameter multimodal model with image generation capabilities. Janus is an autoregressive framework that unifies multimodal understanding and generation by decoupling visual encoding: it uses separate visual encoders for understanding and for generation, which improves both flexibility and performance. The model is built on DeepSeek-LLM-1.3b-base and uses SigLIP-L as the vision encoder for understanding. Despite these capabilities, Janus remains compact at roughly 1.3 billion parameters. Using a single transformer architecture, it is trained on approximately 500 billion text tokens and employs a dedicated tokenizer for image generation with a downsampling rate of 16. As DeepSeek AI's first multimodal release on Hugging Face, the model is now available for download.
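Two practical details follow from that description: the generation tokenizer's downsampling rate fixes how many discrete image tokens the model produces per image, and the Hugging Face release can in principle be pulled with the standard transformers loaders. The sketch below illustrates both. The repo id deepseek-ai/Janus-1.3B, the trust_remote_code loading path, and the 384x384 input resolution are assumptions not stated in the article; the model card should be treated as the authoritative loading procedure.

```python
# Minimal sketch of fetching the Janus checkpoint from Hugging Face.
# The repo id and loading path below are assumptions, not confirmed by the
# article; check the model card for the exact processor and loader classes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/Janus-1.3B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# With a downsampling rate of 16, a 384x384 input image (resolution assumed
# here) maps to a (384/16) x (384/16) grid, i.e. 576 discrete image tokens
# on the generation side.
image_tokens = (384 // 16) ** 2
print(image_tokens)  # 576
```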
DeepSeek-AI Released Janus: A 1.3B Multimodal Model with Image Generation Capabilities
Built on DeepSeek-LLM-1.3b-base, the model is trained on about 500B text tokens
Image generation uses a specific tokenizer with a downsampling rate of 16
Key Features:
1️⃣ Decoupled visual… https://t.co/waf1901wIZ
NEW multimodal AI framework from the Chinese community 🚀
Janus 🔥 a NEW autoregressive framework for multimodal AI just dropped by @deepseek_ai
Model: https://t.co/f8fIGugLDY
Paper: https://t.co/vMl62ZcPrX
✨ Decoupled Visual Encoding
✨ Single Transformer Architecture
✨…
Janus from DeepSeek joins Chameleon+Anole (GAIR) and Emu-3 (BAAI) in multimodal understanding + generation
Input ← image+text
Output → image+text
Do we need a nickname/acronym for that now VLIM? (Visual Language Image Model?) 😅 https://t.co/u0qK6qwvh6