DeepSeek has launched Janus-Pro, a new open-source multimodal large language model (MLLM) whose flagship variant has 7 billion parameters. Built on the DeepSeek-LLM-1.5b-base and DeepSeek-LLM-7b-base architectures, Janus-Pro handles both image understanding and text-to-image generation: it uses the SigLIP-L vision encoder for understanding, which supports 384x384 image inputs, and a separate image tokenizer with a downsample rate of 16 for generation. Early benchmarks indicate that Janus-Pro outperforms existing image-generation models such as Stable Diffusion and OpenAI's DALL-E 3 on text-to-image benchmarks. The advancements in Janus-Pro highlight the ongoing evolution of multimodal AI technologies, which integrate multiple forms of data for improved performance.
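For a sense of scale, the stated spec pins down the image token budget: a 384x384 input at a downsample rate of 16 corresponds to a 24x24 grid, i.e. 576 image tokens per image. A minimal sketch of that arithmetic (the function name is illustrative, not from the Janus-Pro codebase):

```python
def image_token_count(resolution: int = 384, downsample_rate: int = 16) -> int:
    """Tokens per square image: the spatial grid is (resolution // downsample_rate)^2."""
    grid_side = resolution // downsample_rate
    return grid_side * grid_side

# 384x384 at a downsample rate of 16 -> 24x24 grid -> 576 image tokens
print(image_token_count())  # 576
```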
Mixture-of-Mamba (MoM) from @Stanford, @CarnegieMellon and @AIatMeta expands the benefits of Transformers to State Space Models (SSMs), making them better for multimodal tasks. MoM selects the best processing pathways for text, images, or speech dynamically, using modality-aware… https://t.co/ju79P2CqQq
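For intuition on what modality-aware routing can look like in practice, here is a minimal PyTorch sketch of the general idea, not the paper's implementation: tokens are dispatched to modality-specific projection weights (text, image, speech) while the rest of the block stays shared. Module and argument names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModalityAwareProjection(nn.Module):
    """Routes each token to a projection owned by its modality."""

    def __init__(self, d_model: int, num_modalities: int = 3):
        super().__init__()
        # One projection per modality (e.g. 0 = text, 1 = image, 2 = speech).
        self.projs = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_modalities)]
        )

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); modality_ids: (batch, seq_len) integer labels.
        out = torch.zeros_like(x)
        for m, proj in enumerate(self.projs):
            mask = modality_ids == m          # tokens belonging to modality m
            if mask.any():
                out[mask] = proj(x[mask])     # apply modality-specific parameters
        return out

# Usage: a mixed sequence of 4 text tokens followed by 4 image tokens.
x = torch.randn(1, 8, 64)
modality_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])
y = ModalityAwareProjection(d_model=64)(x, modality_ids)
print(y.shape)  # torch.Size([1, 8, 64])
```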
it's crazy to me that no one has figured out how to make smarter models using non-text data. you'd think that giving language models the power to see and learn from billions of internet images + videos would make them smarter overall, but it doesn't. or at least hasn't so far
new text-image models like DeepSeek Janus (and most current multimodal systems) are still supremely inelegant, messy frankensteins of unrelated components. perhaps they are an act against god, something that never should've been