Google has launched Gemma 3n, a multimodal model designed for mobile and edge devices. Gemma 3n can process text, images, audio, and video, and operates efficiently on devices with as little as 2GB of RAM. The model reduces RAM usage by nearly 3x, responds roughly 1.5x faster on mobile, and is available in early preview; it will be integrated into the Android and Chrome platforms.

Sarvam AI, an Indian startup, has released Sarvam-M, a 24-billion-parameter open-weights hybrid language model built on Mistral Small. Sarvam-M performs strongly on math, programming, and Indian-language tasks, improving by over 86% on the romanised Indian-language GSM-8K benchmark. It outperforms Llama-4 Scout, is comparable to larger models such as Llama-3.3 70B, and is accessible via API and available for download.

Nvidia has introduced AceReason-Nemotron, a family of reasoning models built on the DeepSeek-R1-Distill-Qwen models and available in 7B and 14B variants. The models advance math and code reasoning through reinforcement learning applied in two stages, first on math-only prompts and then on code-only prompts, yielding substantial improvements in benchmark accuracy (a minimal sketch of this staged recipe follows at the end of this digest).

ByteDance has released BAGEL, a unified multimodal AI model that matches GPT-4o and Gemini 2.0 capabilities with only 7 billion parameters. The fully open-source model supports multiple modalities and includes a 'Thinking mode' for advanced reasoning.

Additional research includes Meta-PerSER, a personalized speech emotion recognition framework using meta-learning, and Adaptive Cognition Policy Optimization (ACPO), a reinforcement learning approach for efficient large language model reasoning.
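The math-then-code RL recipe described above is simple enough to show in outline. Below is a minimal, hypothetical sketch of such a staged curriculum: `rl_update`, the reward stubs, and all other names are placeholders standing in for a real policy-gradient step (e.g., PPO or GRPO) and real verifiers, not Nvidia's actual pipeline.

```python
import random
from typing import Callable, List

def sample(prompts: List[str], k: int = 4) -> List[str]:
    """Draw a small batch of prompts (placeholder for the real sampler)."""
    return random.sample(prompts, k=min(k, len(prompts)))

def rl_update(model: dict, batch: List[str], reward_fn: Callable[[str], float]) -> None:
    """Stand-in for one policy-gradient step; here it only tracks reward."""
    rewards = [reward_fn(p) for p in batch]
    model["steps"] += 1
    model["avg_reward"] = sum(rewards) / len(rewards)

# Verifiable-reward stubs: exact-answer checking for math, unit tests for code.
def verify_math_answer(prompt: str) -> float:
    return 1.0  # 1.0 if the sampled solution's final answer matches, else 0.0

def run_unit_tests(prompt: str) -> float:
    return 1.0  # fraction of hidden unit tests the sampled program passes

def staged_rl_curriculum(model, math_prompts, code_prompts,
                         math_steps=1000, code_steps=1000):
    # Stage 1: reinforcement learning on math-only prompts.
    for _ in range(math_steps):
        rl_update(model, sample(math_prompts), reward_fn=verify_math_answer)
    # Stage 2: continue RL on code-only prompts.
    for _ in range(code_steps):
        rl_update(model, sample(code_prompts), reward_fn=run_unit_tests)
    return model

model = {"steps": 0, "avg_reward": 0.0}
staged_rl_curriculum(model,
                     math_prompts=["Compute 17 * 24.", "Solve x + 3 = 10."],
                     code_prompts=["Write a function that reverses a string."],
                     math_steps=3, code_steps=3)
```

Keeping the two reward signals in separate sequential stages, rather than mixing math and code prompts in one run, is the distinctive design choice the digest item describes.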
LLMs converge to a universal embedding space. This space is a geometric representation of the concept space used by human brains, as manifested in human language, i.e., the pretraining data: "a universal semantic structure conjectured by the Platonic Representation Hypothesis." https://t.co/0184v2gCUl
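One crude way to probe this claim (not the linked paper's method) is to check whether two independently trained embedding models assign the same relative geometry to the same sentences: if their pairwise-similarity structures correlate strongly, the spaces are at least loosely aligned. This sketch assumes the `sentence-transformers` and `scipy` packages; the two model names are just example choices.

```python
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

sentences = [
    "The cat sat on the mat.",
    "A dog chased the ball.",
    "Stock markets fell sharply today.",
    "The theorem follows by induction.",
    "She cooked pasta for dinner.",
]

def similarity_matrix(model_name: str) -> np.ndarray:
    """Encode the sentences and return their cosine-similarity matrix."""
    emb = SentenceTransformer(model_name).encode(sentences, normalize_embeddings=True)
    return emb @ emb.T

# Two independently trained embedding models.
sim_a = similarity_matrix("all-MiniLM-L6-v2")
sim_b = similarity_matrix("all-mpnet-base-v2")

# Compare only the off-diagonal structure: do both models agree on
# which sentence pairs are similar and which are not?
iu = np.triu_indices(len(sentences), k=1)
rho, _ = spearmanr(sim_a[iu], sim_b[iu])
print(f"Spearman correlation of pairwise similarities: {rho:.3f}")
```

A high correlation only shows that both models rank sentence pairs similarly, which is a much weaker statement than the hypothesis itself, but it is the cheapest sanity check.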
I spent a few hours today trying to reverse engineer Google's new Gemma 3n model, which was published to HuggingFace as a compiled binary. I wanted to figure out how exactly they cram a model supposedly within striking distance of Claude 3.7 Sonnet on LMArena into 2GB of RAM.
Large language models that perform explicit reasoning during inference degrade Vision-Language Navigation accuracy. The paper proposes Aux-Think, which uses Chain-of-Thought supervision during training so models internalize reasoning patterns, then predicts actions directly at inference. https://t.co/Ev2EjOzf6k
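A minimal PyTorch sketch of this train-time-only reasoning idea (my reading of the abstract, not the authors' code): an auxiliary head is supervised on Chain-of-Thought tokens during training while the action head is supervised on actions; at inference only the action head runs. All module names and dimensions below are hypothetical, and the single-token CoT head is a stand-in for a full sequence decoder.

```python
import torch
import torch.nn as nn

class AuxThinkPolicy(nn.Module):
    """Shared encoder with two heads: an action head (always used) and an
    auxiliary CoT head that provides training-time supervision only."""

    def __init__(self, obs_dim=512, hidden=256, n_actions=6, vocab=32000):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.action_head = nn.Linear(hidden, n_actions)  # used at inference
        self.cot_head = nn.Linear(hidden, vocab)         # training-only auxiliary

    def forward(self, obs):
        h = self.encoder(obs)
        return self.action_head(h), self.cot_head(h)

    @torch.no_grad()
    def act(self, obs):
        # Inference path: predict the action directly, no reasoning tokens.
        return self.action_head(self.encoder(obs)).argmax(dim=-1)

def training_step(model, obs, action_target, cot_target, aux_weight=0.5):
    action_logits, cot_logits = model(obs)
    loss_action = nn.functional.cross_entropy(action_logits, action_target)
    loss_cot = nn.functional.cross_entropy(cot_logits, cot_target)  # CoT supervision
    return loss_action + aux_weight * loss_cot

# Dummy batch to show the shapes.
model = AuxThinkPolicy()
obs = torch.randn(8, 512)
loss = training_step(model, obs,
                     action_target=torch.randint(0, 6, (8,)),
                     cot_target=torch.randint(0, 32000, (8,)))
loss.backward()
print(model.act(obs))  # inference touches only the encoder and action head
```

The point of the split is that the reasoning signal shapes the shared encoder during training, while inference pays none of the latency or error-accumulation cost of generating reasoning tokens.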