
Google's Gemma model is showing its versatility and performance across a wide range of platforms. The 2B and 7B variants now run on devices from iPhones and Android phones like the Samsung S23 down to 5-year-old Windows laptops with no GPU acceleration, thanks to updates and contributions from developers across the globe. Reported throughput varies with hardware and setup: Gemma 2B reaches 22+ tokens per second on an iPhone, over 475 tokens per second with JAX and Transformers on TPUs (up to 650 tokens per second in some TPU configurations), and more than 2,500 tokens per second on an H100. Gemma has also been integrated into several platforms and APIs, including MLC LLM for Android and web browsers, KerasNLP, and Anyscale Endpoints, where it compares favorably against other leading models such as Mistral-7B. Open-source contributions continue to improve its accessibility and efficiency, notably optimizations from the MLX community and the Ollama v0.1.27 release, which made the model up to 100x more usable on that aging GPU-less laptop.
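
Of those integrations, KerasNLP is the most turnkey from Python. Here is a minimal sketch of that path, assuming the keras-nlp package with the `gemma_2b_en` preset and Kaggle credentials that have been granted access to the Gemma weights:

```python
# A minimal sketch of the KerasNLP path (assumes the keras-nlp package
# with Gemma presets and Kaggle credentials approved for Gemma weights).
import os

# Pick the JAX backend before any Keras import; this is the stack
# behind the TPU throughput numbers cited above.
os.environ["KERAS_BACKEND"] = "jax"

import keras_nlp

# Downloads the 2B preset on first use, then caches it locally.
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

# max_length caps the combined prompt + completion token count.
print(gemma_lm.generate("What makes Gemma fast on TPUs?", max_length=64))
```

Because Keras 3 is multi-backend, the same preset name should also work with the TensorFlow or PyTorch backends by changing the environment variable.
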

Besides Android, iOS, and web browsers, Gemma is also supported in MLC LLM on various GPUs! A single model definition does it all -- thanks to the ML compiler infra led by @junrushao and many others! Try it in Google Colab: https://t.co/Xh25ZFAXch https://t.co/mPAzahVb3a
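
For the curious, the Python side of that "single model definition" looks roughly like the sketch below. It assumes the mlc-chat package is installed with its TVM runtime, and the model id follows MLC's q4f16_1 naming convention but is an assumption, not a confirmed path; the linked Colab has the authoritative setup.

```python
# A sketch of MLC LLM's Python chat API (assumes the mlc-chat package
# is installed; the model id below follows MLC's q4f16_1 naming
# convention for 4-bit weights but is an assumption, not a known path).
from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

# One compiled model definition serves CUDA, Vulkan, Metal, etc.
cm = ChatModule(model="gemma-2b-it-q4f16_1")

# Stream tokens to stdout as they are generated.
cm.generate(
    prompt="Summarize what an ML compiler does.",
    progress_callback=StreamToStdout(callback_interval=2),
)
```
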
https://t.co/in4MFlYKaW now adds Gemma from @GoogleDeepMind! The 2b model is perfect for building in-browser agents with @WebGPU acceleration -- everything local! Here is a 1x speed demo of 4-bit quantized gemma-2b-it on @GooglePixel_US 7 Pro with @googlechrome. https://t.co/7gKTZYD1FH
thrilled to find that @ollama 0.1.27 is now 100x more usable on a 5-year-old Windows laptop with no GPU. https://t.co/RpJpfuDJjU
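
For the Ollama route, the CLI (`ollama run gemma:2b`) is the usual entry point, but the local REST API makes the same CPU-only setup scriptable. A minimal sketch, assuming Ollama v0.1.27+ is serving on its default port and `ollama pull gemma:2b` has already fetched the model:

```python
# Query a locally running Ollama server (default port 11434) over its
# REST API; assumes `ollama pull gemma:2b` has already been run.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "gemma:2b",
        "prompt": "Why can this run without a GPU?",
        "stream": False,  # return one JSON object instead of a stream
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```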