
Google's Gemma model, released in 2B and 7B versions, has emerged as a significant entrant in the open-source generative Large Language Model (LLM) space. The Gemma 2B version is notably fast, achieving 22+ tokens per second on an iPhone and over 475 tokens per second on TPU v2, with reports of up to 650 tokens per second under favorable conditions. Served via JAX and Transformers on TPUs, it is reported to be up to 4x faster than PyTorch on an A100. Gemma's reach extends to Android through MLC LLM, demonstrated by a 4-bit quantized Gemma-2b model running on a Samsung S23, and platform support continues to grow, with Gemma-7b-it, Gemma-2b-it, Gemma-7b, and Gemma-2b now available on the Together API and Anyscale Endpoints. Developers point to the integration work and the model's architecture, notably its large vocabulary, as key factors in its performance. Gemma has also been used to build in-browser agents with WebGPU acceleration, demonstrated in Google Chrome on a Google Pixel 7 Pro with everything running locally, underscoring its versatility and potential for on-device AI.
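
Since the summary attributes much of the 2B model's throughput to the JAX path, here is a minimal sketch of that route. It assumes a transformers release that includes the Flax Gemma port, access to the gated `google/gemma-2b-it` checkpoint, and a JAX-visible accelerator; the prompt and generation length are arbitrary:

```python
# Minimal sketch: running Gemma 2B through the JAX/Flax path in Transformers.
# Assumes a transformers version with Flax Gemma support and a JAX accelerator.
import jax.numpy as jnp
from transformers import AutoTokenizer, FlaxGemmaForCausalLM

model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = FlaxGemmaForCausalLM.from_pretrained(model_id, dtype=jnp.bfloat16)

inputs = tokenizer("Why is the sky blue?", return_tensors="np")
# generate() is JIT-compiled by JAX on the first call; subsequent calls reuse
# the compiled program, which is where the reported TPU throughput comes from.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```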

Google's new Gemma LLM is outperforming Mistral-7B in various benchmarks, including:
- Question Answering
- Math/Science
- Reasoning
- Coding
Live comparison of Google's Gemma vs. Mistral-7B: https://t.co/UyeC0K290Q
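
The tweet's live comparison runs in a hosted playground; as a rough local approximation, the sketch below sends the same prompt to both instruction-tuned models via Hugging Face Transformers. The model IDs, prompt, and generation settings are assumptions, not the tweet's actual harness:

```python
# Side-by-side prompt comparison of Gemma 7B and Mistral 7B (instruct variants).
# Checkpoint IDs and settings below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = "A train travels 120 km in 1.5 hours. What is its average speed?"

for model_id in ["google/gemma-7b-it", "mistralai/Mistral-7B-Instruct-v0.2"]:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # Both models ship chat templates, so the same code covers both formats.
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": PROMPT}],
        return_tensors="pt",
        add_generation_prompt=True,
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=128)
    print(f"=== {model_id} ===")
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```
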
Gemma is now available for use and fine-tuning on Lightning Studios. Shout out to Google for joining the open source AI effort. Great explanation by @rasbt https://t.co/upGWrLl8UF
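
For readers who want a sense of what fine-tuning Gemma involves, here is a generic LoRA sketch using Hugging Face peft and transformers. This is not the Lightning Studios recipe; the dataset, adapter targets, and hyperparameters are illustrative placeholders:

```python
# Generic LoRA fine-tuning sketch for Gemma 2B with peft + transformers.
# Not the Lightning Studios recipe; dataset and hyperparameters are placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Attach low-rank adapters to the attention projections, keeping base weights frozen.
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a small fraction of weights train

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Small public dataset slice as a stand-in for real fine-tuning data.
dataset = load_dataset("imdb", split="train[:1000]").map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gemma-lora",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           num_train_epochs=1,
                           learning_rate=2e-4,
                           logging_steps=10),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
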
Google's Gemma has been the topic of the week for both LLM researchers and users. My colleagues and I just ported the code to LitGPT, and we discovered some interesting surprises and model architecture details along the way: 1) Gemma uses a really large vocabulary and… https://t.co/PI5IXqYZh0
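
To put the large-vocabulary observation in concrete terms: Gemma's tokenizer has a 256,000-entry vocabulary, roughly 8x the 32,000 tokens used by Llama 2 and Mistral 7B. A quick way to check, assuming access to the gated repositories has been granted:

```python
# Compare tokenizer vocabulary sizes: Gemma's ~256,000 entries vs.
# Mistral-7B's 32,000. Assumes access to the gated Hugging Face repos.
from transformers import AutoTokenizer

for model_id in ["google/gemma-2b", "mistralai/Mistral-7B-v0.1"]:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(f"{model_id}: vocab size = {len(tokenizer)}")
```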