It's so handy to have an environment for working with the Gemma 3 12B QAT models, and other open models, right on my MacBook. Thanks, @lmstudio! Also quite like the design, and the hints: "Context is 2.3% full", "Tokens: 93/4096", advanced configuration settings, TTFT, etc. https://t.co/Bl2QBnLJ6z
AI-generated videos now possible with gaming GPUs with just 6GB of VRAM https://t.co/4EJmePn0tK
These GPU architectures influence model architectures and sizes. What can we expect? https://t.co/hzb6PG80P1
Google has introduced new Quantization-Aware Training (QAT) versions of its Gemma 3 language models that significantly reduce memory requirements while maintaining model quality. The 27-billion-parameter Gemma 3 model's weight footprint drops from 54GB to 14.1GB of VRAM, enough to run on consumer-grade GPUs such as the NVIDIA RTX 3090 rather than data-center hardware like the H100. Because QAT integrates quantization into training itself, it avoids much of the quality loss typically associated with post-training quantization, reducing the perplexity drop by 54%.

The QAT checkpoints are available in 1B, 4B, 12B, and 27B parameter sizes and are supported by tools such as Ollama, Hugging Face, MLX, llama.cpp, and LM Studio. The openly available weights and lower hardware requirements are expected to broaden adoption and enable new AI product development. Google executives, including Sundar Pichai and Jeff Dean, have highlighted the significance of the release in making advanced AI models more accessible and efficient for developers and users.
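The quoted memory savings follow directly from the change in bytes per weight. Here is a minimal back-of-the-envelope sketch of that arithmetic; the 54GB and 14.1GB figures come from the announcement, while the small int4 overhead factor is my own illustrative assumption, not a published number:

```python
# Rough check of the VRAM figures quoted above: weight memory only,
# ignoring KV cache and activations, which add to the total at inference time.

BYTES_PER_PARAM = {
    "bf16": 2.0,   # 16-bit weights, as loaded without quantization
    "int4": 0.5,   # 4-bit QAT weights
}

def weight_memory_gb(n_params: float, dtype: str, overhead: float = 1.0) -> float:
    """Approximate weight memory in decimal GB for a given parameter count."""
    return n_params * BYTES_PER_PARAM[dtype] * overhead / 1e9

GEMMA3_27B = 27e9  # parameter count of the largest Gemma 3 model

print(f"bf16: {weight_memory_gb(GEMMA3_27B, 'bf16'):.1f} GB")        # ~54.0 GB
print(f"int4: {weight_memory_gb(GEMMA3_27B, 'int4', 1.04):.1f} GB")  # ~14.0 GB, near the quoted 14.1 GB
```

Note that this covers the weights alone; the context window (KV cache) and activations consume additional memory on top of these figures.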