It's so handy to have an environment for working with the Gemma 3 12B QAT models, and other open models, right on my MacBook. Thanks, @lmstudio! Also quite like the design, and the hints: "Context is 2.3% full", "Tokens: 93/4096", advanced configuration settings, TTFT, etc. https://t.co/Bl2QBnLJ6z
AI-generated videos now possible with gaming GPUs with just 6GB of VRAM https://t.co/4EJmePn0tK
These GPU architectures influence model architectures and sizes. What can we expect? https://t.co/hzb6PG80P1
Google has introduced new Quantization-Aware Training (QAT) versions of its Gemma 3 language models that significantly reduce memory requirements while maintaining model quality. The 27-billion-parameter Gemma 3 model's weight footprint drops from 54GB to 14.1GB of VRAM, enough to run on consumer-grade GPUs such as the NVIDIA RTX 3090 rather than data-center hardware like the H100. Because QAT integrates quantization into training itself, it avoids much of the quality loss typically associated with post-training quantization, reducing the perplexity drop by 54%.

The QAT checkpoints are available in 1B, 4B, 12B, and 27B parameter sizes and are supported by tools such as Ollama, Hugging Face, MLX, llama.cpp, and LM Studio. The openly available weights and lower hardware requirements are expected to broaden adoption and enable new AI product development. Google executives, including Sundar Pichai and Jeff Dean, have highlighted the significance of the release in making advanced AI models more accessible and efficient for developers and users.
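The quoted memory savings follow directly from the change in bytes per weight. Here is a minimal back-of-the-envelope sketch of that arithmetic; the 54GB and 14.1GB figures come from the announcement, while the small int4 overhead factor is my own illustrative assumption, not a published number:

```python
# Rough check of the VRAM figures quoted above: weight memory only,
# ignoring KV cache and activations, which add to the total at inference time.

BYTES_PER_PARAM = {
    "bf16": 2.0,   # 16-bit weights, as loaded without quantization
    "int4": 0.5,   # 4-bit QAT weights
}

def weight_memory_gb(n_params: float, dtype: str, overhead: float = 1.0) -> float:
    """Approximate weight memory in decimal GB for a given parameter count."""
    return n_params * BYTES_PER_PARAM[dtype] * overhead / 1e9

GEMMA3_27B = 27e9  # parameter count of the largest Gemma 3 model

print(f"bf16: {weight_memory_gb(GEMMA3_27B, 'bf16'):.1f} GB")        # ~54.0 GB
print(f"int4: {weight_memory_gb(GEMMA3_27B, 'int4', 1.04):.1f} GB")  # ~14.0 GB, near the quoted 14.1 GB
```

Note that this covers the weights alone; the context window (KV cache) and activations consume additional memory on top of these figures.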