
Recent work on machine-learning inference has focused on improving speed and reducing memory usage through quantization. Cartesia AI's on-device models, such as Rene 1.3B, are now available on macOS with MLX 4-bit quantization. Combining transformers with torchao yields roughly 2x faster inference and about 4x lower memory usage, supports 4-bit and 8-bit quantization, remains compatible with torch.compile, and incurs minimal loss in quality; it can be tried today on Whisper and Llama-style models. The Low-bit Quantized Open LLM Leaderboard shows the auto-round quantization library to be highly competitive. Finally, MLX now runs models on MacBooks roughly 2x faster thanks to multi-threading, as demonstrated on an M3 with Gemma-2B.
MLX keeps improving! Running models on MacBook now 2x faster with multi-threading. Here on my M3 and Gemma-2B. Impressive! https://t.co/SnLoonS31x https://t.co/T7Nriy39n9
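For readers who want to reproduce this kind of local run, here is a minimal sketch using the mlx-lm package on Apple silicon. The 4-bit Gemma-2B repo id below is an assumption; check the mlx-community hub for the exact conversion name.

```python
# Minimal MLX generation sketch (assumes `pip install mlx-lm` on Apple silicon).
# The repo id is an assumption; browse the mlx-community org for the actual
# 4-bit Gemma-2B conversion.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-2b-it-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Explain 4-bit quantization in one sentence.",
    max_tokens=100,
)
print(text)
```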
Low-bit Quantized Open LLM Leaderboard: the auto-round quantization lib is very competitive. https://t.co/DFYJ7ZyHno
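As a rough sketch of how a model is quantized with auto-round, under the assumption that the library's AutoRound class takes the model, tokenizer, and bit-width as shown (verify the exact constructor arguments against the auto-round README for the version you install):

```python
# Hedged sketch: 4-bit quantization with Intel's auto-round library.
# Parameter names (bits, group_size) and the save method are assumptions;
# consult the auto-round docs for the current API.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative model id
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./llama-2-7b-autoround-4bit")
```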
transformers + torchao = 🔥
1. ~2x faster inference
2. ~4x lower memory usage
3. 4-bit + 8-bit quantization support
4. compatible with torch.compile
all with minimal loss in performance ⚡
bonus - you can try it on whisper & llama-like models today w/ transformers 🤗 https://t.co/JjWF1L3JdP
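A minimal sketch of the transformers + torchao path, assuming a recent transformers release with the TorchAoConfig integration and torchao installed; the model id is illustrative, not prescribed by the source:

```python
# Sketch: 4-bit weight-only quantization via transformers' torchao integration.
# Requires `pip install torchao` and a recent transformers; model id is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "meta-llama/Llama-3.1-8B"
quant_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The quantized model stays compatible with torch.compile.
model.forward = torch.compile(model.forward, mode="max-autotune")

inputs = tokenizer("Quantization lets us", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```

The same quantization_config route applies to Whisper-style models through their respective Auto classes, which is what makes the combination easy to try today.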