
Tech companies like Clarifai, Google Devs, OpenRouterAI, and others are introducing new models and APIs aimed at faster, more efficient Large Language Models (LLMs). Recent announcements include 1-bit LLMs, the MediaPipe LLM Inference API, Nitro models, and updated models in MLX LM with faster attention mechanisms. These releases promise significant speed improvements, reduced GPU memory usage, and better energy efficiency.
Announcing AQLM v1.1! Featuring:
1. New model collection with SOTA accuracy: https://t.co/xHiCxr2t2S
2. Gemma-2B support, running within 1.5GB
3. LoRA integration for training Mixtral-8x7B on Colab
4. 3x faster generation via CUDA graphs
Check it out: https://t.co/T4fYggSEBm
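For readers who want to try one of those checkpoints, the sketch below shows how an AQLM-quantized model can be loaded through Hugging Face transformers. The model id and the assumed install of aqlm and accelerate are illustrative and not taken from the announcement.

```python
# Minimal sketch: loading an AQLM-quantized model with Hugging Face transformers.
# Assumed setup (not from the tweet): pip install transformers accelerate aqlm[gpu]
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/gemma-2b-AQLM-2Bit-1x16-hf"  # hypothetical example checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the compressed weights (well under 2 GB here) on the GPU.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Extreme quantization is useful because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```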
Updated some models in MLX LM to use the new fast attention (h/t @argmaxinc):
pip install -U mlx-lm
4-bit Mixtral (~45B) is quite fast now on an M2 Ultra, even for thousands of tokens: https://t.co/xDC095e8M0
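A minimal sketch of generating with mlx-lm after that upgrade, assuming a 4-bit community conversion of Mixtral from the mlx-community Hugging Face organization (the exact repo name is an assumption):

```python
# Minimal sketch: running a 4-bit Mixtral with mlx-lm on Apple silicon.
# Assumes mlx-lm is installed/updated via: pip install -U mlx-lm
from mlx_lm import load, generate

# Assumed example repo; any mlx-format 4-bit checkpoint works the same way.
model, tokenizer = load("mlx-community/Mixtral-8x7B-Instruct-v0.1-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Explain in two sentences why fused attention helps long prompts.",
    max_tokens=256,
)
print(text)
```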
LLMs are faster and more memory efficient in MLX!
- All quantized models are 30%+ faster (h/t @angeloskath)
- Fused attention for longer contexts can be 2x+ faster and uses far less memory (h/t @bpkeene @atiorh @argmaxinc)
Some tokens-per-second benchmarks for 7B Mistral: https://t.co/co1wii9fY9
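To get a rough tokens-per-second figure like the 7B Mistral benchmarks mentioned above, a quick timing loop around mlx-lm generation works; the repo name is again an assumption, and absolute numbers depend on the machine:

```python
# Rough throughput check for a 4-bit 7B Mistral in mlx-lm (repo name assumed).
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.2-4bit")

prompt = "Summarize the benefits of fused attention in two sentences."
start = time.time()
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.time() - start

# generate() returns only the newly generated text, so count its tokens directly.
n_generated = len(tokenizer.encode(text))
print(f"~{n_generated / elapsed:.1f} generated tokens/sec")
```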




