The on-device AI framework ecosystem is advancing rapidly with the release and optimization of several large language models (LLMs). Key developments include llama.cpp, which supports Whisper, LLMs, and VLMs across backends such as Metal and CUDA, and the MLC framework, which deploys LLMs across platforms, notably WebGPU. NVIDIA has highlighted LLM acceleration with llama.cpp on RTX systems, while LM Studio 0.3.4 now ships with Apple MLX, running Llama 3.2 1B at approximately 250 tokens per second on M3 Apple Silicon Macs. Meta's (@AIatMeta) Llama 3.2 Vision multimodal LLM, which handles both text and images, is now available for building chatbots and can be deployed with LitServe for high performance. NVIDIA's ongoing optimizations for leading LLMs target high throughput and low latency, including a 1.5x throughput increase for Llama 3.1 405B on NVIDIA HGX H200 systems and a 1.2x speedup in the MLPerf Inference v4.1 benchmark. The DeepLearning.AI course on Llama 3.2 covers its multimodal capabilities and Llama Stack, while SambaNova Cloud offers fast inference speeds for developing with Llama 3.2. AI News, produced by Swyx, editor of the Latent Space Podcast, uses Llama models to reach over 3 million viewers.
Learn how the right parallelism technique increases #Llama 3.1 405B performance by 1.5x in throughput-sensitive scenarios on an NVIDIA HGX H200 system with NVLink and NVSwitch, and enables a 1.2x speedup in the MLPerf Inference v4.1 Llama 2 70B benchmark. https://t.co/X58zamcbD4
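The "right parallelism technique" here refers to how a model's weights are partitioned across GPUs. One common choice, tensor parallelism, splits each weight matrix column-wise so every GPU computes a shard of the output in parallel, after which the shards are gathered. The sketch below is a toy, pure-Python illustration of that idea under simplified assumptions (dense lists instead of GPU tensors, a sequential loop standing in for parallel devices); it is not NVIDIA's implementation, and the helper names are hypothetical.

```python
# Toy sketch of column-wise tensor parallelism (hypothetical names).
# The key property: concatenating the per-"device" partial outputs
# reproduces the full matrix multiply exactly.

def matmul(x, w):
    """x: vector of length k; w: k x n matrix -> output vector of length n."""
    n = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n)]

def split_columns(w, parts):
    """Split w column-wise into `parts` shards, one per simulated device."""
    n = len(w[0])
    step = n // parts  # assumes n divides evenly, for simplicity
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

def tensor_parallel_matmul(x, w, parts):
    shards = split_columns(w, parts)
    # In a real system each shard's matmul runs on its own GPU concurrently.
    outputs = [matmul(x, shard) for shard in shards]
    # All-gather: concatenate the partial outputs.
    return [v for out in outputs for v in out]
```

Because each shard's multiply is independent, the per-layer work (and weight memory) divides across GPUs, which is what drives the throughput gains on NVLink/NVSwitch-connected systems like HGX H200, where the gather step is cheap.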
Boost your application performance and ROI with our ongoing optimizations of leading #LLMs, designed to deliver high throughput and low latency for real-time demands. Learn more. https://t.co/zfopQtBIN5
Check out this 10-minute update on vLLM v0.6.2! Hear from @mgoin_ as we dive into the latest features, including: ✅ Llama 3.2 Vision support ✅ MQLLMEngine for API Server ✅ Beam search externalization Watch the full breakdown 👇 https://t.co/mn1gaNd248
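"Beam search externalization" means vLLM moved beam search out of the core engine into a separate layer. The underlying algorithm is standard: keep the top-k highest-scoring partial sequences at each decoding step, expanding and re-pruning them. Below is a minimal, self-contained sketch of that algorithm, not vLLM's code; for simplicity it scores tokens from a fixed per-step log-probability table rather than conditioning on the prefix as a real LM would.

```python
# Minimal beam search sketch (not vLLM's implementation).
# step_logprobs[t][token] is the log-probability of `token` at step t;
# a real LM would recompute these conditioned on each beam's prefix.

def beam_search(step_logprobs, beam_width):
    """Return (best_sequence, total_logprob) over the given steps."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    for logprobs in step_logprobs:
        # Expand every beam by every candidate token.
        candidates = [
            (seq + [tok], score + lp)
            for seq, score in beams
            for tok, lp in enumerate(logprobs)
        ]
        # Prune back down to the top beam_width sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]
```

Pulling this loop out of the engine lets the scheduler treat each beam as an ordinary decoding request, which is the design motivation behind externalizing it.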