Llama3 70B on a 4GB GPU, Llama3.1 405B on an 8GB GPU with the AirLLM lib. Without quantization, distillation, or pruning. 🔥

💡 Key Features:
- Supports Llama, ChatGLM, QWen, Baichuan, Mistral, InternLM
- 4-bit/8-bit compression: 3x inference speedup
- macOS support on Apple Silicon
-… https://t.co/RaPLkoqYO0
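The trick behind running a 405B model on an 8GB GPU is layered inference: only one transformer layer's weights are resident at a time, so peak memory is bounded by a single layer rather than the whole model. Here is a minimal toy sketch of that idea in plain Python; the "layers" and "weights" are stand-ins, and none of this is AirLLM's actual API.

```python
# Toy sketch of layered (layer-by-layer) inference, the idea AirLLM uses:
# load one layer, run it, free it, load the next. Peak resident layers = 1.
# The weight loading and layer math below are illustrative stand-ins only.

def load_layer(layer_id):
    """Pretend to load layer weights from disk; returns a toy scale factor."""
    return 1.0 + 0.1 * layer_id

def run_layer(weights, activations):
    """Toy layer: scales every activation by the layer's weight."""
    return [a * weights for a in activations]

def layered_inference(activations, num_layers):
    peak_resident = 0
    for layer_id in range(num_layers):
        weights = load_layer(layer_id)   # only this layer is in memory now
        peak_resident = max(peak_resident, 1)
        activations = run_layer(weights, activations)
        del weights                      # freed before the next layer loads
    return activations, peak_resident

out, peak = layered_inference([1.0, 2.0], num_layers=3)
```

The cost of this memory bound is latency: every layer is fetched from disk (or CPU RAM) on every forward pass, which is why this approach targets "it runs at all" rather than fast serving.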
You can now optimize and make any open-source LLM faster:
1. pip install llmcompressor
2. apply quantization with 1 line of code

Two benefits:
1. Your LLM will run faster at inference time.
2. You will save a ton of money on hardware.

Here are a couple of examples: •… https://t.co/B37mrJO8TK
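What quantization actually does is map floating-point weights onto a small integer grid plus a scale factor. Below is a minimal sketch of symmetric 8-bit quantization in plain Python to show the transform; it is not llmcompressor's API, just the underlying idea.

```python
# Sketch of symmetric 8-bit weight quantization: store int8 values plus one
# float scale, cutting weight memory ~4x vs float32. Illustrative only.

def quantize_8bit(weights):
    scale = max(abs(w) for w in weights) / 127.0  # map max |w| to 127
    q = [round(w / scale) for w in weights]       # int8 codes in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.01, 1.0]
q, s = quantize_8bit(w)
w_hat = dequantize(q, s)   # approximate reconstruction of w
```

Real tools quantize per-channel or per-group rather than with one global scale, which keeps the rounding error small even when weight magnitudes vary across the tensor.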
🚀 Launching Qwen2.5-95B-Instruct! Built on Qwen2.5-72B-Instruct, it leverages layer similarity analysis: the first 40% of layers have the greatest influence on next-token prediction, potentially enabling even better performance. https://t.co/Rvo7p48yfk https://t.co/7ZFPBA0TCW
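A common way to do layer similarity analysis is to compare each layer's input and output hidden states: if they are nearly identical (cosine similarity close to 1), the layer barely transforms the representation and contributes little. A minimal sketch, with made-up layer names and toy vectors (this is an illustration of the general technique, not the Qwen team's exact method):

```python
import math

# Sketch of layer-similarity analysis: a layer whose output is almost the
# same direction as its input (cosine ~ 1) changes the hidden state little,
# making it a candidate for removal or duplication decisions.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hidden state before and after two hypothetical layers (toy 2-d vectors)
layer_io = {
    "layer_early": ([1.0, 0.0], [0.2, 0.9]),    # big change  -> important
    "layer_late":  ([1.0, 1.0], [1.01, 0.99]),  # tiny change -> low impact
}

low_impact = [name for name, (x, y) in layer_io.items() if cosine(x, y) > 0.99]
```

This matches the claim in the post: analyses of this kind tend to find that early layers reshape the hidden state the most, while many late layers are near-identity.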
Alibaba Cloud has launched Qwen2.5-72B and Qwen2-VL 72B on Tune Chat, marking a significant upgrade in knowledge, coding, and math skills. These models support 29 languages and can process long contexts of up to 128K tokens. Additionally, Qwen2.5-95B-Instruct has been introduced, leveraging layer similarity analysis to enhance performance. The AI community is also seeing advances in custom floating-point formats, such as Near-Lossless FP6 and FP2 through FP7, which improve throughput and save memory. Tools like llmcompressor are making open-source LLMs faster and more cost-effective by applying quantization with a single line of code. These developments also include macOS support on Apple Silicon.
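A custom low-bit float format like FP6 splits each value into sign, exponent, and mantissa bits, trading precision and range for memory. The sketch below encodes and decodes a hypothetical 6-bit layout (1 sign, 3 exponent, 2 mantissa bits, bias 3); the bit layout is a made-up example for illustration, not the published FP6 specification.

```python
# Sketch of a 6-bit float format: 1 sign bit, 3 exponent bits (bias 3),
# 2 mantissa bits. 64 codes total. Layout is a hypothetical example.

BIAS = 3

def decode_fp6(code):
    s = (code >> 5) & 1
    e = (code >> 2) & 0b111
    m = code & 0b11
    if e == 0:                               # subnormal: no implicit 1
        mag = (m / 4) * 2 ** (1 - BIAS)
    else:                                    # normal: implicit leading 1
        mag = (1 + m / 4) * 2 ** (e - BIAS)
    return -mag if s else mag

def encode_fp6(x):
    """Round to the nearest representable value by scanning all 64 codes."""
    return min(range(64), key=lambda c: abs(decode_fp6(c) - x))

x_hat = decode_fp6(encode_fp6(1.5))  # 1.5 is exactly representable here
```

With only 64 distinct values, weights must be rescaled per-group before encoding, which is why these formats are paired with scaling schemes in practice.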