
Alibaba Cloud has launched Qwen-2.5 72B and Qwen-2 VL 72B on Tune Chat, marking a significant upgrade in knowledge, coding, and math skills. These models support 29 languages and handle long contexts of up to 128K tokens. Additionally, Qwen2.5-95B-Instruct has been introduced, leveraging layer similarity analysis to enhance performance. The AI community is also seeing advances in custom floating-point formats, such as Near-Lossless FP6 and FP2 to FP7, which improve throughput and save memory. Tools like llmcompressor make open-source LLMs faster and more cost-effective by applying quantization with a single line of code, while the AirLLM library runs large Llama models on small GPUs and now adds macOS support on Apple Silicon.
Llama3 70B on a 4GB GPU, Llama3.1 405B on an 8GB GPU with the AirLLM lib, without quantization, distillation, or pruning. 🔥 💡 Key Features:
- Supports Llama, ChatGLM, QWen, Baichuan, Mistral, InternLM
- 4-bit/8-bit compression: 3x inference speedup
- MacOS support on Apple Silicon
-… https://t.co/RaPLkoqYO0
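The tweet doesn't quote the library's API; below is a minimal sketch of how AirLLM's layer-by-layer loading is typically used, based on its README. The model ID is illustrative, and the `compression` argument (for the optional 4-bit/8-bit block-wise compression) is an assumption:

```python
# Minimal AirLLM sketch: layers are streamed from disk one at a time,
# so a 70B model can run on a GPU with only a few GB of VRAM.
from airllm import AutoModel

model = AutoModel.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",  # illustrative model ID
    compression="4bit",  # assumed optional flag for block-wise compression (~3x speedup)
)

input_text = ["What is the capital of the United States?"]
input_tokens = model.tokenizer(
    input_text,
    return_tensors="pt",
    truncation=True,
    max_length=128,
)

output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)
print(model.tokenizer.decode(output.sequences[0]))
```

Note that the speedup comes from never holding the whole model in memory; generation is slower per token than a fully loaded model, but it fits on commodity hardware.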
You can now optimize and make any open-source LLM faster:
1. pip install llmcompressor
2. apply quantization with 1 line of code
Two benefits:
1. Your LLM will run faster at inference time.
2. You will save a ton of money on hardware.
Here are a couple of examples: •… https://t.co/B37mrJO8TK
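The examples in the tweet are truncated; here is a minimal sketch of one-shot weight quantization with llmcompressor, assuming the `oneshot` entry point and a `GPTQModifier` recipe, with an illustrative model ID and calibration dataset:

```python
# Sketch: one-shot W4A16 (4-bit weight) quantization with llmcompressor.
# Model ID, dataset, and sample counts are illustrative assumptions.
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Quantize all Linear layers to 4-bit weights, keeping the LM head in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",          # small calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="TinyLlama-1.1B-W4A16",
)
```

The saved checkpoint can then be served with an inference engine that understands compressed weights (e.g. vLLM), which is where the speed and hardware savings show up.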
🚀 Launching Qwen2.5-95B-Instruct! Built on Qwen2.5-72B-Instruct, it leverages layer similarity analysis. The first 40% of layers influence Next Token Prediction the most, potentially enabling even better performance. https://t.co/Rvo7p48yfk https://t.co/7ZFPBA0TCW
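The post doesn't spell out the layer similarity analysis; a common way to approximate it is to compare each layer's output hidden states with its input. The sketch below (with assumed loading parameters) uses Hugging Face Transformers to score per-layer change via cosine similarity; layers that barely change their inputs are the usual candidates for duplication or removal when resizing a model:

```python
# Hypothetical layer-similarity sketch (the exact method behind
# Qwen2.5-95B-Instruct is not described in the post).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-72B-Instruct"  # base model named in the post
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
with torch.no_grad():
    # hidden_states: tuple of (num_layers + 1) tensors, embeddings first
    hidden = model(**inputs, output_hidden_states=True).hidden_states

for i in range(1, len(hidden)):
    # Per-token cosine similarity between a layer's input and output, averaged
    sim = F.cosine_similarity(hidden[i - 1].float(), hidden[i].float(), dim=-1).mean()
    print(f"layer {i:3d}: similarity to previous hidden state = {sim:.4f}")
```

Under this kind of analysis, high-similarity layers change the residual stream the least, which is consistent with the claim that the early ~40% of layers matter most for next-token prediction.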
