Quantization Is All You Need. SOTA LLMs are too large to run on laptops. Quantization is a technique used to reduce LLMs' computational and memory requirements and create a smaller version of the model. Quantization is central to OSS progress. It involves converting the model's… https://t.co/bAnJLnONsW
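The thread is cut off, but the core idea is easy to sketch. Below is a minimal, illustrative example of symmetric per-tensor int8 post-training quantization in NumPy; the 8-bit width and the function names are assumptions chosen for illustration, not the specific scheme the tweet refers to.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: map float weights to
    integers in [-127, 127] plus a single float scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor for use at inference time."""
    return q.astype(np.float32) * scale

# A 16-bit weight matrix costs 2 bytes per weight; the int8 copy costs
# 1 byte plus one scale, roughly halving memory at some accuracy cost.
w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```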
[CL] Apple Intelligence Foundation Language Models https://t.co/O4KChBcb35 - The models introduced, AFM-on-device and AFM-server, are foundation language models designed to power Apple Intelligence features efficiently and responsibly. AFM-on-device is a ~3B parameter… https://t.co/kh6u4OmzgK
Can one achieve SOTA LLM performance at a much lower bitsize (=> lower memory/inference costs) than current (post-training) quantization? YES! - by training ternary LLMs - a "sweet spot" between underperforming binary and costly full-precision ones. Happy to announce our recently… https://t.co/siObHApQ9P
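For illustration, here is a hedged sketch of what "ternary" means for the weights: each value is mapped to {-1, 0, +1} plus one scale, in the spirit of BitNet-style absmean rounding. The announced work trains models under this constraint rather than rounding a finished model, so this snippet shows only the weight representation; the scaling rule and names are assumptions.

```python
import numpy as np

def ternarize(weights: np.ndarray):
    """Round-to-nearest ternary quantization: every weight becomes
    -1, 0, or +1, with one float scale per tensor (absmean scaling)."""
    scale = np.abs(weights).mean() + 1e-8
    t = np.clip(np.round(weights / scale), -1, 1).astype(np.int8)
    return t, scale

w = np.random.randn(1024, 1024).astype(np.float32)
t, s = ternarize(w)
# Ternary weights need only ~1.58 bits (log2(3)) each versus 16 bits,
# which is where the large memory/inference savings come from.
print("unique values:", np.unique(t), "scale:", s)
```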

Apple has introduced new foundation language models to power its Apple Intelligence features. The two models, AFM-on-device and AFM-server, are designed to operate efficiently and responsibly. AFM-on-device has approximately 3 billion parameters, which at 16 bits per weight works out to roughly 6 GB, and is compressed for on-device use with 'accuracy-recovering adapters': the base weights are quantized to a smaller footprint, and low-rank adapters are then trained for about 10 billion tokens to recover the performance lost to quantization. AFM-server is a larger model intended for server-side workloads. Apple's researchers emphasize that quantization is essential for reducing computational and memory requirements, making large language models practical on a wider range of devices.
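As a rough, hypothetical sketch of the accuracy-recovering-adapter idea (not Apple's published recipe): the base weights are quantized and frozen, and a small low-rank adapter is added on top and trained to win back the accuracy lost to quantization. The 4-bit width, adapter rank, and class name below are illustrative assumptions.

```python
import numpy as np

class QuantizedLinearWithLoRA:
    """Toy sketch: a frozen, 4-bit-style quantized weight matrix plus a
    small trainable low-rank adapter (A, B) that can be fine-tuned to
    recover accuracy lost to quantization."""

    def __init__(self, w: np.ndarray, rank: int = 16):
        # Symmetric 4-bit quantization: integers in [-7, 7] plus one scale.
        self.scale = np.abs(w).max() / 7.0
        self.q = np.clip(np.round(w / self.scale), -7, 7).astype(np.int8)
        d_out, d_in = w.shape
        # Low-rank adapter, initialized so it starts as a no-op (B = 0).
        self.A = np.random.randn(rank, d_in).astype(np.float32) * 0.01
        self.B = np.zeros((d_out, rank), dtype=np.float32)

    def forward(self, x: np.ndarray) -> np.ndarray:
        w_hat = self.q.astype(np.float32) * self.scale    # dequantized base
        return x @ w_hat.T + x @ self.A.T @ self.B.T      # base + adapter

layer = QuantizedLinearWithLoRA(np.random.randn(4096, 4096).astype(np.float32))
y = layer.forward(np.random.randn(2, 4096).astype(np.float32))
print(y.shape)  # (2, 4096)
```

During adapter training only A and B would receive gradients, which is what keeps the recovery step cheap relative to re-training the full quantized model.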