
Apple has introduced new foundation language models to power its Apple Intelligence features. The models, AFM-on-device and AFM-server, are designed to run efficiently and responsibly. AFM-on-device has approximately 3 billion parameters (about 6GB of weights at 16-bit precision) and is optimized for on-device use. It is quantized and paired with 'accuracy-recovery adapters': quantization shrinks the model by storing its weights at lower precision, and low-rank adapters trained on 10 billion tokens recover the performance lost in the process. AFM-server is a larger model designed for server-side applications. Researchers from Apple highlight that quantization is essential for reducing computational and memory requirements, making large language models accessible on a wider range of devices.
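The adapter idea above can be illustrated with a deliberately simplified sketch. The assumptions are ours, not Apple's: a 16x16 random matrix stands in for one weight matrix, fake_quantize is a plain 3-bit round-to-nearest grid, and instead of training low-rank adapters on tokens, the hypothetical rank1_correction fits a single rank-1 term directly to the quantization error by power iteration. It shows only the shape of the fix: a frozen quantized weight plus a small low-rank update b a^T.

```python
import math
import random

def fake_quantize(w, bits):
    """Round a matrix to a symmetric per-tensor grid (hypothetical helper)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for row in w for x in row) / qmax or 1.0  # avoid zero scale
    return [[max(-qmax, min(qmax, round(x / scale))) * scale for x in row] for row in w]

def rank1_correction(err, iters=50):
    """Return (b, a) with b[i] * a[j] approximating err[i][j].

    Simplification: the term is fit to the weight error itself, not trained on data.
    a is a unit vector from power iteration on err^T err, and b = err @ a.
    """
    m, n = len(err), len(err[0])
    a = [1.0] * n
    for _ in range(iters):
        b = [sum(r[j] * a[j] for j in range(n)) for r in err]           # err @ a
        a = [sum(err[i][j] * b[i] for i in range(m)) for j in range(n)]  # err^T @ b
        norm = math.sqrt(sum(x * x for x in a))
        if norm == 0.0:  # error is exactly zero: nothing to correct
            return [0.0] * m, [0.0] * n
        a = [x / norm for x in a]
    b = [sum(r[j] * a[j] for j in range(n)) for r in err]
    return b, a

def frob_sq(mat):
    """Squared Frobenius norm."""
    return sum(x * x for row in mat for x in row)

random.seed(0)
W = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(16)]
W_hat = fake_quantize(W, bits=3)  # coarse, lossy base weights (kept frozen)
E = [[w - q for w, q in zip(rw, rq)] for rw, rq in zip(W, W_hat)]
b, a = rank1_correction(E)
W_fixed = [[W_hat[i][j] + b[i] * a[j] for j in range(16)] for i in range(16)]

err_before = frob_sq(E)
err_after = frob_sq([[w - f for w, f in zip(rw, rf)] for rw, rf in zip(W, W_fixed)])
print(f"squared error before: {err_before:.3f}, after rank-1 term: {err_after:.3f}")
```

Because a is a unit vector and b = E a, the corrected squared error equals ||E||^2 - ||b||^2, so this term can only reduce it. A trained adapter instead optimizes the model's outputs on data, which is what actually recovers accuracy.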
Quantization Is All You Need: SOTA LLMs are too large to run on laptops. Quantization is a technique used to reduce LLMs' computational and memory requirements and create a smaller version of the model. Quantization is central to OSS progress. It involves converting the model's… https://t.co/bAnJLnONsW
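The tweet cuts off mid-sentence, so this does not reconstruct its wording. It is a minimal sketch of one common form of weight quantization, symmetric per-tensor int8, using only the Python standard library. The function names are hypothetical and the random weights stand in for a real layer.

```python
import random
from array import array

def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: x ~= q * scale, q in [-127, 127]."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid a zero scale for all-zero input
    scale = absmax / 127
    q = array("b", (max(-127, min(127, round(v / scale))) for v in values))
    return q, scale

def dequantize_int8(q, scale):
    """Map int8 codes back to approximate float values."""
    return [v * scale for v in q]

random.seed(0)
weights = array("f", (random.gauss(0.0, 0.02) for _ in range(4096)))  # fp32 storage
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

max_err = max(abs(w - r) for w, r in zip(weights, restored))
print("fp32 bytes:", len(weights) * weights.itemsize)
print("int8 bytes:", len(q) * q.itemsize)  # the single float scale is not counted
print("max abs error:", max_err, "step/2:", scale / 2)
```

Storing int8 codes plus one scale takes a quarter of the fp32 bytes; the price is a rounding error of at most half a quantization step per weight.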
[CL] Apple Intelligence Foundation Language Models https://t.co/O4KChBcb35 - The models introduced, AFM-on-device and AFM-server, are foundation language models designed to power Apple Intelligence features efficiently and responsibly. AFM-on-device is a ~3B parameter… https://t.co/kh6u4OmzgK
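The ~3B parameter count lines up with the 6GB figure for 16-bit weights in the summary. A back-of-the-envelope sketch, assuming weights dominate storage and ignoring activations, KV cache, and quantization metadata such as scales (weight_footprint_gb is a hypothetical helper):

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params = 3e9  # ~3B parameters, as stated for AFM-on-device
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_footprint_gb(params, bits):.2f} GB")
```

At 16 bits this gives 6.00 GB, and each halving of the bit width halves the footprint, which is why sub-4-bit formats matter for phones and laptops.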
Can one achieve SOTA LLM performance at a much lower bit size (and hence lower memory/inference costs) than current (post-training) quantization allows? YES! By training ternary LLMs: a "sweet spot" between underperforming binary and costly full-precision ones. Happy to announce our recently… https://t.co/siObHApQ9P
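For intuition about ternary weights, here is a minimal sketch of one common ternarization rule, the per-tensor absmean scheme used in BitNet b1.58. That choice is an assumption: the announced work may use a different rule, and the sketch covers only the weight mapping and its storage, not the training procedure. Since 3^5 = 243 fits in a byte, five ternary weights pack into 8 bits (1.6 bits per weight, close to the log2(3) ≈ 1.58 limit); ternarize, pack_trits, and unpack_trits are hypothetical helpers.

```python
import math
import random

def ternarize(weights, eps=1e-8):
    """Absmean ternarization: w ~= code * scale, code in {-1, 0, +1}.

    Forward mapping only; training ternary models also needs a gradient
    workaround such as a straight-through estimator, which is not shown.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def pack_trits(codes):
    """Pack five ternary codes per byte (3**5 = 243 <= 256)."""
    out = bytearray()
    for i in range(0, len(codes), 5):
        byte = 0
        for c in reversed(codes[i:i + 5]):
            byte = byte * 3 + (c + 1)  # shift {-1, 0, 1} to base-3 digits {0, 1, 2}
        out.append(byte)
    return bytes(out)

def unpack_trits(packed, n):
    """Inverse of pack_trits; n trims padding from a final partial group."""
    codes = []
    for byte in packed:
        for _ in range(5):
            codes.append(byte % 3 - 1)
            byte //= 3
    return codes[:n]

random.seed(1)
w = [random.gauss(0.0, 1.0) for _ in range(1000)]
codes, scale = ternarize(w)
packed = pack_trits(codes)
print("packed bytes:", len(packed), "for", len(codes), "weights")
print("fraction of zero weights:", codes.count(0) / len(codes))
print("ideal bits per weight:", round(math.log2(3), 3))
```

The zero code is what separates ternary from binary weights: it lets the model switch individual connections off, one plausible reason ternary models land between binary and full precision.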