
Researchers from Microsoft Corporation and the University of Surrey have developed MInference (Million-tokens Inference), a training-free method that accelerates the pre-filling stage of long-context large language models (LLMs) using dynamic sparse attention. It assigns each attention head one of three sparse patterns, A-shape, Vertical-Slash, or Block-Sparse, and reports pre-filling speedups of up to 10x without losing accuracy; the roughly 90% latency reduction this implies is the figure highlighted in the coverage below. The code for MInference is open-source, allowing for broader adoption. Additionally, Meta has introduced MobileLLM, a compact language model designed for mobile devices. MobileLLM prioritizes model depth over width, implements embedding sharing and grouped-query attention, and uses immediate block-wise weight sharing. This design aims to make sub-billion-parameter LLMs practical for on-device use, reducing reliance on cloud computing and improving response times. Meta's approach could signal a shift from large-scale models toward smaller, more efficient models for edge devices.
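To make the Vertical-Slash idea concrete, the sketch below builds a dense boolean attention mask that keeps only a few "vertical" key columns, a few "slash" diagonals, and a short local causal window, skipping everything else. The specific index choices are hypothetical illustrations, not MInference's actual per-head pattern search or its optimized sparse kernels.

```python
import numpy as np

def vertical_slash_mask(seq_len, vertical_idx, slash_offsets, local_window=4):
    """Illustrative Vertical-Slash sparse-attention mask.

    True entries are (query, key) pairs whose attention score is computed;
    False entries are skipped. Real implementations never materialize this
    dense mask; they run sparse kernels over the selected indices.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    rows = np.arange(seq_len)
    # Short causal window near the diagonal (recent tokens).
    for off in range(local_window):
        mask[rows[off:], rows[off:] - off] = True
    # Vertical lines: every later query attends to these key positions.
    for k in vertical_idx:
        mask[rows >= k, k] = True
    # Slash lines: diagonals at fixed offsets behind the main diagonal.
    for off in slash_offsets:
        mask[rows[off:], rows[off:] - off] = True
    # Enforce causality.
    mask &= rows[:, None] >= rows[None, :]
    return mask

# Hypothetical pattern: keys 0 and 3 as verticals, one slash 8 back.
m = vertical_slash_mask(16, vertical_idx=[0, 3], slash_offsets=[8])
density = m.mean()  # fraction of score entries actually computed
```

Because the mask density stays far below 1.0 as the sequence grows, the quadratic attention cost in pre-filling drops roughly in proportion, which is where the reported speedups come from.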

Microsoft drops ‘MInference’ demo, challenges status quo of AI processing: Microsoft unveils MInference, a groundbreaking AI technology that accelerates language model processing by up to 90%, potentially transforming long-context AI… https://t.co/IsZkymwoMf #AI #Automation
Meta AI develops compact language model for mobile devices: Key innovations in Meta's MobileLLM include prioritizing model depth over width, implementing embedding sharing and grouped-query attention and utilizing a novel immediate… https://t.co/yXB9wv9ips #AI #aimodels
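Grouped-query attention, one of the MobileLLM design choices named above, can be sketched as follows: many query heads share a smaller set of key/value heads, shrinking the KV cache that must live in mobile memory. The head counts and shapes below are illustrative assumptions, not MobileLLM's actual configuration.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Minimal grouped-query attention (GQA) sketch.

    q:    (n_q_heads, seq, d)   -- many query heads
    k, v: (n_kv_heads, seq, d)  -- fewer shared key/value heads
    """
    n_q_heads, seq, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads          # query heads per KV head
    # Broadcast each KV head across its query group instead of storing copies.
    k_rep = np.repeat(k, group, axis=0)      # (n_q_heads, seq, d)
    v_rep = np.repeat(v, group, axis=0)
    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(d)
    # Softmax over keys, numerically stabilized.
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v_rep

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 16))   # 8 query heads (illustrative)
k = rng.normal(size=(2, 5, 16))   # only 2 KV heads -> 4x smaller KV cache
v = rng.normal(size=(2, 5, 16))
out = grouped_query_attention(q, k, v)
```

The memory saving is the point for edge devices: the KV cache scales with the number of KV heads, so 2 shared heads instead of 8 cuts that cache by 4x while keeping the full complement of query heads.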