Researchers from Stanford University have introduced a new method to improve the cost-efficiency of large language models (LLMs). The approach, dubbed MinionS, pairs small on-device LLMs with more powerful cloud-based models to handle complex reasoning tasks more economically. In the MinionS protocol, the cloud model decomposes a task into simpler subtasks that are executed locally on-device, so the long context never has to be sent to the cloud; on financial, medical, and scientific reasoning tasks over long documents, this reduces cloud inference costs by an average of 5.7x while retaining 97.9% of the cloud model's standalone performance. The method addresses the difficulty small models have in following multi-step instructions and reasoning over long contexts by keeping the data and most of the computation local. A naive version of the protocol, Minion, in which the two models simply converse, achieves a 30.4x reduction in remote costs but recovers only 87% of the frontier model's performance. The digest also covers other advances in LLM efficiency, including LightThinker, a method from Zhejiang University and Ant Group that dynamically compresses intermediate reasoning steps during generation, and serving frameworks such as vLLM, LMDeploy, and SGLang, which optimize LLM inference through techniques such as PagedAttention, persistent batching, and RadixAttention prefix caching, respectively.
[LG] Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models A Narayan, D Biderman, S Eyuboglu, A May... [Stanford University] (2025) https://t.co/GljQ2pE1tK https://t.co/yLY3m7H33P
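To make the division of labor concrete, here is a minimal Python sketch of one decompose-execute-aggregate round in the spirit of the MinionS protocol. The cloud_chat and local_chat helpers are hypothetical stand-ins for API calls to the frontier and on-device models; the actual protocol also chunks the document and iterates over multiple rounds, which this sketch omits.

```python
import json

def cloud_chat(prompt: str) -> str:
    """Placeholder for a call to the frontier cloud model (e.g., an API request)."""
    raise NotImplementedError

def local_chat(prompt: str) -> str:
    """Placeholder for a call to the small on-device model (e.g., a local server)."""
    raise NotImplementedError

def minions_round(task: str, document: str) -> str:
    # 1. The cloud model sees only the task, not the long document,
    #    and decomposes it into simple, independently answerable subtasks.
    plan = cloud_chat(
        "Decompose this task into independent subtasks, one per line, "
        f"each answerable from excerpts of a document:\n{task}"
    )
    subtasks = [line for line in plan.splitlines() if line.strip()]

    # 2. Each subtask runs locally, where the full document lives, so the
    #    long context incurs no cloud token costs.
    results = []
    for sub in subtasks:
        answer = local_chat(f"Document:\n{document}\n\nSubtask: {sub}")
        results.append({"subtask": sub, "answer": answer})

    # 3. The cloud model aggregates the short local answers into a final response.
    return cloud_chat(
        f"Task: {task}\nSubtask results:\n{json.dumps(results, indent=2)}\n"
        "Synthesize a final answer, or reply RETRY if the results are insufficient."
    )
```

The cost savings come from step 2: only the short subtask descriptions and their answers cross the network, not the full document.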
Running LLMs locally seemed impossible until Ollama came along. Our 2023 deep-dive on scaling these models is even more relevant today as edge AI takes off. Learn how we tackled memory constraints and optimized inference, and why running AI on your own hardware still matters in 2025… https://t.co/e3JNSJrwkJ
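For readers who want to try this, below is a minimal sketch of calling a locally hosted model through Ollama's REST API. It assumes an Ollama server running on its default port (11434) and a model, here "llama3" as an example, that has already been pulled with `ollama pull`.

```python
import requests

def generate_locally(prompt: str, model: str = "llama3") -> str:
    # Ollama exposes a local HTTP endpoint; stream=False returns one JSON object.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(generate_locally("Why does running models on-device reduce costs?"))
```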
Exciting New Research: Injecting Domain-Specific Knowledge into Large Language Models I just came across a fascinating, comprehensive survey on enhancing Large Language Models (LLMs) with domain-specific knowledge. While LLMs like GPT-4 have shown remarkable general capabilities,… https://t.co/xG1rkfUhVl
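As an illustration of one approach such surveys typically cover, a common way to inject domain knowledge without retraining is to retrieve relevant snippets and prepend them to the prompt. The sketch below is a toy version of that idea: retrieve scores documents by naive keyword overlap (a real system would use embeddings), and llm is a hypothetical stand-in for any text-generation call.

```python
def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    # Score each document by how many query words it contains (toy heuristic).
    words = query.lower().split()
    scored = sorted(corpus, key=lambda doc: -sum(w in doc.lower() for w in words))
    return scored[:k]

def answer_with_domain_knowledge(query: str, corpus: list[str], llm) -> str:
    # Prepend retrieved domain context so the model grounds its answer in it.
    context = "\n".join(retrieve(query, corpus))
    prompt = (
        f"Use the following domain context to answer.\nContext:\n{context}\n\n"
        f"Question: {query}"
    )
    return llm(prompt)
```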