Recent work on large language models (LLMs) has focused on improving how they handle long-context inputs. The Pluto and Charon (PAC) framework accelerates LLM fine-tuning by up to 8.64x while cutting the memory footprint by 88.16%. GemFilter uses an LLM's early layers as a filter to select and compress input tokens, achieving a 1000x reduction in context length. A third approach freezes the LLM entirely and fine-tunes it by learning a binary 2:4 structured sparsity mask for each linear layer. Together, these methods aim to make long-context processing more efficient and scalable. Separately, a multilingual evaluation of long-context retrieval and reasoning found significant performance drops as context length grows, when multiple target sentences must be retrieved, and for lower-resource languages. The sparsity-mask work has been accepted as a NeurIPS spotlight.
Introducing 💎GemFilter💎, a simple, training-free and broadly applicable approach to address the long-context bottleneck, accelerating #LLM inference and reducing GPU memory consumption with 1000x token reduction 📘 Paper: https://t.co/qGiU9VyypG 🧠 Code: https://t.co/tUtCshkXgy https://t.co/GjB3z0yeps
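The core trick, as described, is to let an early transformer layer act as the filter: the attention that the final query token pays to the context at that layer ranks the tokens, and only the top-ranked ones are fed through the full model. Below is a minimal, illustrative PyTorch sketch of that selection step, assuming attention weights from one early layer are available; the function name, layer choice, and keep budget are assumptions, not GemFilter's actual code.

```python
import torch

def select_context_tokens(early_attn: torch.Tensor, keep: int) -> torch.Tensor:
    """Pick the context tokens the last query token attends to most.

    early_attn: attention weights from one early layer,
                shape (num_heads, seq_len, seq_len).
    keep:       number of tokens to retain.
    Returns the kept token indices in their original order.
    """
    # Average over heads, then take the last query row: how strongly the
    # final token attends to every position in the context.
    scores = early_attn.mean(dim=0)[-1]               # (seq_len,)
    top = scores.topk(min(keep, scores.numel())).indices
    return top.sort().values                          # preserve token order

# Toy usage with random "attention" standing in for a real early layer.
num_heads, seq_len = 8, 1024
attn = torch.rand(num_heads, seq_len, seq_len).softmax(dim=-1)
kept = select_context_tokens(attn, keep=128)
print(kept.shape)  # torch.Size([128]) -- the compressed input positions
```

In the method as described, only the first few layers would be run to produce these scores, and the full model would then process just the selected tokens, which is where the inference speedup and GPU memory savings come from.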
Multilingual Evaluation of Long Context Retrieval and Reasoning: Evaluates long-context LLMs across five languages, revealing significant performance drops with increased context length, multiple target sentences, and lower-resource languages. 📝https://t.co/oRQkt27hYt https://t.co/W2RfB01AjJ
🚀 @NeurIPSConf Spotlight! 🥳 Imagine fine-tuning an LLM with just a sparsity mask! In our latest work, we freeze the LLM and use 2:4 structured sparsity to learn binary masks for each linear layer. Thanks to NVIDIA Ampere’s 2:4 sparsity, we can achieve up to 2x compute… https://t.co/la6fxUpxOM
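A minimal sketch of how such mask-only fine-tuning could look in PyTorch: the dense weights stay frozen, a learnable score per weight drives a hard 2:4 mask (two survivors in every group of four along the input dimension), and a straight-through estimator lets gradients reach the scores. The module name, score initialization, and STE details here are assumptions for illustration; the actual work relies on NVIDIA Ampere's 2:4 sparse kernels for the compute savings, which this toy version does not use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Masked24Linear(nn.Module):
    """Frozen linear layer pruned by a learnable 2:4 binary mask."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        assert linear.in_features % 4 == 0, "in_features must be divisible by 4"
        # Frozen copies of the original weights and bias.
        self.weight = nn.Parameter(linear.weight.detach().clone(), requires_grad=False)
        self.bias = None if linear.bias is None else nn.Parameter(
            linear.bias.detach().clone(), requires_grad=False)
        # One learnable score per weight, initialized from weight magnitude.
        self.scores = nn.Parameter(self.weight.abs().clone())

    def _mask_2_4(self) -> torch.Tensor:
        out_f, in_f = self.scores.shape
        groups = self.scores.view(out_f, in_f // 4, 4)
        # Keep the top-2 scores in every group of 4 (the 2:4 pattern).
        idx = groups.topk(2, dim=-1).indices
        hard = torch.zeros_like(groups).scatter_(-1, idx, 1.0).view(out_f, in_f)
        # Straight-through estimator: hard mask forward, identity gradient back.
        return hard + self.scores - self.scores.detach()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.weight * self._mask_2_4(), self.bias)

# Usage: wrap each nn.Linear in the frozen LLM and train only the scores.
masked = Masked24Linear(nn.Linear(8, 4))
opt = torch.optim.AdamW([masked.scores], lr=1e-3)
loss = masked(torch.randn(2, 8)).pow(2).mean()
loss.backward()
opt.step()
```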