
Neural Magic has unveiled the 2:4 Sparse Llama 3.1 8B model, a notable step forward in language model efficiency. The model is pruned to 50% sparsity in a 2:4 pattern, is designed for fast GPU inference, and recovers 98% of baseline accuracy on the Open LLM Leaderboard v1. It also achieves full accuracy recovery on a range of fine-tuning tasks, including math, coding, and chat. Sparse Llama 3.1 8B is open source and aims to deliver the capabilities of large language models (LLMs) at a lower resource cost. Separately, researchers have been exploring fine-tuning techniques for Llama 3.1 8B on the CoQA conversational dataset, with some reports indicating a twofold increase in exact match scores. These developments reflect ongoing efforts in the AI community to optimize LLMs for practical applications, particularly multi-turn conversational use.
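To make the 2:4 pattern concrete: in every contiguous group of four weights, two are zeroed out, which is the semi-structured sparsity that modern GPUs can accelerate with sparse tensor cores. The snippet below is a minimal, illustrative magnitude-based sketch of that masking in PyTorch; it is not the actual pruning recipe used to produce Sparse Llama, which selects weights far more carefully and retrains the model afterwards.

```python
import torch

def apply_2_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Illustrative 2:4 pruning: in every contiguous group of 4 weights along
    the last dim, keep the 2 largest-magnitude values and zero the other 2."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "2:4 sparsity needs the inner dim divisible by 4"
    groups = weight.reshape(out_features, in_features // 4, 4)
    # Indices of the 2 smallest-magnitude weights in each group of 4
    _, drop_idx = groups.abs().topk(2, dim=-1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop_idx, 0.0)
    return (groups * mask).reshape(out_features, in_features)

w = torch.randn(8, 16)
w_sparse = apply_2_4_sparsity(w)
print((w_sparse == 0).float().mean())  # -> 0.5, i.e. 50% of weights pruned
```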

How do you teach an LLM to carry long-form conversations and not get confused by all the details? To learn how we can improve fine-tuning over long-form conversational data, I fine-tuned a bunch of models on the CoQA dataset and 2x'd performance! Full code notebook below🔽 https://t.co/MHlbEZQXHd
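For context, CoQA pairs each passage with a sequence of questions and answers, so before fine-tuning the records have to be flattened into multi-turn chat examples. A rough sketch of that step is below; the Hub dataset id `stanfordnlp/coqa` and the field names (story, questions, answers.input_text) are assumptions from the public CoQA schema, and the linked notebook may structure this differently.

```python
from datasets import load_dataset

# Assumed Hub dataset id and schema; the linked notebook may differ.
coqa = load_dataset("stanfordnlp/coqa", split="train")

def coqa_to_messages(example):
    """Flatten one CoQA record (passage + alternating Q/A) into chat turns."""
    messages = [{
        "role": "system",
        "content": "Answer questions about the following passage:\n" + example["story"],
    }]
    for question, answer in zip(example["questions"], example["answers"]["input_text"]):
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    return messages

print(coqa_to_messages(coqa[0])[:3])
```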
yo! @NVIDIAAIDev finally released the weights for Hymba-1.5B - outperforms Llama, Qwen, and SmolLM2 with 6-12x less training, trained ONLY on 1.5T tokens > massive reductions in KV cache size and improved throughput > combines Mamba and Attention in a hybrid parallel… https://t.co/H5qxTpUX16
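If you want to poke at the released checkpoint yourself, a minimal load-and-generate sketch with Hugging Face transformers might look like the following; the repo id `nvidia/Hymba-1.5B-Base` and the need for `trust_remote_code` are assumptions, so check the model card on the Hub.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id; verify against the actual model card on the Hugging Face Hub.
model_id = "nvidia/Hymba-1.5B-Base"

# Hybrid Mamba/attention models typically ship custom modeling code,
# hence trust_remote_code=True (also an assumption here).
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "A hybrid of Mamba and attention layers is useful because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```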
New Cookbook: Fine-tuning Llama 3.1 8B on Conversation Data. In this notebook we fine-tune Llama 3.1 8B on the CoQA multi-turn conversation dataset with loss masking and show a significant improvement in performance! Read below: https://t.co/aB3gQp2JsX
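The key idea the cookbook calls out is loss masking: only the assistant's reply tokens contribute to the cross-entropy loss, so the model isn't penalized for failing to reproduce the passage or the user's questions. A generic sketch of that idea (not the notebook's exact code) uses the Hugging Face convention of labeling ignored positions with -100:

```python
import torch

IGNORE_INDEX = -100  # positions labeled -100 are skipped by Hugging Face's LM loss

def build_masked_example(tokenizer, turns):
    """Tokenize one multi-turn conversation so that only assistant tokens
    contribute to the loss. `turns` is a list of {"role", "content"} dicts.
    A generic loss-masking sketch, not the cookbook's exact implementation."""
    input_ids, labels = [], []
    for turn in turns:
        ids = tokenizer(turn["content"] + "\n", add_special_tokens=False)["input_ids"]
        input_ids.extend(ids)
        if turn["role"] == "assistant":
            labels.extend(ids)                           # supervise the model's answers
        else:
            labels.extend([IGNORE_INDEX] * len(ids))     # mask the passage and questions
    return torch.tensor(input_ids), torch.tensor(labels)
```

In a real pipeline you would render the conversation with the model's chat template and mask everything outside the assistant spans, but the loss-masking principle is the same.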