2.1k stars, 2+ million downloads, and 7000+ models on Huggingface later, I am officially ready to retire my long-time project AutoAWQ ⚡️ Proud to say that AutoAWQ has been adopted by the @vllm_project and will now be maintained by 55+ contributors 🥳 https://t.co/Do8734uWxu
NVIDIA just launched Describe Anything. This new vision-language model doesn’t just caption images; it narrates exactly what’s happening, where, and why it matters, down to pixel-specific regions in both images and videos. https://t.co/PzNRRVBOwH https://t.co/uQs0PI1V53
🚀 We are delighted to announce MamayLM, a new state-of-the-art efficient Ukrainian LLM! 📈 MamayLM surpasses all similar-sized models in both English and Ukrainian, while matching or overtaking up to 10x larger models. 📊 MamayLM is a 9B model that can run on a single GPU, https://t.co/v8dKHX56SL
Nvidia has introduced Eagle 2.5, a family of vision-language models (VLMs) designed for long-context multimodal learning. The 8-billion-parameter Eagle 2.5-8B matches much larger models such as GPT-4o and Qwen2.5-VL-72B on long-video understanding, taking first place on 6 of 10 long-video benchmarks, beating GPT-4o on 3 of 5 video tasks, and surpassing Gemini 1.5 Pro on 4 of 6 video tasks; it also leads on an hour-long video benchmark. The model handles long contexts natively, without a compression module, and is trained on the Eagle-Video-110K dataset. To process longer input sequences efficiently, Eagle 2.5 combines information-first sampling with progressive post-training; its sampling preserves over 60% of the original image area, and the architecture uses techniques such as a focal prompt and gated cross-attention to improve multimodal understanding.

Alongside Eagle 2.5, Nvidia has released Describe Anything, a 3-billion-parameter vision-language model (DAM-3B) for detailed, localized image and video captioning. Describe Anything integrates full-image or full-video context with fine-grained local detail, and users can specify the region to caption with points, boxes, scribbles, or masks. The model, dataset, benchmark, and a demo are open-sourced on Hugging Face.
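The gated cross-attention mentioned for Eagle 2.5 is a common way to inject visual tokens into a language model without destabilizing its pretrained weights. The sketch below is a generic, Flamingo-style block in PyTorch, not Nvidia's actual Eagle 2.5 code; the class name, dimensions, and zero-initialized gates are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Minimal sketch of gated cross-attention (Flamingo-style), NOT Eagle 2.5's implementation."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gates start at zero, so the block is an identity mapping at init and
        # the visual signal is blended in gradually during training.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (batch, text_len, dim); visual_tokens: (batch, vis_len, dim)
        q = self.norm(text_tokens)
        attn_out, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffn_gate) * self.ffn(x)
        return x

# Example: fuse 256 visual tokens into a 32-token text sequence.
block = GatedCrossAttentionBlock(dim=512)
text = torch.randn(2, 32, 512)
vision = torch.randn(2, 256, 512)
out = block(text, vision)  # shape (2, 32, 512)
```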
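Describe Anything's region prompts (points, boxes, scribbles, masks) can all be thought of as binary masks over the image. The helpers below sketch how box and point prompts might be rasterized into such masks with NumPy; the function names and conventions are assumptions for illustration, not DAM-3B's actual input format on Hugging Face.

```python
import numpy as np

def box_to_mask(box, height, width):
    """Rasterize an (x0, y0, x1, y1) box into a binary region mask."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[int(y0):int(y1), int(x0):int(x1)] = 1
    return mask

def points_to_mask(points, height, width, radius=5):
    """Rasterize click points into small disks on a binary region mask."""
    mask = np.zeros((height, width), dtype=np.uint8)
    yy, xx = np.mgrid[0:height, 0:width]
    for px, py in points:
        mask |= ((xx - px) ** 2 + (yy - py) ** 2 <= radius ** 2).astype(np.uint8)
    return mask

# Example: mark a box and two click points on a 480x640 frame.
region = box_to_mask((100, 50, 300, 200), height=480, width=640)
clicks = points_to_mask([(120, 80), (250, 150)], height=480, width=640)
```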