2.1k stars, 2+ million downloads, and 7000+ models on Huggingface later, I am officially ready to retire my long-time project AutoAWQ ⚡️ Proud to say that AutoAWQ has been adopted by the @vllm_project and will now be maintained by 55+ contributors 🥳 https://t.co/Do8734uWxu
NVIDIA just launched Describe Anything. This new vision-language model doesn’t just caption images; it narrates exactly what’s happening, where, and why it matters, down to pixel-specific regions in both images and videos. https://t.co/PzNRRVBOwH https://t.co/uQs0PI1V53
🚀 We are delighted to announce MamayLM, a new state-of-the-art efficient Ukrainian LLM! 📈 MamayLM surpasses all similar-sized models in both English and Ukrainian, while matching or overtaking up to 10x larger models. 📊 MamayLM is a 9B model that can run on a single GPU, https://t.co/v8dKHX56SL
Nvidia has introduced Eagle 2.5, a family of vision-language models (VLMs) designed for long-context multimodal learning. The 8-billion-parameter Eagle 2.5-8B matches much larger models such as GPT-4o and Qwen2.5-VL-72B on long-video understanding, taking first place on 6 of 10 long-video benchmarks, beating GPT-4o on 3 of 5 video tasks, and surpassing Gemini 1.5 Pro on 4 of 6 video tasks; it also leads on an hour-long video benchmark. The model handles long contexts natively, without a compression module, and is trained on the Eagle-Video-110K dataset. To process longer input sequences efficiently, Eagle 2.5 combines information-first sampling with progressive post-training; its sampling preserves over 60% of the original image area, and the architecture uses techniques such as a focal prompt and gated cross-attention to improve multimodal understanding.

Alongside Eagle 2.5, Nvidia has released Describe Anything, a 3-billion-parameter vision-language model (DAM-3B) for detailed, localized image and video captioning. Describe Anything integrates full-image or full-video context with fine-grained local detail, and users can specify the region to caption with points, boxes, scribbles, or masks. The model, dataset, benchmark, and a demo are open-sourced on Hugging Face.
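The gated cross-attention mentioned for Eagle 2.5 is a common way to inject visual tokens into a language model without destabilizing its pretrained weights. The sketch below is a generic, Flamingo-style block in PyTorch, not Nvidia's actual Eagle 2.5 code; the class name, dimensions, and zero-initialized gates are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Minimal sketch of gated cross-attention (Flamingo-style), NOT Eagle 2.5's implementation."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gates start at zero, so the block is an identity mapping at init and
        # the visual signal is blended in gradually during training.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (batch, text_len, dim); visual_tokens: (batch, vis_len, dim)
        q = self.norm(text_tokens)
        attn_out, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffn_gate) * self.ffn(x)
        return x

# Example: fuse 256 visual tokens into a 32-token text sequence.
block = GatedCrossAttentionBlock(dim=512)
text = torch.randn(2, 32, 512)
vision = torch.randn(2, 256, 512)
out = block(text, vision)  # shape (2, 32, 512)
```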
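Describe Anything's region prompts (points, boxes, scribbles, masks) can all be thought of as binary masks over the image. The helpers below sketch how box and point prompts might be rasterized into such masks with NumPy; the function names and conventions are assumptions for illustration, not DAM-3B's actual input format on Hugging Face.

```python
import numpy as np

def box_to_mask(box, height, width):
    """Rasterize an (x0, y0, x1, y1) box into a binary region mask."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[int(y0):int(y1), int(x0):int(x1)] = 1
    return mask

def points_to_mask(points, height, width, radius=5):
    """Rasterize click points into small disks on a binary region mask."""
    mask = np.zeros((height, width), dtype=np.uint8)
    yy, xx = np.mgrid[0:height, 0:width]
    for px, py in points:
        mask |= ((xx - px) ** 2 + (yy - py) ** 2 <= radius ** 2).astype(np.uint8)
    return mask

# Example: mark a box and two click points on a 480x640 frame.
region = box_to_mask((100, 50, 300, 200), height=480, width=640)
clicks = points_to_mask([(120, 80), (250, 150)], height=480, width=640)
```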