The only way devices like the Humane Pin or Rabbit are going to fulfill their promises is with vision-text models running as locally as possible! Small (open-source) VLMs to the rescue!! Here comes Idefics2, a small, efficient, and SOTA multimodal vision-plus-text model! - 8B… https://t.co/2su4OHATGw https://t.co/EJk7ktVXiz
Today we release Idefics2, our newest 8B Vision-Language Model! 💪 With only 8B parameters, Idefics2 is one of the strongest open models out there 📋 We used multiple OCR datasets, including PDFA and IDL from @wightmanr and @m_olbap, and increased resolution up to 980x980 to improve… https://t.co/JS9adw5upG
New open-source Vision LLM in town 🤠 https://t.co/eW0qsTX89g
The newly released Idefics2, an 8-billion-parameter Vision-Language Model (VLM), is drawing attention for its strong capabilities relative to its size. Developed with a focus on OCR, document understanding, and visual reasoning, Idefics2 was trained on multiple OCR datasets, including PDFA and IDL, and supports image resolutions up to 980x980. It is designed to be competitive with larger 30-billion-parameter models and improves markedly on its predecessor, Idefics1, gaining 12 points on VQAv2 and 30 points on TextVQA while using ten times fewer parameters. The model is released under the Apache 2.0 license, can run on less powerful GPUs, making it accessible for broader use, and ships with an instruction-tuned variant that takes image and text inputs and generates text outputs. Additionally, Idefics2 is fully open, with transparency about its training data, enhancing its appeal.
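For readers who want to try the instruction-tuned variant, here is a minimal sketch using the transformers library, assuming the checkpoint is published on the Hugging Face Hub as HuggingFaceM4/idefics2-8b and is loadable via AutoProcessor / AutoModelForVision2Seq; the example image URL and question are placeholders, so check the model card for the exact identifiers and usage.

```python
# Minimal sketch: asking Idefics2-8B (instruct) a question about an image.
# Assumptions: hub id "HuggingFaceM4/idefics2-8b", AutoModelForVision2Seq support,
# and a placeholder image URL -- verify against the official model card.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceM4/idefics2-8b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on smaller GPUs
    device_map="auto",
)

# One image plus a text question, formatted with the model's chat template.
image = Image.open(
    requests.get("https://example.com/invoice.png", stream=True).raw  # placeholder URL
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the total amount on this invoice?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Loading in half precision (and optionally with quantization) is what makes the 8B model practical on consumer GPUs, which is the accessibility point the summary above highlights.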