Google DeepMind has launched SigLIP 2, a next-generation multilingual vision-language encoder, available on Hugging Face. The new model enhances semantic understanding and localization, integrates dynamic resolution for aspect-ratio-sensitive tasks, and outperforms its predecessor in zero-shot classification, retrieval, and question answering. Key improvements include better location awareness and stronger local representations. Initial testing shows promising results, with ImageNet-1k zero-shot top-1 accuracies of 73.9% (B/32 at 256 px) and 78.4% (B/16 at 224 px). The release of SigLIP 2 is expected to make it a preferred choice for a range of vision tasks.
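For context, here is a minimal sketch of how such a checkpoint could be queried for zero-shot classification through Hugging Face transformers; the checkpoint id, image path, and label prompts are illustrative assumptions, not taken from the announcement.

```python
# Hypothetical zero-shot classification sketch with a SigLIP 2 checkpoint via
# Hugging Face transformers. The checkpoint id and inputs are assumptions.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-base-patch16-224"  # assumed checkpoint id
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("example.jpg")  # any local test image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# SigLIP-family models are trained with a sigmoid (pairwise) loss, so each
# image-text pair gets an independent probability rather than a softmax score.
probs = torch.sigmoid(outputs.logits_per_image)
for label, p in zip(labels, probs[0]):
    print(f"{label}: {p.item():.3f}")
```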
Vision-language model trained with reinforcement learning https://t.co/SgNyzhXaVc
Google DeepMind Research Releases SigLIP2: A Family of New Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features https://t.co/RSyDsIlsb4
Okay, SigLIP 2 weights for OpenCLIP and timm (image encoder only) are on the @huggingface hub. Merged to main, release probably this weekend. I tested the IN-1k zero-shot and these are the OpenCLIP numbers:
B/32 256: top-1 73.9, top-5 93.4
B/16 224: top-1 78.4, top-5 95.7
B/16 256 …
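For readers who want to run a check like this themselves, here is a minimal sketch using open_clip; the hub tag, image path, and three-class prompt list are assumptions standing in for the full ImageNet-1k protocol.

```python
# Hypothetical sketch of a zero-shot check with the SigLIP 2 OpenCLIP weights
# mentioned above. The hub tag and inputs are assumptions; a real IN-1k eval
# ranks similarities over all 1,000 class prompts on the validation set.
import torch
import open_clip
from PIL import Image

tag = "hf-hub:timm/ViT-B-16-SigLIP2-256"  # assumed hub id
model, preprocess = open_clip.create_model_from_pretrained(tag)
tokenizer = open_clip.get_tokenizer(tag)
model.eval()

class_names = ["goldfish", "tabby cat", "golden retriever"]  # stand-in classes
text = tokenizer([f"a photo of a {c}" for c in class_names])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    sims = image_features @ text_features.T  # cosine similarities

print("predicted:", class_names[sims.argmax(dim=-1).item()])
```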