
JinaAI has developed a new image-text contrastive pretraining technique that strengthens the text encoder, and the resulting model outperforms previous CLIP models on the MTEB benchmark. The model, Jina-CLIP, is open-source and available via Transformers and Transformers.js; it supports both text-text retrieval and multimodal RAG, a significant improvement over OpenAI's CLIP. Meanwhile, TogetherCompute has introduced Dragonfly, a family of vision-language models that encode images at multiple resolutions and select salient "zoom-in" patches to pass to the language model, sharpening fine-grained visual understanding. Developed in collaboration with James Y. Zou, the Dragonfly models achieve state-of-the-art results in medical image understanding and captioning, even outperforming Med-Gemini. Additionally, a new approach called Cluster Masking improves visual-language contrastive learning.
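A minimal sketch of using Jina-CLIP through Transformers for both text-text and text-image retrieval. The model id and the `encode_text` / `encode_image` helpers are assumptions based on how Jina's embedding models are typically exposed with `trust_remote_code`; the image URL is hypothetical.

```python
# Hedged sketch: load Jina-CLIP via Transformers and score a text query against
# a text passage (text-text retrieval) and an image (multimodal RAG).
from transformers import AutoModel
from numpy import dot
from numpy.linalg import norm

# Assumed model id; custom code supplies the encode_* helpers.
model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

# Text-text retrieval: both sides go through the improved text encoder.
query_emb = model.encode_text(["how do I fine-tune a CLIP model?"])[0]
doc_emb = model.encode_text(["A guide to fine-tuning contrastive image-text models."])[0]

# Text-image retrieval: embed an image into the same space (hypothetical URL).
img_emb = model.encode_image(["https://example.com/figure.png"])[0]

cosine = lambda a, b: dot(a, b) / (norm(a) * norm(b))
print("text-text similarity:", cosine(query_emb, doc_emb))
print("text-image similarity:", cosine(query_emb, img_emb))
```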
New vision language models from the research team @togethercompute and @james_y_zou. With multi-resolution patches and fine-tuning on biomedical image-instruction data, it can even out-perform Med-Gemini! https://t.co/2z14g7RLog
Dragonfly is a family of new multimodal models from @togethercompute, including one that outperforms Med-Gemini on medical image understanding. https://t.co/TjQmDVAI4a
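The sketch below illustrates the general idea behind multi-resolution encoding with zoom-in patch selection, not the released Dragonfly code: resize an image to several resolutions, tile the highest-resolution view into patches, and keep only the most salient ones for the language model. The resolutions, patch size, and contrast-based saliency heuristic are illustrative assumptions.

```python
# Conceptual sketch of multi-resolution views + salient patch selection.
from PIL import Image
import numpy as np

def multi_resolution_patches(path, resolutions=(224, 448, 896), patch=224, keep=4):
    img = Image.open(path).convert("RGB")
    # Low-to-high resolution views of the full image.
    views = [img.resize((r, r)) for r in resolutions]

    # Tile the highest-resolution view into patch-sized crops.
    hi = np.asarray(views[-1], dtype=np.float32)
    patches = [
        hi[y:y + patch, x:x + patch]
        for y in range(0, hi.shape[0], patch)
        for x in range(0, hi.shape[1], patch)
    ]

    # Toy "zoom-in" selection: rank patches by local contrast, keep the top few.
    saliency = [p.std() for p in patches]
    top = sorted(range(len(patches)), key=lambda i: saliency[i], reverse=True)[:keep]
    return views, [patches[i] for i in top]
```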


