Apple has launched AIMv2, a new family of open-set vision encoders aimed at improving multimodal understanding and object recognition. The AIMv2 models draw inspiration from existing frameworks like CLIP and incorporate autoregressive techniques to improve performance. Apple also introduced CoreML models that reportedly match the zero-shot performance of OpenAI's ViT-B/16 while being 4.8 times faster and 2.8 times smaller; the new models ship in an iOS app, allowing users to run them directly on iPhones. Performance tests across quantization levels of models run with Apple MLX showed significant speed variation, with 3-bit models achieving 29.00 tokens per second and 8-bit models reaching 13.68 tokens per second. These advances reflect Apple's ongoing commitment to innovation in artificial intelligence and machine learning.
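For context on how tokens-per-second figures like those above are typically measured, here is a minimal sketch using the `mlx_lm` package to time generation for two quantized variants of a model. The repo names and prompt are placeholders, not the models referenced above; this is a sketch of the measurement approach, not a reproduction of the reported numbers.

```python
# Minimal sketch: comparing generation speed across MLX quantization levels.
# Assumes `mlx_lm` is installed and that quantized variants of a model exist
# on the Hugging Face hub (the repo names below are hypothetical placeholders).
import time
from mlx_lm import load, generate

PROMPT = "Explain what a vision encoder does in one paragraph."

for repo in (
    "mlx-community/SomeModel-3bit",   # hypothetical 3-bit quantization
    "mlx-community/SomeModel-8bit",   # hypothetical 8-bit quantization
):
    model, tokenizer = load(repo)
    start = time.perf_counter()
    text = generate(model, tokenizer, prompt=PROMPT, max_tokens=256)
    elapsed = time.perf_counter() - start
    n_tokens = len(tokenizer.encode(text))
    print(f"{repo}: {n_tokens / elapsed:.2f} tokens/sec")
```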
LLaVA-CoT Shows How to Achieve Structured, Autonomous Reasoning in Vision Language Models https://t.co/PQyW8vpbQg by Sergio De Simone
A preview of MMLU PRO scores in Computer Science for Qwen2.5-72B-Instruct using Apple MLX:
• 4bit: 78.78% (4h 40m) - 323/410 correct answers
• 3bit: 71.46% (3h 38m) - 293/410 correct answers
Testing time is shown in parentheses; yes, you read that right: 8 hours to test two quantizations 👀… https://t.co/YUlmJp9Js6
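For reference, the percentages follow directly from the correct-answer counts over the 410-question Computer Science subset; a quick check:

```python
# Sanity check of the reported accuracies (correct answers / 410 questions).
results = {"4bit": 323, "3bit": 293}
total = 410
for quant, correct in results.items():
    print(f"{quant}: {correct}/{total} = {100 * correct / total:.2f}%")
# 4bit: 323/410 = 78.78%
# 3bit: 293/410 = 71.46%
```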
LLaVA-o1 teaches machines to think step-by-step like humans when analyzing images. It introduces a novel approach to enhancing Vision Language Models (VLMs) by implementing structured, multi-stage reasoning. The paper tackles the challenge of systematic reasoning in visual… https://t.co/jrSutB9Dat
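The paper organizes a response into four reasoning stages (summary, caption, reasoning, conclusion). Below is a minimal sketch of splitting such a staged response into its parts; the tag names are assumptions based on that stage structure, so verify them against the released paper or model card before relying on them.

```python
# Minimal sketch: splitting a staged LLaVA-o1-style response into its parts.
# The tag names below follow the stage structure described in the paper
# (summary, caption, reasoning, conclusion) and are assumptions, not a
# confirmed output format.
import re

STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")

def parse_stages(response: str) -> dict[str, str]:
    """Return a mapping of stage name -> text for each tagged stage found."""
    parsed = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", response, re.DOTALL)
        if match:
            parsed[stage] = match.group(1).strip()
    return parsed

example = (
    "<SUMMARY>Identify the object being asked about.</SUMMARY>"
    "<CAPTION>The image shows a red traffic light at an intersection.</CAPTION>"
    "<REASONING>A red light means vehicles must stop.</REASONING>"
    "<CONCLUSION>The cars should stop.</CONCLUSION>"
)
print(parse_stages(example)["CONCLUSION"])  # -> "The cars should stop."
```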