🏷️:LLaVA-o1: Let Vision Language Models Reason Step-by-Step 🔗:https://t.co/ayq1E8NMJr https://t.co/uzDH5ubMMN
LLaVA-o1 is the first visual language model capable of spontaneous, systematic reasoning, similar to GPT-o1! 🤯 11B model outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six multimodal benchmarks. https://t.co/7g0uCAGCrF
LLaVA-o1, a VLM designed to conduct autonomous multistage reasoning. source: https://t.co/R3Je531zrt "Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation."…

Researchers from Peking University, Tsinghua University, Peng Cheng Laboratory, Alibaba DAMO Academy, and Lehigh University have introduced LLaVA-o1, a visual language model (VLM) capable of spontaneous, systematic reasoning. Unlike chain-of-thought prompting, where the reasoning structure is elicited through the prompt, LLaVA-o1 works through four sequential stages on its own: summarization, visual interpretation, logical reasoning, and conclusion generation. The model is fine-tuned from Llama-3.2-11B-Vision-Instruct on 100,000 samples. On six multimodal reasoning benchmarks it outperforms larger open models such as Llama-3.2-90B-Vision-Instruct and closed-source models such as Gemini-1.5-pro and GPT-4o-mini. At inference time it uses stage-level beam search: it generates several candidate outputs at each reasoning stage and selects the best one before moving on to the next stage.
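
To make the stage-level beam search idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names (`stage_level_beam_search`, `toy_generate`, `toy_score`), the stage labels, and the numeric scoring function are all assumptions made for illustration. A real system would call the VLM to produce candidates and would need its own way of judging which candidate is best.

```python
from typing import Callable, List

# The four stages named in the summary above (labels are illustrative).
STAGES = ("summarization", "visual_interpretation", "logical_reasoning", "conclusion")


def stage_level_beam_search(
    question: str,
    generate: Callable[[str, str, int], List[str]],
    score: Callable[[str, str], float],
    num_candidates: int = 2,
) -> List[str]:
    """Pick one output per stage, carrying the chosen outputs forward as context.

    `generate(context, stage, n)` and `score(context, candidate)` are
    hypothetical hooks standing in for the model and the selection step.
    """
    context = question
    chosen: List[str] = []
    for stage in STAGES:
        # Sample several candidate outputs for this stage only.
        candidates = generate(context, stage, num_candidates)
        # Keep the highest-scoring candidate and discard the rest.
        best = max(candidates, key=lambda c: score(context, c))
        chosen.append(best)
        # Later stages are conditioned on the winners of earlier stages.
        context = context + "\n" + best
    return chosen


def toy_generate(context: str, stage: str, n: int) -> List[str]:
    """Stand-in generator: returns n numbered drafts for the stage."""
    return [f"[{stage}] draft {i}" for i in range(n)]


def toy_score(context: str, candidate: str) -> float:
    """Stand-in scorer: prefers the draft with the highest number."""
    return float(candidate.rsplit(" ", 1)[1])


if __name__ == "__main__":
    for line in stage_level_beam_search("What is in the image?", toy_generate, toy_score, 3):
        print(line)
```

The point of selecting per stage, rather than once over complete answers, is that a weak candidate is dropped before later stages build on it.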