Researchers from Peking University, Tsinghua University, Peng Cheng Laboratory, Alibaba DAMO Academy, and Lehigh University have introduced LLaVA-o1, a visual language model (VLM) capable of spontaneous, systematic reasoning. LLaVA-o1 distinguishes itself by conducting autonomous multistage reasoning through four sequential stages: summarization, visual interpretation, logical reasoning, and conclusion generation, in contrast to traditional chain-of-thought prompting. The model, fine-tuned from Llama-3.2-11B-Vision-Instruct on 100,000 samples, outperforms larger and even closed-source models such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct across six multimodal reasoning benchmarks. At inference time, it employs stage-level beam search, generating multiple candidates at each reasoning stage and selecting the best one before proceeding to the next.
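To make the stage-level beam search idea concrete, here is a minimal Python sketch under stated assumptions: `generate_candidates` and `prefer` are hypothetical placeholders standing in for the fine-tuned model's sampling and its candidate-selection step, and the stage names mirror the four stages described above. This is an illustration of the general technique, not LLaVA-o1's actual implementation.

```python
# Sketch of stage-level beam search over four sequential reasoning stages.
# generate_candidates and prefer are hypothetical placeholders, not the real LLaVA-o1 API.
import random

STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def generate_candidates(context: str, stage: str, n: int) -> list[str]:
    """Placeholder: sample n candidate completions for the given stage from the model."""
    return [f"<{stage}>candidate {i} given: {context[:40]}...</{stage}>" for i in range(n)]

def prefer(candidate_a: str, candidate_b: str) -> str:
    """Placeholder: return the better of two candidates (e.g., via the model judging itself)."""
    return random.choice([candidate_a, candidate_b])

def stage_level_beam_search(question: str, beam_size: int = 4) -> str:
    """Build the answer stage by stage, keeping only the best candidate per stage."""
    context = question
    for stage in STAGES:
        candidates = generate_candidates(context, stage, beam_size)
        best = candidates[0]
        for challenger in candidates[1:]:
            best = prefer(best, challenger)  # pairwise selection of the stronger candidate
        context += "\n" + best               # the winning stage output conditions the next stage
    return context

if __name__ == "__main__":
    print(stage_level_beam_search("What fraction of the objects in the image are blue?"))
```

The key design point the sketch tries to capture is that selection happens per stage rather than over whole completions, so an error in an early stage can be pruned before it propagates into the reasoning and conclusion stages.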
🏷️:LLaVA-o1: Let Vision Language Models Reason Step-by-Step 🔗:https://t.co/ayq1E8NMJr
LLaVA-o1 is the first visual language model capable of spontaneous, systematic reasoning, similar to GPT-o1! 🤯 The 11B model outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six multimodal benchmarks. https://t.co/7g0uCAGCrF
LLaVA-o1 is a VLM designed to conduct autonomous multistage reasoning. Source: https://t.co/R3Je531zrt "Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation."…