InternVL3: to get the 6B-parameter vision encoder you need the variants above 32B. The InternVL series is among the best open-source VLMs; there is almost no open model stronger than it, aside from some censorship issues. https://t.co/AHgicrcLUB
What's special in @opengvlab's new InternVL3 multimodal LLM? ▪️ Modular ViT-MLP-LLM architecture: • Vision Encoder (ViT) – InternViT: inputs undergo pixel unshuffle, reducing the visual token count to 1/4 for faster processing and lower memory use. It supports multi-image and video inputs. https://t.co/2D22pIBuL9
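The 1/4 reduction comes from a space-to-depth style pixel unshuffle: each 2×2 block of neighbouring patch tokens is folded into the channel dimension before the projector hands the result to the LLM. A minimal PyTorch sketch of the idea (function name, shapes, and token ordering here are illustrative assumptions, not InternVL3's exact implementation):

```python
import torch

def pixel_unshuffle_tokens(x: torch.Tensor, h: int, w: int, r: int = 2) -> torch.Tensor:
    """Merge each r x r block of visual tokens into one token with r*r times
    the channels, cutting the token count by r**2 (1/4 for r=2).

    x: (batch, h*w, c) patch tokens from the vision encoder.
    Returns: (batch, (h//r)*(w//r), c*r*r).
    """
    b, n, c = x.shape
    assert n == h * w and h % r == 0 and w % r == 0
    x = x.view(b, h, w, c)                          # back to a spatial grid
    x = x.view(b, h // r, r, w // r, r, c)          # split the grid into r x r blocks
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()    # group each block's tokens together
    x = x.view(b, (h // r) * (w // r), c * r * r)   # fold blocks into the channel dim
    return x

# Example: 1024 ViT patch tokens (32x32 grid) -> 256 tokens for the LLM.
tokens = torch.randn(1, 32 * 32, 1024)
merged = pixel_unshuffle_tokens(tokens, h=32, w=32)
print(merged.shape)  # torch.Size([1, 256, 4096])
```

With a 2×2 merge the channel width quadruples, which the MLP projector in a ViT-MLP-LLM stack then maps down to the LLM's hidden size, so the language model sees a quarter as many visual tokens per image.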
InternVL3 is out 💥 > 7 ckpts with various sizes (1B to 78B) > Built on the InternViT encoder and Qwen2.5VL decoder, improving on Qwen2.5VL in many areas > Can do reasoning and document tasks, extending to tool use and agentic capabilities 🤖 > easy to use with transformers 🤗 https://t.co/PxPsAu3ufe
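A hedged sketch of what "use with transformers" typically looks like for these checkpoints. Assumptions: the repo id follows the OpenGVLab/InternVL3-{size} pattern on the Hub, the chat() helper is supplied by the checkpoint's remote code (so its exact arguments may differ), and the random pixel_values tensor stands in for the image preprocessing shown on the model card:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Repo id assumed to follow the OpenGVLab/InternVL3-{size} naming on the Hub.
model_id = "OpenGVLab/InternVL3-8B"

model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,      # chat() comes from the repo's remote code
    device_map="auto",
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Placeholder for one preprocessed 448x448 image tile; in practice use the
# dynamic-resolution preprocessing helper from the model card.
pixel_values = torch.rand(1, 3, 448, 448, dtype=torch.bfloat16).to(model.device)

question = "<image>\nSummarize this document."
response = model.chat(tokenizer, pixel_values, question,
                      generation_config=dict(max_new_tokens=256))
print(response)
```

The same pattern should apply across the 1B–78B checkpoints; only the repo id and the hardware needed for device_map="auto" change.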
Shanghai AI Lab has released InternVL3, a new multimodal large language model (MLLM) designed to improve perception and reasoning over its predecessor, InternVL 2.5. InternVL3 features a modular architecture whose Vision Encoder (InternViT) can process multi-image and video inputs. The model comes in sizes ranging from 1 billion to 78 billion parameters and is built on the InternViT encoder and Qwen2.5VL decoder. It aims to improve performance on reasoning and document tasks and to extend to tool use and agentic capabilities. The model is also compatible with the Hugging Face transformers library, making it easy to use. The release is part of a broader effort to test and compare open-source vision-language models (VLMs) in a playground environment, where models like PaliGemma and DeepSeek-VL will also be added soon. However, the development team faced challenges, including a ban from OpenAI regarding the use of GPT-4o for