Ultravox, a new open-source multimodal language model developed by FixieAI, is gaining attention for its real-time voice capabilities. The latest release, Ultravox v0.4.1, is an 8-billion-parameter model whose speech-understanding performance reportedly approaches that of GPT-4o. It ingests both text and human speech directly, with no separate automatic speech recognition (ASR) stage. Checkpoints are pre-trained on Llama 3.1 8B and 70B backbones with a Whisper-based audio encoder, and the model currently supports text output, with a time to first response of roughly 150 milliseconds. The checkpoints are MIT-licensed, so in principle any LLM can be paired with the audio encoder by training a new adapter, making the model accessible for further development.
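The core idea behind this class of audio LLMs can be sketched in a few lines: a frozen audio encoder (Whisper-style) turns speech into a sequence of embeddings, and a small trained projector maps those embeddings into the LLM's token-embedding space, so speech is consumed like ordinary tokens with no intermediate transcript. The toy sketch below illustrates only the data flow; all dimensions, function names, and the linear projector are illustrative assumptions, not Ultravox's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

D_AUDIO = 1280   # hypothetical audio-encoder hidden size
D_LLM = 4096     # hypothetical LLM embedding size

def audio_encoder(waveform: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen Whisper-style encoder: one vector per frame."""
    n_frames = len(waveform) // 320          # pretend 320 samples per frame
    return rng.standard_normal((n_frames, D_AUDIO))

# The trainable piece: a projector from audio space into LLM embedding space.
W_proj = rng.standard_normal((D_AUDIO, D_LLM)) * 0.01

def project_audio(audio_feats: np.ndarray) -> np.ndarray:
    # (n_frames, D_AUDIO) @ (D_AUDIO, D_LLM) -> (n_frames, D_LLM)
    return audio_feats @ W_proj

# Splice the projected audio "tokens" between text-token embeddings,
# where an audio placeholder would sit in the prompt.
text_before = rng.standard_normal((5, D_LLM))   # e.g. system/prompt tokens
text_after = rng.standard_normal((3, D_LLM))    # e.g. trailing instruction

speech = rng.standard_normal(16000)             # 1 s of fake 16 kHz audio
audio_tokens = project_audio(audio_encoder(speech))

llm_input = np.concatenate([text_before, audio_tokens, text_after], axis=0)
print(llm_input.shape)   # (5 + 50 + 3, 4096) -> (58, 4096)
```

Because only the projector (and optionally a LoRA-style adapter on the LLM) needs training while the encoder and backbone stay frozen, swapping in a different LLM is comparatively cheap, which is what makes the "pick any LLM, train an adapter" recipe practical.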