Microsoft has released VibeVoice-1.5B, an open-source text-to-speech (TTS) model designed to generate long-form, multi-speaker conversational audio, making it particularly well suited to podcast automation. Despite the name, the full model has about 2.7 billion parameters; the "1.5B" refers to its Qwen2.5-1.5B language-model backbone, which is paired with ultra-low-rate acoustic and semantic tokenizers and a diffusion head. The model can produce up to 90 minutes of audio with up to four speakers while maintaining stable voices and natural conversational turn-taking, and it supports features such as cross-lingual speech, background music, spontaneous singing, and emotional expression. Developers have already built a working chat application on top of VibeVoice-1.5B and deployed it on platforms such as Hugging Face. The release aligns with broader trends in AI development, including low-code AI frameworks and hands-on training at upcoming conferences such as ODSC West 2025, which will cover AI engineering, agent operations, retrieval-augmented generation, and related technologies.
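On the practical side, podcast automation with a model like this mostly comes down to assembling a structured multi-speaker script before synthesis. The sketch below shows one way such input might be prepared; the Turn and build_script helpers and the "Speaker N:" line format are assumptions for illustration, not the documented VibeVoice interface (see the microsoft/VibeVoice-1.5B model card on Hugging Face for the actual inference entry points).

```python
# A minimal sketch of preparing input for a long-form, multi-speaker TTS model
# such as VibeVoice-1.5B. The Turn/build_script helpers and the "Speaker N:"
# line format are illustrative assumptions, not the documented VibeVoice API.
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str  # e.g. "Speaker 1" through "Speaker 4" (the model supports up to four)
    text: str     # what this speaker says

def build_script(turns: list[Turn]) -> str:
    """Flatten conversational turns into a plain-text script, one line per turn."""
    return "\n".join(f"{t.speaker}: {t.text}" for t in turns)

if __name__ == "__main__":
    episode = [
        Turn("Speaker 1", "Welcome back to the show."),
        Turn("Speaker 2", "Thanks! Today we're looking at open-source TTS for podcasts."),
    ]
    print(build_script(episode))
    # The script (plus per-speaker reference voice samples) would then be passed
    # to the model's own inference code to synthesize the audio.
```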
Want to build a coding agent in just 50 lines of code? Check out my live-coded demo repo — loops, tooling, editing—all in 50 lines of code. Repo + source ➡ https://t.co/rllVCjawJr #PowerShell #AI https://t.co/pyOlW86SXx
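For a sense of what fits in 50 lines: a minimal coding agent is essentially a loop that sends the conversation to a model, dispatches whatever tool call comes back (read a file, edit it, run a command), and appends the result before looping again. The linked repo does this in PowerShell; the Python sketch below illustrates the same pattern, with a hypothetical call_model helper standing in for an actual LLM API.

```python
# Illustrative agent loop: model call -> tool dispatch -> feed result back.
# call_model is a hypothetical stand-in for a real LLM API; the linked demo
# repo implements the same loop-plus-tools pattern in PowerShell.
import subprocess
from pathlib import Path

def read_file(path: str) -> str:
    return Path(path).read_text()

def write_file(path: str, content: str) -> str:
    Path(path).write_text(content)
    return "ok"

def run(cmd: str) -> str:
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return proc.stdout + proc.stderr

TOOLS = {"read_file": read_file, "write_file": write_file, "run": run}

def call_model(messages: list[dict]) -> dict:
    """Hypothetical LLM call. Expected to return either
    {"tool": name, "args": {...}} or {"answer": text}."""
    raise NotImplementedError("Wire this up to your LLM provider of choice.")

def agent(task: str, max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):                          # the loop
        reply = call_model(messages)
        if "answer" in reply:                           # model says it is done
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])  # tooling / editing
        messages.append({"role": "tool", "content": result})
    return "Stopped: step limit reached."
```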
How @PiCoreTeam's AI App Builder is changing the game even if you don't know how to code! 🕹️ https://t.co/xSwWS5srWZ
Watch @_mayurc use @zaara_ai as his AI cofounder to scope, code, and deploy a livestream clipping agent 👇 https://t.co/DHOTkjuTlC