A series of recent publications highlights advances in speech synthesis and audio processing. Notably, researchers from Alibaba have introduced 'CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models,' which builds multilingual speech synthesis on supervised discrete speech tokens and adopts finite-scalar quantization to improve codebook utilization, with the goal of supporting real-time, interactive applications. Other notable works include 'CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder' and 'JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music Generation.' These papers reflect ongoing work on new methods for voice and music synthesis.
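To make the finite-scalar quantization (FSQ) point concrete: the general idea is to bound each latent dimension and round it to a small fixed number of levels, so the codebook is implicit (the product of the per-dimension level counts) and every code is reachable by construction. The following is a minimal NumPy sketch of that general FSQ forward pass, not CosyVoice 2's actual tokenizer; the function name, the odd level counts (chosen to sidestep the half-shift the FSQ paper uses for even level counts), and the omission of the straight-through estimator needed for training are all simplifying assumptions.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Sketch of finite-scalar quantization (FSQ), forward pass only.

    Each latent dimension is squashed into a bounded range and rounded to
    one of `levels[i]` values; the implicit codebook size is the product
    of the levels, so codebook utilization is full by construction.
    Assumes odd level counts for simplicity (illustrative, not the
    CosyVoice 2 implementation).
    """
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half           # per-dim values in (-half, half)
    quantized = np.round(bounded)         # nearest integer level per dim
    # Collapse the per-dimension levels into a single integer code index.
    offsets = quantized + half            # shift to 0 .. levels-1
    strides = np.cumprod(np.concatenate(([1], levels[:-1])))
    codes = (offsets * strides).sum(axis=-1).astype(int)
    return quantized / half, codes        # normalized values and code ids

# Example: a 4-dim latent with 7 levels per dim gives a 7**4 = 2401-entry
# implicit codebook.
z = np.random.randn(2, 4)
values, codes = fsq_quantize(z, levels=[7, 7, 7, 7])
print(values.shape, codes)
```

Because no learned codebook vectors are involved, there is nothing to collapse or leave unused, which is the property the summary above refers to as improved codebook utilization.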
``CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations,'' Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng, https://t.co/RLyOAGCH9C
``Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction,'' Ye-Xin Lu, Yang Ai, Hui-Peng Du, Zhen-Hua Ling, https://t.co/9cafsb6jFb
``Hidden Echoes Survive Training in Audio To Audio Generative Instrument Models,'' Christopher J. Tralie, Matt Amery, Benjamin Douglas, Ian Utz, https://t.co/s0KA71aoOs