Zyphra has released a beta of Zonos-v0.1, a new open-weight text-to-speech (TTS) model family featuring high-fidelity voice cloning and expressive speech. The release includes two variants of 1.6 billion parameters each, one built on a transformer architecture and one on a state-space model (SSM) architecture. Zonos can clone a voice from 5 to 30 seconds of reference audio and lets users adjust speaking speed, pitch, audio quality, and emotional tone. Speaker matching can be improved further by supplying a text and audio prefix. The model is reported to generate speech at roughly twice real-time speed on an RTX 4090 graphics card. Zonos also supports multilingual output.
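To make the "twice real time" figure concrete, the short Python sketch below works through the arithmetic. It does not call Zonos itself; the function names and the 10-second clip are hypothetical, chosen only for illustration, and the 2.0 speed-up is the approximate figure reported above rather than a measured result.

```python
def speedup_over_real_time(audio_seconds: float, wall_clock_seconds: float) -> float:
    """Seconds of audio produced per second of compute (above 1.0 is faster than real time)."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return audio_seconds / wall_clock_seconds


def generation_time(audio_seconds: float, speedup: float) -> float:
    """Estimated wall-clock seconds needed to synthesize a clip at a given speed-up."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# Hypothetical example: a 10-second clip at the reported ~2x speed-up.
clip_seconds = 10.0
reported_speedup = 2.0  # approximate figure from the summary, not a benchmark
print(generation_time(clip_seconds, reported_speedup))
```

Under these assumptions, a 10-second utterance would take about 5 seconds to generate, leaving headroom for streaming playback.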
Beta Release of Zonos-v0.1 - two expressive and real-time text-to-speech (TTS) models with high-fidelity voice cloning https://t.co/wfhZhgWr1j https://t.co/eXBsCijbOf
Request-for-port: Zonos in MLX / MLX Swift. Let's run high-quality, expressive TTS + voice cloning fast on-device. https://t.co/ktpizrkNYa
Zyphra’s dropping some serious tech! Two 1.6B TTS models and voice cloning with open-weights? Impressive. Definitely checking this out! 🔥🎧 https://t.co/PyxAkYZNfF