Hertz-Dev and Fish Agent v0.1 are two newly released open-source models for conversational AI and voice cloning. Hertz-Dev, developed by Standard Intelligence, is an 8.5-billion-parameter transformer model designed for real-time conversational audio, with a theoretical latency of 80 milliseconds and a real-world latency of 120 milliseconds on a single NVIDIA RTX 4090. It was trained on 20 million hours of high-quality audio data and is released as a base model, without fine-tuning or reinforcement learning. Fish Agent v0.1, released by Fish Audio, is a more compact 3-billion-parameter model that supports zero-shot voice cloning. It was trained on 700,000 hours of multilingual audio, accepts both text and audio input, and reports a time-to-first-audio of 200 milliseconds. Hertz-Dev targets low-latency conversational AI, while Fish Agent v0.1 targets multilingual voice synthesis.
Fish Agent v0.1 3B Released: A Groundbreaking Voice-to-Voice Model Capable of Capturing and Generating Environmental Audio Information with Unprecedented Accuracy https://t.co/S7BzfIixIR #VoiceToVoice #AIInnovation #SpeechSynthesis #TTSChallenges #MultilingualAI #ai #news #ll… https://t.co/S5Ta5OfMEl
Fish Agent v0.1 3B Released: A Groundbreaking Voice-to-Voice Model Capable of Capturing and Generating Environmental Audio Information with Unprecedented Accuracy. The Fish Audio Team has recently unveiled Fish Agent v0.1 3B, an innovative solution designed to address these… https://t.co/Fhadbdt0ga
Wow! New Speech to Speech model - Fish Agent v0.1 3B by @FishAudio 🔥
> Trained on 700K hours of multilingual audio
> Continue-pretrained version of Qwen-2.5-3B-Instruct for 200B audio & text tokens
> Zero-shot voice cloning
> Text + audio input / Audio output
> Ultra-fast… https://t.co/UvdwxGUm4w