The Kokoro text-to-speech (TTS) model, with only 82 million parameters, has been released as open source. The model is noted for its high-quality audio output and efficiency: it can generate two minutes and twenty-five seconds of speech in just 4.5 seconds on a T4 GPU. Despite its small size, Kokoro reportedly outperforms larger models, and its low resource requirements make it accessible for widespread use. It currently supports English only, but its architecture allows training for other languages with less than 100 hours of audio data. The release has been met with enthusiasm for its potential to simplify the creation of high-quality, faceless AI videos.
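As a rough sanity check on the throughput claim, the quoted figures imply a real-time factor of roughly 32x. The sketch below assumes the "two minutes and twenty-five seconds" of audio is exactly 145 seconds; the function name is illustrative, not part of any Kokoro API.

```python
# Rough real-time-factor check for the quoted Kokoro throughput figures.
# Assumes the stated audio length is exactly 145 s and generation took
# 4.5 s of wall-clock time on a T4 GPU, per the announcement.

def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    return audio_seconds / generation_seconds

audio_s = 2 * 60 + 25   # 145 s of speech
gen_s = 4.5             # reported generation time on a T4
rtf = real_time_factor(audio_s, gen_s)
print(f"{rtf:.1f}x faster than real time")  # ≈ 32.2x
```

In other words, at these numbers one second of compute yields about half a minute of speech, which is what makes the model attractive for batch narration workloads.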
TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer. https://t.co/006nY5af92
🥳Mini-InternVL has been accepted by Visual Intelligence! The Mini-InternVL series of #MLLMs, with parameters ranging from 1B to 4B, achieves 90% of the performance using only 5% of the parameters. This significant efficiency and performance boost makes our model more accessible… https://t.co/tGVyR5iinb
💥 Introducing MiniCPM-o 2.6: an 8B-parameter, GPT-4o-level omni model that runs on device ✨ Highlights: ~Matches GPT-4o-202405 in vision, audio, and multimodal live streaming ~End-to-end real-time bilingual audio conversation ~Voice cloning & emotion control ~Advanced OCR & video… https://t.co/OMJeXUZSvs