Super interesting. Ichigo-Llama3.1: Local Real-Time Voice AI. Demo on a single NVIDIA 3090 GPU. 🍓 Ichigo Overview: - Latest llama3-s checkpoint - Early-fusion audio and text multimodal model - Open-source codebase, data, and weights https://t.co/Yy0mr1P5ka
Multimodal Ichigo Llama 3.1 - Real Time Voice AI 🔥 We bring 2 key improvements to Ichigo: - It can talk back (Yes!) - It recognizes when it can't comprehend input By the way, it's open source! GitHub: https://t.co/QJ3SwIBAXg https://t.co/sXO9MemQfd
Multimodal Ichigo Llama 3.1 - Real Time Voice AI 🔥 > WhisperSpeech X Llama 3.1 8B > Trained on 50K hours of speech (7 languages) > Continually trained for 45 hrs on 10x A1000s > MLS -> WhisperVQ tokens -> Llama 3.1 > Instruction tuned on 1.89M samples > 70% speech, 20%… https://t.co/94fKnAnSp5
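A minimal sketch of what the "MLS -> WhisperVQ tokens -> Llama 3.1" step implies: audio from the Multilingual LibriSpeech corpus is quantized into discrete codebook ids, which are then spelled as special tokens so they can live in an ordinary text corpus. The quantizer below is a crude stand-in (the real pipeline uses the WhisperVQ encoder from WhisperSpeech), and the `<|sound_NNNN|>` token spelling and codebook size are illustrative assumptions, not the project's confirmed format.

```python
# Hedged sketch: audio -> discrete codebook ids -> sound tokens in a text stream.
from typing import List

def quantize_audio_stub(waveform: List[float], codebook_size: int = 512) -> List[int]:
    """Placeholder for the WhisperVQ encoder: maps raw samples to codebook ids.
    A real encoder runs the audio through Whisper's encoder plus a VQ layer and
    emits far fewer ids per second than this crude downsample."""
    return [hash(round(x, 3)) % codebook_size for x in waveform[::160]]

def codes_to_text(codes: List[int]) -> str:
    """Render codebook ids as special tokens so speech can sit in a plain text corpus."""
    return "<|sound_start|>" + "".join(f"<|sound_{c:04d}|>" for c in codes) + "<|sound_end|>"

# Each utterance becomes a line of sound tokens (optionally paired with its transcript)
# that the Llama 3.1 base model is then continually trained / instruction tuned on.
example = codes_to_text(quantize_audio_stub([0.01, -0.02, 0.03] * 1000))
print(example[:80], "...")
```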
A new model, LLaMA-Omni, has been introduced to rival GPT-4o for real-time speech interaction with large language models (LLMs). It generates both text and speech directly from speech instructions, with response latency as low as 226 milliseconds. Separately, Ichigo-Llama3.1, a local real-time voice AI, has been unveiled with two key enhancements: it can now talk back, and it recognizes when it cannot comprehend its input. Ichigo runs on a single NVIDIA 3090 GPU and is fully open source (codebase, data, and weights). It builds on the latest llama3-s checkpoint and was trained on 50,000 hours of speech across seven languages. The model uses an early-fusion audio-text multimodal approach, feeding WhisperVQ speech tokens directly into Llama 3.1, and has been instruction-tuned on 1.89 million samples.
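The early-fusion approach described above amounts to letting speech and text share a single token stream: the audio codebook is added to the text tokenizer's vocabulary and the embedding matrix is grown to match. Here is a minimal sketch under stated assumptions: the base checkpoint is the public (gated) Llama 3.1 8B Instruct repo, and the sound-token spelling and 512-entry codebook are illustrative, not Ichigo's confirmed values. Only the Hugging Face `transformers` calls are standard API.

```python
# Hedged sketch of early fusion: extend the Llama 3.1 vocabulary with sound tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumes access to the gated repo

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

# One new token per codebook entry, plus start/end markers for an audio span.
sound_tokens = [f"<|sound_{i:04d}|>" for i in range(512)] + ["<|sound_start|>", "<|sound_end|>"]
tokenizer.add_tokens(sound_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# After this, a training example can interleave speech and text in one sequence, e.g.
# "<|sound_start|><|sound_0042|>...<|sound_end|> Transcribe or answer the question above."
```

With the vocabulary extended this way, the instruction-tuning mix mentioned in the tweets (roughly 70% speech samples) is just ordinary causal-LM training over these mixed sequences.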