In early June 2025, PlayAI open-sourced PlayDiffusion, a diffusion-based voice model for speech editing, under the Apache 2.0 license. PlayDiffusion enables fine-grained editing of speech audio without regenerating entire clips, preserving prosody, timing, and speaker identity. The model supports fine-grained in-painting edits and zero-shot voice cloning. Rather than generating tokens one at a time, it predicts all tokens at once and refines them over roughly 20 denoising steps, which PlayAI says makes the generation steps up to 50 times more efficient than in traditional autoregressive models. Zoom has expanded its AI Companion with over 45 new features, including improved real-time transcription, automated meeting summaries, action item detection, multilingual support, and deeper integration with business applications. These enhancements have contributed to an upward revision of Zoom's annual profit forecasts. AssemblyAI released a new real-time transcription model that offers greater speed and accuracy for speech-to-text applications. Speak AI, which reports more than 200,000 users, has improved speaker identification and labeling for meetings and now supports over 100 languages. Speak AI claims 95%+ transcription accuracy, 80%+ time savings, and a 4.9 G2 rating, and integrates with platforms such as Zoom, Microsoft Teams, Webex, and Google Meet. Together, these developments show how quickly advanced AI is being built into speech editing, transcription, and meeting productivity tools for professional and research use.
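Play AI has released a diffusion-based text-to-speech model. Its main advantage is that it can modify a single word within a generated passage. Unlike autoregressive (AR) models, which generate tokens one at a time, it predicts all tokens at once and refines them over about 20 denoising steps. This makes the generation steps up to 50 times more efficient, with no loss in quality. https://t.co/J0B2F5G05p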
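The "up to 50 times" figure can be read as a ratio of forward passes. A back-of-the-envelope sketch, assuming a hypothetical clip of 1,000 audio tokens (the clip length is an illustrative assumption, not a number from the post) and the roughly 20 denoising steps mentioned above:

```python
def autoregressive_passes(num_tokens: int) -> int:
    """An AR model needs one forward pass per generated token."""
    return num_tokens


def diffusion_passes(denoise_steps: int = 20) -> int:
    """A diffusion model predicts all tokens in each pass, so the
    number of passes equals the number of denoising steps."""
    return denoise_steps


def step_speedup(num_tokens: int, denoise_steps: int = 20) -> float:
    """Ratio of AR passes to diffusion passes for the same clip."""
    return autoregressive_passes(num_tokens) / diffusion_passes(denoise_steps)


# Assumed clip length of 1,000 tokens (hypothetical, for illustration only).
print(step_speedup(1000))
```

This counts passes only; it says nothing about the cost of each pass or about wall-clock time, and shorter clips give a smaller ratio.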
My voice sounds better to me recorded than in my own head. I guess for most people it's the opposite.
Most voice models can generate speech, but can't edit it. PlayDiffusion is a diffusion-based voice model built for editing speech. Instead of re-generating entire clips, it masks the target span and rewrites just that. It keeps prosody, timing, and speaker identity intact. https://t.co/DbIl7sMHfY
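The mask-and-rewrite idea can be sketched with a toy example. This is not PlayDiffusion's code or API: it uses a generic confidence-ordered unmasking loop (in the style of masked-token decoders), and `inpaint_span`, `toy_predict`, and the token strings are all hypothetical stand-ins for a real model and real audio tokens.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

MASK = None  # placeholder for a masked position
Token = Optional[str]
Predictor = Callable[[Sequence[Token], int], Tuple[str, float]]


def inpaint_span(tokens: Sequence[str], start: int, end: int,
                 predict: Predictor, steps: int = 4) -> List[Token]:
    """Rewrite only tokens[start:end]; all other tokens are left untouched."""
    out: List[Token] = list(tokens)
    for i in range(start, end):
        out[i] = MASK
    masked = list(range(start, end))
    for step in range(steps):
        if not masked:
            break
        # One "pass": propose a (token, confidence) for every masked position.
        proposals = {i: predict(out, i) for i in masked}
        # Commit the most confident proposals; the final step commits the rest.
        keep = math.ceil(len(masked) / (steps - step))
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:keep]
        for i in best:
            out[i] = proposals[i][0]
        masked = [i for i in masked if i not in best]
    return out


def toy_predict(context: Sequence[Token], i: int) -> Tuple[str, float]:
    """Hypothetical stand-in for a model: returns a dummy token and confidence."""
    return f"new{i}", 1.0 / (1 + i)


original = ["a", "b", "c", "d", "e"]
edited = inpaint_span(original, 1, 3, toy_predict, steps=2)
print(edited)  # ['a', 'new1', 'new2', 'd', 'e']
```

Because positions outside the masked span are never touched, the surrounding context survives the edit, which is the property the post attributes to the real model.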