OpenAI Whisper deployment on @huggingface Inference Endpoints, achieving up to 8x faster transcriptions. The new Whisper endpoint uses vLLM for inference, targeting NVIDIA GPUs (Ada Lovelace cards like the L4 & L40S). Optimizations include PyTorch compilation (torch.compile) for https://t.co/0HiyCWaM7G https://t.co/5BhmdKnloZ
Transcribing 1 hour of audio for less than $0.01 🤯 The @huggingface team cooked with 8x faster Whisper speech recognition - @OpenAI whisper-large-v3-turbo transcribes at 100x real time on a $0.80/hr L4 GPU! https://t.co/JvEjIlH8bL
Wow, 8x faster transcription with Whisper Large V3! https://t.co/e6hWF0jOyH
Hugging Face has introduced a new deployment of OpenAI's Whisper speech recognition model on its Inference Endpoints platform, achieving transcription speeds up to eight times faster than previous versions. This enhancement leverages the vLLM project for optimized inference on NVIDIA GPUs such as the Ada Lovelace-architecture L4 and L40S, incorporating PyTorch compilation (torch.compile). The whisper-large-v3-turbo model transcribes audio at 100 times real-time speed on a $0.80-per-hour L4 GPU, enabling the transcription of one hour of audio for less than one cent. These improvements significantly reduce latency and cost without requiring any configuration changes from users. Additionally, PortkeyAI has rolled out optimizations on its Gateway that also reduce latency for OpenAI embeddings, further enhancing response times.
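The sub-cent cost figure follows directly from the quoted numbers (100x real-time throughput on a $0.80/hr L4). A quick back-of-the-envelope check in Python; the function name is ours, not from the announcement:

```python
# Sanity-check the cost claim: at 100x real time, one hour of audio
# consumes 36 s of GPU time, so the cost is far below one cent.
# Inputs (realtime factor, GPU price) are taken from the announcement.

def transcription_cost_usd(audio_seconds: float,
                           realtime_factor: float,
                           gpu_price_per_hour: float) -> float:
    """Dollar cost to transcribe `audio_seconds` of audio."""
    gpu_seconds = audio_seconds / realtime_factor
    return gpu_price_per_hour * gpu_seconds / 3600.0

cost = transcription_cost_usd(audio_seconds=3600,
                              realtime_factor=100,
                              gpu_price_per_hour=0.80)
print(f"${cost:.4f} per hour of audio")  # $0.0080 -> "less than $0.01"
```

At $0.008 per audio-hour, even a 2–3x lower effective throughput in practice would still land under the one-cent mark claimed above.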