Recent advances in large language model (LLM) software and hardware acceleration have significantly improved AI inference speed and capability. The SGLang team demonstrated DeepSeek R1 (671B parameters) running on NVIDIA's GB200 NVL72 rack-scale system at 7,583 tokens per second per GPU, a 2.7x improvement over the H100. The gain was enabled by prefill-decode disaggregation and large-scale expert parallelism. Separately, Groq has integrated its inference technology into the Hugging Face Playground and API, enabling faster execution of state-of-the-art models such as Llama 4 and Qwen 3. Groq is currently the only provider serving Qwen3-32B's full 131,000-token context window at real-time speeds, priced at $0.29 per million input tokens and $0.59 per million output tokens. These developments are expected to lower cost per token and accelerate the deployment of AI applications such as smarter agents and real-time copilots. Finally, new tools like Davia let models such as ChatGPT and Claude generate complete web applications directly within chat interfaces, boosting developer productivity.
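Since the digest mentions Groq's availability through the Hugging Face API, here is a minimal sketch of what such a call might look like, using the huggingface_hub InferenceClient with Groq selected as the provider; the model ID, environment variable, and prompt are illustrative assumptions rather than details taken from the posts below.

```python
import os

from huggingface_hub import InferenceClient

# Sketch only: route a chat completion to Groq via Hugging Face's
# Inference Providers. HF_TOKEN and the prompt are assumptions.
client = InferenceClient(
    provider="groq",                 # serve the request on Groq hardware
    api_key=os.environ["HF_TOKEN"],  # Hugging Face access token
)

response = client.chat_completion(
    model="Qwen/Qwen3-32B",          # Qwen3-32B, the model cited in the digest
    messages=[{"role": "user", "content": "Summarize this week's inference news."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```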
Live on Hugging Face, we’re the only inference provider delivering Qwen3-32B’s full 131K context window at real-time speeds — for just $0.29/M input tokens, $0.59/M output. 👇 Full story in comments https://t.co/9M9eoxHs4s
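At the quoted rates, the cost of a long-context request is easy to estimate. The sketch below works through a hypothetical call that uses most of the 131K context window; the request size is an assumption chosen purely for illustration.

```python
INPUT_PRICE_PER_M = 0.29   # USD per million input tokens (from the post above)
OUTPUT_PRICE_PER_M = 0.59  # USD per million output tokens

# Hypothetical request: 120,000 input tokens, 2,000 output tokens.
input_tokens, output_tokens = 120_000, 2_000
cost = (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
print(f"${cost:.4f}")  # ~$0.0360 for the whole long-context call
```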
Groq just made Hugging Face way faster — and it’s coming for AWS and Google https://t.co/rLRGJ79C0X
.@lmsysorg (SGLang) now achieves 7,583 tokens per second per GPU running @deepseek_ai R1 on the GB200 NVL72, a 2.7x leap over H100. We're excited to see the open source ecosystem advance inference optimizations on GB200 NVL72, driving down cost per token for the industry at https://t.co/oAXZxdyKuF
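For a sense of scale, the per-GPU figure can be extrapolated to a full rack. The sketch below assumes the 7,583 tok/s/GPU number scales linearly across the 72 Blackwell GPUs in a GB200 NVL72, which the post itself does not claim; it is a back-of-envelope illustration only.

```python
# Back-of-envelope extrapolation from the quoted per-GPU decode throughput.
TOKENS_PER_SEC_PER_GPU = 7_583          # reported on GB200 NVL72
SPEEDUP_VS_H100 = 2.7                   # reported leap over H100
GPUS_PER_NVL72_RACK = 72                # Blackwell GPUs in one NVL72 rack

h100_tokens_per_sec = TOKENS_PER_SEC_PER_GPU / SPEEDUP_VS_H100      # ~2,800 tok/s/GPU implied baseline
rack_tokens_per_sec = TOKENS_PER_SEC_PER_GPU * GPUS_PER_NVL72_RACK  # ~546k tok/s, assuming linear scaling
print(f"H100 implied baseline: ~{h100_tokens_per_sec:,.0f} tok/s/GPU")
print(f"Full-rack aggregate: ~{rack_tokens_per_sec:,} tok/s")
```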