Recent advances in large language model (LLM) software and hardware acceleration have significantly improved AI inference speed and capability. The SGLang team demonstrated DeepSeek R1 (671B parameters) running on NVIDIA's GB200 NVL72 rack-scale system at 7,583 tokens per second per GPU, a 2.7x improvement over the H100. The gain was enabled by prefill-decode disaggregation and large-scale expert parallelism. Separately, Groq has integrated its inference technology into the Hugging Face Playground and API, enabling faster execution of state-of-the-art models such as Llama 4 and Qwen 3. Groq is currently the only provider serving Qwen3-32B's full 131,000-token context window at real-time speeds, priced at $0.29 per million input tokens and $0.59 per million output tokens. These developments are expected to lower cost per token and accelerate the deployment of AI applications such as smarter agents and real-time copilots. Finally, new tools like Davia let models such as ChatGPT and Claude generate complete web applications directly within chat interfaces, boosting developer productivity.
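Since the digest mentions Groq's availability through the Hugging Face API, here is a minimal sketch of what such a call might look like, using the huggingface_hub InferenceClient with Groq selected as the provider; the model ID, environment variable, and prompt are illustrative assumptions rather than details taken from the posts below.

```python
import os

from huggingface_hub import InferenceClient

# Sketch only: route a chat completion to Groq via Hugging Face's
# Inference Providers. HF_TOKEN and the prompt are assumptions.
client = InferenceClient(
    provider="groq",                 # serve the request on Groq hardware
    api_key=os.environ["HF_TOKEN"],  # Hugging Face access token
)

response = client.chat_completion(
    model="Qwen/Qwen3-32B",          # Qwen3-32B, the model cited in the digest
    messages=[{"role": "user", "content": "Summarize this week's inference news."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```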
Live on Hugging Face, we’re the only inference provider delivering Qwen3-32B’s full 131K context window at real-time speeds — for just $0.29/M input tokens, $0.59/M output. 👇 Full story in comments https://t.co/9M9eoxHs4s
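At the quoted rates, the cost of a long-context request is easy to estimate. The sketch below works through a hypothetical call that uses most of the 131K context window; the request size is an assumption chosen purely for illustration.

```python
INPUT_PRICE_PER_M = 0.29   # USD per million input tokens (from the post above)
OUTPUT_PRICE_PER_M = 0.59  # USD per million output tokens

# Hypothetical request: 120,000 input tokens, 2,000 output tokens.
input_tokens, output_tokens = 120_000, 2_000
cost = (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
print(f"${cost:.4f}")  # ~$0.0360 for the whole long-context call
```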
Groq just made Hugging Face way faster — and it’s coming for AWS and Google https://t.co/rLRGJ79C0X
.@lmsysorg (SGLang) now achieves 7,583 tokens per second per GPU running @deepseek_ai R1 on the GB200 NVL72, a 2.7x leap over H100. We're excited to see the open source ecosystem advance inference optimizations on GB200 NVL72, driving down cost per token for the industry at https://t.co/oAXZxdyKuF
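For a sense of scale, the per-GPU figure can be extrapolated to a full rack. The sketch below assumes the 7,583 tok/s/GPU number scales linearly across the 72 Blackwell GPUs in a GB200 NVL72, which the post itself does not claim; it is a back-of-envelope illustration only.

```python
# Back-of-envelope extrapolation from the quoted per-GPU decode throughput.
TOKENS_PER_SEC_PER_GPU = 7_583          # reported on GB200 NVL72
SPEEDUP_VS_H100 = 2.7                   # reported leap over H100
GPUS_PER_NVL72_RACK = 72                # Blackwell GPUs in one NVL72 rack

h100_tokens_per_sec = TOKENS_PER_SEC_PER_GPU / SPEEDUP_VS_H100      # ~2,800 tok/s/GPU implied baseline
rack_tokens_per_sec = TOKENS_PER_SEC_PER_GPU * GPUS_PER_NVL72_RACK  # ~546k tok/s, assuming linear scaling
print(f"H100 implied baseline: ~{h100_tokens_per_sec:,.0f} tok/s/GPU")
print(f"Full-rack aggregate: ~{rack_tokens_per_sec:,} tok/s")
```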