Sources
Additional media
Cerebras Systems has achieved a significant milestone in AI inference with the deployment of Meta's Llama 3.1 405B model, which runs at an output speed of 969 tokens per second. This is reported to be 12 times faster than OpenAI's GPT-4o and 18 times faster than Anthropic's Claude 3.5 Sonnet. The system delivers a time-to-first-token of just 240 milliseconds and supports a context length of 128,000 tokens with 16-bit weights. Cerebras is also preparing to launch a public inference endpoint, expanding access to this capability. These rapid performance gains have positioned Cerebras as a leader in the AI inference market, outpacing competitors such as AWS and Nvidia. Recent benchmarks indicate that Cerebras's Llama 3.1 405B deployment runs nearly twice as fast as the fastest GPU cloud runs a significantly smaller model, showcasing the advances in its Wafer Scale Engine technology.
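To make the reported figures concrete, the two headline numbers (240 ms time-to-first-token and 969 tokens per second of steady-state throughput) can be combined into a rough end-to-end latency estimate. The sketch below is illustrative only; the function name and the simple additive latency model are assumptions, not anything published by Cerebras.

```python
def estimated_generation_seconds(n_tokens: int,
                                 ttft_ms: float = 240.0,
                                 tokens_per_second: float = 969.0) -> float:
    """Rough end-to-end latency: time to first token plus steady-state decode.

    Assumes decoding proceeds at a constant rate after the first token,
    which ignores scheduling and network overhead.
    """
    return ttft_ms / 1000.0 + n_tokens / tokens_per_second

# A 1,000-token completion at the reported figures:
print(round(estimated_generation_seconds(1000), 2))  # → 1.27
```

Under this simple model, even a long 1,000-token answer completes in well under two seconds, which is what makes the throughput claim notable for interactive use.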
Groq has pushed Llama 3 70B inference to 3,200 tokens per second; three months ago, Llama 3 8B ran at 750 tokens per second. The pace of improvement is rapid, and the next generation of hardware is due to be released soon. ---- I share my learning journey here, join me and let's… https://t.co/x384Cl7PTR
“To put it into perspective, Cerebras ran the 405B model nearly twice as fast as the fastest GPU cloud ran the 1B model. Twice the speed on a model that is two orders of magnitude more complex.” Bonkers-level inference performance🤯 https://t.co/rJulxjEw72
969 tok/sec from @CerebrasSystems. Very impressive!🎉 https://t.co/dWhPRMYJuQ