"Even compared to other cloud installs of Gemma the Groq installation is impressively fast. It beats out ChatGPT, Claude 3 or Gemini in response time" -- @RyanMorrisonJer https://t.co/UBPHJwzpE1
"500 Tokens a Second - Google SWOT in 0.45 Seconds vs 30 Seconds. We tested Groq, comparing its speed against our own SWOT Summary Prompts (DM me if you want the Prompt). Groq vs. Anthropic's New Claude 3 Opus🤺 @Flowise_AI has just integrated @GroqInc into their platform, and…
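The numbers quoted in the tweet above imply a rough back-of-the-envelope comparison. Taking the quoted 500 tokens/s and 0.45 s vs 30 s figures at face value (they are claims from the tweet, not independent measurements), the arithmetic works out as follows:

```python
# Figures quoted in the tweet (assumed accurate, not independently verified).
groq_tokens_per_s = 500
groq_time_s = 0.45
baseline_time_s = 30.0

# Tokens generated in the quoted 0.45 s run at 500 tokens/s.
output_tokens = groq_tokens_per_s * groq_time_s

# End-to-end speedup over the quoted 30 s baseline.
speedup = baseline_time_s / groq_time_s

print(f"~{output_tokens:.0f} tokens in {groq_time_s} s; ~{speedup:.0f}x faster end-to-end")
```

So the quoted run corresponds to roughly a 225-token completion and about a 67x end-to-end speedup over the 30-second baseline.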
Minimizing latency for LLM serving is challenging. You have to optimize model inference to use GPUs efficiently ⚡. Anyscale has partnered with @NVIDIAAI to enable 2.4X more queries per second 🚀 for a #generativeAI image-generation workload. Learn the details in the blog and try…

Groq Inc. has unveiled a Gemma 7B API, achieving a record 814 tokens/s throughput among LLM inference APIs, priced at $0.10 per million tokens. Separately, a new speculative decoding framework, Sequoia, can serve Llama2-70B on a single RTX 4090 at roughly half a second per token. Sequoia scales to large speculation budgets and adapts to different hardware, aiming to run Llama2-70B on a single consumer GPU with up to an 8x speedup. Anyscale has partnered with NVIDIA to optimize model inference for a generative AI image-generation workload, achieving 2.4X more queries per second.
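Sequoia builds on speculative decoding: a small draft model proposes several tokens cheaply, and the large target model verifies them in one pass, accepting or rejecting each proposal so the output distribution matches the target model exactly. A minimal sketch of the classic accept/reject rule (which Sequoia generalizes to trees of drafts) is below; the toy `draft_model` and `target_model` functions are stand-ins for illustration, not Sequoia's actual components:

```python
import random

random.seed(0)

# Toy vocabulary; real systems operate over a tokenizer's full vocabulary.
VOCAB = ["the", "cat", "sat", "on", "mat"]

def draft_model(context):
    # Fast, less accurate proposal distribution (uniform here for simplicity).
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def target_model(context):
    # Slow, accurate distribution: strongly prefers one continuation per word.
    prefs = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}
    dist = {t: 0.02 for t in VOCAB}
    dist[prefs.get(context[-1], "the")] = 1.0 - 0.02 * (len(VOCAB) - 1)
    return dist

def speculative_step(context, k=4):
    """Draft up to k tokens cheaply, then verify against the target model.

    Each drafted token x is accepted with probability
    min(1, p_target(x) / p_draft(x)); on rejection, one token is resampled
    from the target's residual distribution and drafting stops. This keeps
    the output distribution identical to sampling the target model alone.
    """
    accepted = []
    ctx = list(context)
    for _ in range(k):
        q = draft_model(ctx)
        x = random.choices(list(q), weights=list(q.values()))[0]
        p = target_model(ctx)
        if random.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
            ctx.append(x)
        else:
            # Rejected: resample from the normalized residual max(p - q, 0).
            residual = {t: max(p[t] - q[t], 0.0) for t in VOCAB}
            total = sum(residual.values())
            if total > 0:
                x = random.choices(list(residual), weights=list(residual.values()))[0]
            accepted.append(x)
            ctx.append(x)
            break
    return accepted

print(speculative_step(["the"]))
```

The payoff is that one expensive target-model verification can confirm several cheap draft tokens at once, which is how frameworks like Sequoia fit Llama2-70B-class serving onto a single consumer GPU.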
