Hugging Face has announced the release of TGI v3, a significant upgrade to its Text Generation Inference server. The new version can process three times more tokens than before, handling up to 30,000 tokens on a single NVIDIA L4 GPU when serving the Llama 3.1-8B model. On long prompts, TGI v3 also runs up to 13 times faster than vLLM, a competing inference engine: it processes a 200,000-token prompt in roughly two seconds, compared to 27.5 seconds for vLLM. The upgrade also emphasizes ease of use, requiring zero configuration, since the launch flags have been tuned to sensible defaults. This release represents a notable advance in large language model inference, improving both performance and efficiency.
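The "zero configuration" claim means the server can be started with little more than a model id. A minimal launch sketch based on TGI's standard Docker workflow — the image tag, port, volume path, and token variable here are illustrative assumptions, not taken from the announcement:

```shell
# Launch TGI v3 with default settings; flags are auto-tuned in v3,
# so only the model id is strictly required.
# HF_TOKEN is assumed to hold a Hugging Face access token (needed for gated models).
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v "$PWD/data:/data" \
  -e HF_TOKEN="$HF_TOKEN" \
  ghcr.io/huggingface/text-generation-inference:3.0.0 \
  --model-id meta-llama/Llama-3.1-8B-Instruct

# Once the server is up, a generation request can be sent to the /generate endpoint:
curl 127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is Text Generation Inference?", "parameters": {"max_new_tokens": 50}}'
```

Note that no batching, memory, or prefix-caching flags appear above: in v3 those are derived automatically from the available hardware, which is what the release notes mean by zero configuration.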
3x more tokens and 13x faster generations than vLLM? 👀 @huggingface TGI 3.0 released! 🎉TGI 3.0 dramatically improves LLM inference, processing 3x more input tokens and running 13x faster than vLLM on long prompts, while requiring zero configuration! TL;DR: 🚀 Processes 3x more… https://t.co/z9zcG30HD2
Text Generation Inference TGI v3.0 has just been released by @huggingface. It processes 3x more tokens and runs 13x faster on long prompts, while requiring zero configuration → The memory optimization enables a single NVIDIA L4 GPU (24GB) to handle 30k tokens for Llama 3.1-8B,… https://t.co/Hhqdxt1NVx
🚀 TGI v3 is Here! 🚀 The latest release of Text Generation Inference is packed with groundbreaking updates. Here’s what you need to know: ✨ 3x More Tokens: Now you can handle up to 30k tokens on a single L4 GPU with Llama 3.1-8B. More power, less memory. ⚡ 13x Faster: 200k+… https://t.co/1SrnBUO1k6
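The "13x faster" headline follows directly from the two timings reported above (2 seconds for TGI v3 vs. 27.5 seconds for vLLM on a 200k-token prompt). A quick arithmetic check, using only the numbers from the text:

```python
# Sanity-check the headline speedup from the reported timings.
tgi_seconds = 2.0     # reported: 200k-token prompt on TGI v3
vllm_seconds = 27.5   # reported: same prompt on vLLM

speedup = vllm_seconds / tgi_seconds
print(f"Reported speedup: {speedup:.2f}x")  # 13.75x, rounded down to "13x" in the headline
```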