The research community has introduced Sequoia, a new speculative decoding framework designed to serve the Llama2-70B model on a single RTX4090 graphics card at a latency of roughly half a second per token, with no approximation: it runs in 16-bit precision and preserves the model's original output distribution. This marks a significant improvement over prior setups such as DeepSpeed, which required 5.3 seconds per token, or 8 x A100 GPUs, which achieve about 25 ms per token but at far higher cost. Sequoia is highlighted as more scalable than earlier speculative decoding methods: it scales to very large speculation budgets, adapts to different hardware configurations, and remains robust across decoding configurations. Its introduction is seen as a step toward making large AI models accessible and runnable on consumer-grade GPUs, potentially enabling more localized AI applications. The dynamic programming algorithm used to construct the speculative token tree has been noted for its cleverness.
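For context, the "no approximation" claim rests on the standard draft-and-verify acceptance rule used by lossless speculative decoding, the family of methods Sequoia extends: a token drafted from a small model's distribution q is accepted with probability min(1, p(x)/q(x)) under the target distribution p, and on rejection a token is resampled from the renormalized residual max(0, p - q), so the output matches the target model exactly. The sketch below is a generic toy illustration of that rule in Python, not Sequoia's implementation; the names `p_target`, `q_draft`, and `speculative_step` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(probs):
    """Draw one index from a categorical distribution."""
    return rng.choice(len(probs), p=probs)

def speculative_step(p_target, q_draft):
    """One draft-and-verify step.

    Proposes a token from the draft distribution q_draft, accepts it with
    probability min(1, p/q), and otherwise resamples from the residual
    max(0, p - q) renormalized, so the returned token is distributed
    exactly according to p_target.
    """
    x = sample(q_draft)
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True               # drafted token accepted
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return sample(residual), False   # rejected: corrective resample

# Toy distributions over a 3-token vocabulary (illustrative values only).
p_target = np.array([0.1, 0.6, 0.3])   # "big" target model
q_draft  = np.array([0.3, 0.4, 0.3])   # "small" draft model
draws = [speculative_step(p_target, q_draft)[0] for _ in range(20000)]
print(np.bincount(draws, minlength=3) / len(draws))  # approaches p_target
```

However good or bad the draft model is, the accept/reject correction keeps the sampled distribution identical to the target model's; the draft quality only affects speed, which is why such methods are lossless.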
TL;DR - Together's Sequoia shows a way to speed up Llama2-70B and run it on a single consumer GPU with an 8x speedup. Being able to run AI locally can mean a few things: it can mean making smaller models better, and we've seen this again and again for the past year, 13B… https://t.co/lm2GQmTYnL
Excited to announce our new speculative decoding method, Sequoia! Sequoia scales speculative decoding to very large speculation budgets, is robust to different decoding configurations, and can adapt to different hardware. Serve Llama2-70B on one RTX4090 with half-second/token… https://t.co/6F6wmgdGLN
A new algorithm for speculative decoding with large budgets, great for serving large models on consumer GPUs with offloading. The dynamic programming algo to construct the speculative tree is quite clever! https://t.co/O5XlFWB69I
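The dynamic program mentioned in the last tweet chooses the shape of the speculation tree under a fixed node budget. As a rough, hedged illustration of what such a search can look like (this is a simplified toy model, not Sequoia's actual formulation), one can memoize over budgets: each drafted candidate at rank i is assumed to be accepted with some estimated probability ACC[i], and if it is accepted its own subtree is speculated in turn, so the recursion picks the number of children and the split of the remaining budget that maximizes the expected number of accepted tokens. The acceptance values and the `best` / `splits` helpers below are hypothetical, introduced only for this example.

```python
from functools import lru_cache
from itertools import combinations

# Hypothetical rank-wise acceptance probabilities: ACC[i] is the chance the
# i-th best draft candidate is accepted given the earlier ones were rejected.
ACC = (0.6, 0.2, 0.1, 0.05)

def splits(total, parts):
    """All ordered ways to split `total` nodes into `parts` nonnegative pieces."""
    for cuts in combinations(range(total + parts - 1), parts - 1):
        prev, piece = -1, []
        for c in cuts:
            piece.append(c - prev - 1)
            prev = c
        piece.append(total + parts - 2 - prev)
        yield tuple(piece)

@lru_cache(maxsize=None)
def best(budget):
    """Max expected number of accepted tokens from a speculation tree with
    at most `budget` drafted nodes, plus the chosen (children, split) shape."""
    if budget <= 0:
        return 0.0, ()
    best_val, best_shape = 0.0, ()
    for k in range(1, min(len(ACC), budget) + 1):   # number of children to draft
        for split in splits(budget - k, k):          # budget left for their subtrees
            val, miss = 0.0, 1.0                     # miss = P(all earlier rejected)
            for i in range(k):
                sub_val, _ = best(split[i])
                val += miss * ACC[i] * (1.0 + sub_val)
                miss *= 1.0 - ACC[i]
            if val > best_val:
                best_val, best_shape = val, (k, split)
    return best_val, best_shape

val, shape = best(8)
print(f"expected accepted tokens with an 8-node budget: {val:.3f}")
print("chosen tree shape (num children, per-child subtree budgets):", shape)
```

The point of solving this as a dynamic program rather than using a fixed chain or a fixed branching factor is that the optimal shape shifts with the budget and the acceptance profile, which is what lets a tree-based method keep improving as the speculation budget grows very large.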