Recent advances in text generation include Intel's dynamic speculative decoding, which speeds up generation by 2-3x. Speculative decoding is a two-stage generative process: a smaller, less accurate draft model proposes tokens that the larger target model then verifies. The DISCO framework further improves speculative decoding speed by 10-100% and is now the default for assisted generation in the Transformers library. The scaling law for large language models (LLMs) indicates that more compute and data yield better performance, and the same principle holds at inference time. New experiments suggest that Superposed Decoding can substantially aid inference-time scaling without increasing compute cost. These developments will be discussed further at the upcoming NeurIPS conference in Vancouver.
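The two-stage draft-then-verify process described above maps onto the `assistant_model` argument of `generate` in the Transformers library. Below is a minimal sketch of assisted generation under assumed model choices; the OPT target/draft pair and prompt are illustrative, not taken from the source.

```python
# Minimal sketch: assisted generation (speculative decoding) in Transformers.
# Model names are placeholder assumptions; any target/draft pair that shares
# a tokenizer should work.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "facebook/opt-1.3b"   # larger, more accurate target model (assumption)
draft_name = "facebook/opt-125m"    # smaller, faster draft model (assumption)

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(target_name)
draft = AutoModelForCausalLM.from_pretrained(draft_name)

inputs = tokenizer("Speculative decoding works by", return_tensors="pt")

# Passing the draft via `assistant_model` enables assisted generation: the
# draft proposes several tokens per step, and the target verifies them in a
# single forward pass, accepting the longest matching prefix.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The speedup comes from the target model checking a whole block of drafted tokens per forward pass instead of generating one token at a time, while the accepted output remains identical to what the target would have produced on its own.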
✨ New and exciting updates for Superposed Decoding! Helps inference time search drastically without increasing the compute cost 🚀 Check out the 🧵 by @ethnlshn and the updated paper at https://t.co/eUL1bS5JK6 See you all at Vancouver @NeurIPSConf ! https://t.co/xAAqdhTCKj
Can Superposed Decoding assist in inference time scaling 📈? In new experiments, we show the answer is a resounding yes! (1/3) https://t.co/Wo4IqNM9tU
inference-time compute should pareto frontier the shit outta everything. i imagine a future where our best models are like 5b params and use not a token more than absolutely necessary depending on task difficulty. everything will be wayyyy cheaper