Another brilliant release from @Nvidia: Star Attention, a novel mechanism for LLM inference that achieves up to an 11x speedup while maintaining 95-100% accuracy through a block-sparse approximation approach. It addresses the critical challenge of processing long sequences… https://t.co/nehlPL56gX
NVIDIA AI Research Unveils ‘Star Attention’: A Novel AI Algorithm for Efficient LLM Long-Context Inference https://t.co/bWeJvTWueA #NVIDIA #AIResearch #StarAttention #MachineLearning #LongContextInference #ai #news #llm #ml #research #ainews #innovation #artificialintelligenc… https://t.co/ywUm6T1LAH
Star Attention: Efficient LLM Inference over Long Sequences. NVIDIA researchers introduce Star Attention, a two-phase attention mechanism that processes long sequences by combining blockwise-local attention for context encoding with sequence-global attention for query processing… https://t.co/e2pP2yyaSw
NVIDIA has introduced a new AI algorithm called Star Attention, designed to improve the efficiency of long-context inference in large language models (LLMs). The mechanism uses a two-phase attention approach that combines blockwise-local attention for encoding the context with sequence-global attention for query processing. Through this block-sparse approximation, the algorithm reportedly achieves up to an 11x speedup while maintaining 95-100% accuracy. Separately, NVIDIA's Hymba model is noted for setting new standards among small language models, outperforming all sub-2B-parameter models and rivaling Llama-3.2-3B. Furthermore, TensorRT-LLM's Multiblock Attention has demonstrated a threefold increase in throughput for long-sequence AI inference on NVIDIA HGX H200 GPUs, improving performance without the need for additional hardware.
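Based only on the high-level description above (not NVIDIA's released code), the following is a minimal single-device NumPy sketch of the two-phase idea: blockwise-local attention encodes the context block by block and caches keys/values, then sequence-global attention lets the query attend over all cached positions. The block size, single-head attention, and the omission of distributed-host and softmax-aggregation details are simplifying assumptions for illustration.

```python
# Hedged sketch of the two-phase "blockwise-local then sequence-global"
# attention idea described above; dimensions and block size are
# illustrative assumptions, not NVIDIA's reference implementation.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Standard scaled dot-product attention for a single head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def star_attention_sketch(context, query, wq, wk, wv, block_size=4):
    """Phase 1: encode the context in blocks, attending only within each
    block (blockwise-local attention) and caching K/V.
    Phase 2: let the query attend to every cached K/V position
    (sequence-global attention)."""
    k_cache, v_cache = [], []
    for start in range(0, len(context), block_size):
        block = context[start:start + block_size]
        q_b, k_b, v_b = block @ wq, block @ wk, block @ wv
        _ = attention(q_b, k_b, v_b)   # local encoding within the block
        k_cache.append(k_b)            # cache keys/values for phase 2
        v_cache.append(v_b)
    k_all = np.concatenate(k_cache)
    v_all = np.concatenate(v_cache)
    q = query @ wq
    return attention(q, k_all, v_all)  # query attends globally

# Toy usage: random embeddings standing in for token representations.
d = 8
rng = np.random.default_rng(0)
context = rng.normal(size=(16, d))     # "long" context of 16 tokens
query = rng.normal(size=(1, d))        # single query token
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = star_attention_sketch(context, query, wq, wk, wv)
print(out.shape)                       # (1, 8)
```

The sketch shows where the speedup would come from: phase 1 never forms attention scores across blocks, so its cost grows with block size rather than full context length, while only the short query pays the global-attention cost in phase 2.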