
PyTorch has introduced a new API called FlexAttention, designed to simplify the implementation of attention mechanisms in machine learning models. The API lets users implement diverse attention variants in a few lines of idiomatic PyTorch code. FlexAttention supports arbitrary masks and biases, exploits blockwise sparsity for speed, and can express complex attention mechanisms such as Gemma2 soft-capping and neighborhood attention. It also enables packing mixed-size images or text sequences into a single context. The API accepts a user-defined function, score_mod, which receives the attention score for a pair of query/key tokens along with their positions and returns a modified score, giving developers considerable flexibility. This development aims to overcome the limitations of previous fused attention implementations, providing a more versatile tool for machine learning practitioners.
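
A minimal sketch of how score_mod can be used (assuming PyTorch 2.5+ on a CUDA device; the tensor sizes and the specific bias function are illustrative, not taken from the announcement):

    import torch
    from torch.nn.attention.flex_attention import flex_attention

    # Illustrative sizes: batch, heads, sequence length, head dimension.
    B, H, S, D = 2, 8, 1024, 64
    q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
    k = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
    v = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

    # score_mod receives the raw attention score plus the batch index, head index,
    # and the query/key token positions, and returns a modified score.
    # Here: a simple relative-position bias.
    def relative_bias(score, b, h, q_idx, kv_idx):
        return score + (q_idx - kv_idx)

    out = flex_attention(q, k, v, score_mod=relative_bias)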
Oh this is neat. FlexAttention: implement many of the attention variants/kernels in a few lines of PyTorch with a PyTorch API. FlexAttention accepts a user-defined function, score_mod, which is given the "attention score" of two tokens as well as the "position" of this score. https://t.co/x1dljQ9Ray https://t.co/h11zFSusj6
PyTorch has an official attention gym. Never skip masks day. https://t.co/qvjVVW5oDM https://t.co/iRIDDSFpwm
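
Masks follow the same pattern: a mask_mod function says which query/key pairs may attend, and a precomputed block mask lets the kernel skip fully-masked blocks, which is where the blockwise sparsity speedup comes from. A sketch under the same assumptions as above (sizes are illustrative), using a standard causal mask:

    import torch
    from torch.nn.attention.flex_attention import flex_attention, create_block_mask

    # mask_mod returns True where attention is allowed; here, causal masking.
    def causal(b, h, q_idx, kv_idx):
        return q_idx >= kv_idx

    B, H, S, D = 2, 8, 1024, 64
    # Precomputed once; fully-masked blocks are skipped entirely at runtime.
    block_mask = create_block_mask(causal, B=B, H=H, Q_LEN=S, KV_LEN=S)

    q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
    k = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
    v = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
    out = flex_attention(q, k, v, block_mask=block_mask)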
Torch compile + Triton + FlashAttention algo = GPU go brrr for all attention variants. All my favorite things coming together. https://t.co/4b0n3DOYKP
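
In practice the call is typically wrapped in torch.compile, which lowers flex_attention together with the user-supplied score_mod into a fused Triton kernel instead of materializing the full score matrix. A sketch with a Gemma2-style soft-cap, where the cap value and sizes are illustrative:

    import torch
    from torch.nn.attention.flex_attention import flex_attention

    # Compile once and reuse; the score_mod logic is fused into the attention kernel.
    compiled_flex_attention = torch.compile(flex_attention)

    def soft_cap(score, b, h, q_idx, kv_idx):
        # Gemma2-style soft-capping; the cap of 50.0 is an illustrative value.
        return 50.0 * torch.tanh(score / 50.0)

    B, H, S, D = 1, 4, 512, 64
    q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
    k = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
    v = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
    out = compiled_flex_attention(q, k, v, score_mod=soft_cap)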