Recent discussions in the machine learning community have highlighted the advancements and applications of FlexAttention, a technique designed to improve the efficiency of attention mechanisms in neural networks. Notably, one user reported a 64-fold speedup in creating a block mask for a sequence of 65,000 tokens when the 'create_block_mask' function is compiled alongside FlexAttention. Another user noted that enabling FlexAttention on the Mistral 7B model doubled end-to-end token throughput compared to cuDNN SDPA with a sliding-window attention (SWA) mask. Additionally, the Lingua codebase has been cited as an effective implementation of FlexAttention, demonstrating batched, sequence-stacked attention masking for within-document attention. Furthermore, a new development indicates that LE ATTENTION is now compatible with FlexAttention, allowing intuitive construction of block-structured attention matrices while FlexAttention optimizes execution by exploiting sparsity. Overall, these innovations reflect significant progress in the optimization of attention mechanisms in modern machine learning frameworks.
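A minimal sketch of the setup the 64-fold speedup claim refers to: wrapping both create_block_mask and flex_attention in torch.compile before building a mask for a long sequence. The shapes, the causal mask_mod, and the dtype below are illustrative assumptions, not details from the original report.

import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def causal(b, h, q_idx, kv_idx):
    # Standard causal masking: a query may only attend to earlier or equal positions.
    return q_idx >= kv_idx

# Compile both the mask construction and the attention call; the reported
# speedup concerns the compiled create_block_mask.
create_block_mask = torch.compile(create_block_mask)
flex_attention = torch.compile(flex_attention)

B, H, S, D = 1, 16, 65536, 64  # roughly the 65K-token sequence length mentioned above
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")
out = flex_attention(q, k, v, block_mask=block_mask)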
This is actually a pretty fun mask_mod. Basically, it's a generalization of "document masking", where each token belongs to a "node", and then two tokens can attend to each other if the nodes they belong to are connected. Pretty cool :) https://t.co/57eW8XufGE https://t.co/pnnc7Y1wJh
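A hedged sketch of the mask_mod described in that tweet; the names (node_ids, adjacency, graph_mask_mod) and shapes are my own. Each token carries a node id, and a query token may attend to a key token only if their nodes are connected in an adjacency matrix. Ordinary document masking is the special case where the adjacency matrix is the identity.

import torch
from torch.nn.attention.flex_attention import create_block_mask

S = 1024
num_nodes = 8
# node_ids[i] = which node token i belongs to (assumed precomputed elsewhere)
node_ids = torch.randint(num_nodes, (S,), device="cuda")
# adjacency[a, b] = True if nodes a and b may attend to each other;
# self-edges let tokens attend within their own node.
adjacency = torch.eye(num_nodes, dtype=torch.bool, device="cuda")
adjacency[0, 1] = adjacency[1, 0] = True  # example extra edge

def graph_mask_mod(b, h, q_idx, kv_idx):
    # Allow attention iff the two tokens' nodes are connected.
    return adjacency[node_ids[q_idx], node_ids[kv_idx]]

block_mask = create_block_mask(graph_mask_mod, B=None, H=None, Q_LEN=S, KV_LEN=S)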
LE ATTENTION is now compatible with FlexAttention! Our DAG-based language makes building block-structured attention matrices intuitive, while FlexAttention optimizes execution through sparsity manipulation! FlexAttention is amazing! @cHHillee https://t.co/uol3opKzeN https://t.co/QC7CB3JrJA
A great example of FlexAttention used in a reasonably modern code base is Lingua, which is designed to reproduce Llama 2 7B overnight. They have a great example of batched / sequence-stacked attention masking for within-document attention, which is then used in the mod function… https://t.co/3W1XoiF0Yt
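A minimal sketch of the pattern that tweet points at, not Lingua's actual code: several documents are packed into one long sequence, a per-token document id is recorded, and the mask_mod only allows causal attention within the same document. The document lengths, tensor names, and shapes below are assumptions for illustration.

import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Assumed lengths of the documents packed back-to-back into one sequence.
doc_lengths = torch.tensor([300, 212, 512], device="cuda")
S = int(doc_lengths.sum())
# doc_ids[i] = index of the document that token i came from.
doc_ids = torch.repeat_interleave(
    torch.arange(len(doc_lengths), device="cuda"), doc_lengths
)

def document_causal_mask(b, h, q_idx, kv_idx):
    # Attend only within the same document, and only causally.
    same_doc = doc_ids[q_idx] == doc_ids[kv_idx]
    causal = q_idx >= kv_idx
    return same_doc & causal

block_mask = create_block_mask(document_causal_mask, B=None, H=None, Q_LEN=S, KV_LEN=S)

B, H, D = 1, 8, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = flex_attention(q, k, v, block_mask=block_mask)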