🤔 10 million tokens ≈ 20,000 pages, roughly. Ever wondered how they can achieve 10 million tokens of input in training? Essentially, they use mathematical tricks in the architecture to achieve it, letting the model select only the context it needs from a very large input https://t.co/IYPcoEMajn
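Back-of-the-envelope check on that page estimate, as a minimal sketch: the ~500 tokens per page figure is an assumption (not stated in the thread), chosen because it reproduces the 20,000-page number.

```python
# Rough tokens-to-pages conversion. TOKENS_PER_PAGE is an assumed heuristic.
TOKENS_PER_PAGE = 500

context_tokens = 10_000_000
pages = context_tokens / TOKENS_PER_PAGE
print(f"{context_tokens:,} tokens ~= {pages:,.0f} pages")  # ~20,000 pages
```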
Standard attention struggles to locate context that matches multiple criteria, because it relies only on single query-key vector similarities. This paper introduces Multi-Token Attention (MTA), which lets attention weights depend on neighboring queries, keys, and heads simultaneously via convolution https://t.co/wh3fAPMBvH
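A minimal sketch of the key-query convolution idea, assuming PyTorch; the kernel sizes, masking details, and the separate head-mixing step are simplified illustrations, not the paper's exact implementation.

```python
# Sketch of Multi-Token Attention's key-query convolution: a depthwise 2D
# convolution over the (query, key) plane of pre-softmax attention scores,
# so each weight can depend on neighboring queries and keys.
import math
import torch
import torch.nn.functional as F

def mta_scores(q, k, kernel):
    """
    q, k:   (batch, heads, seq, head_dim)
    kernel: (heads, 1, cq, ck) depthwise conv kernel over (query, key) offsets
    Returns attention weights of shape (batch, heads, seq, seq).
    """
    b, h, n, d = q.shape
    logits = q @ k.transpose(-2, -1) / math.sqrt(d)        # standard QK^T scores
    # Zero out future positions before convolving so future keys cannot leak in.
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    logits = logits.masked_fill(mask, 0.0)
    # Depthwise convolution across query/key offsets: the core MTA idea.
    cq, ck = kernel.shape[-2:]
    logits = F.conv2d(logits, kernel, padding=(cq // 2, ck // 2), groups=h)
    logits = logits.masked_fill(mask, float("-inf"))       # re-mask, then softmax
    return logits.softmax(dim=-1)

# Example usage with toy shapes.
b, h, n, d = 1, 4, 16, 32
q, k = torch.randn(b, h, n, d), torch.randn(b, h, n, d)
kernel = torch.randn(h, 1, 3, 5) * 0.1
attn = mta_scores(q, k, kernel)
print(attn.shape)  # torch.Size([1, 4, 16, 16])
```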
Small models with huge context lengths are gonna be a real disappointment when you actually try to use them. Even the biggest (multi-trillion-parameter) models degrade quickly. The huge context lengths are made possible by attention mechanisms that let them cheat.
Recent research has shed light on the behavior of large language models (LLMs), particularly their tendency to focus excessively on the first token in a sequence, a phenomenon termed 'attention sink.' This behavior can hinder the effective mixing of information between tokens, potentially leading to representational collapse in deep Transformer architectures. A new paper proposes a Multi-Token Attention (MTA) method, which employs convolution operations over multiple query and key vectors to enhance context relevance identification. This approach reportedly outperforms traditional Transformer models in language modeling and long-context tasks. The findings suggest that conventional attention mechanisms struggle to locate context based on multiple criteria, as they rely solely on single query-key vector similarities. The research indicates a need for improved architectural strategies to manage large context lengths effectively, with implications for the development of future LLMs.
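A quick diagnostic for the 'attention sink' effect described above: measure how much attention mass each head assigns to the first token. This is a hypothetical helper, assuming PyTorch and a HuggingFace-style model that returns per-layer attention maps; the random tensors below stand in for real model outputs.

```python
import torch

def first_token_attention_mass(attentions):
    """
    attentions: tuple of (batch, heads, seq, seq) tensors, one per layer,
                e.g. model(**inputs, output_attentions=True).attentions
    Returns a (layers, heads) tensor with the average attention weight
    that query positions place on position 0.
    """
    per_layer = []
    for layer_attn in attentions:
        # Weight given to key position 0: (batch, heads, seq) -> mean over batch & queries.
        per_layer.append(layer_attn[..., 0].mean(dim=(0, -1)))
    return torch.stack(per_layer)  # (layers, heads)

# Toy example with random "attention maps" in place of real model outputs.
fake_attn = tuple(torch.rand(1, 8, 32, 32).softmax(dim=-1) for _ in range(4))
print(first_token_attention_mass(fake_attn).shape)  # torch.Size([4, 8])
```

Values close to 1.0 for a head would indicate it is dumping nearly all of its attention on the first token, the sink behavior the paper argues MTA helps mitigate.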