Yeah, that's what I'm talking about. A 1M context window open-source LLM on the market from Shanghai. 🔥👀 https://t.co/voFmozPIt6
2M context window LLMs are already a fact today. 100M context windows are at an advanced work-in-progress stage, and some popular papers from big labs have found ways to reach a 1B-token context window. So RAG's long-term (5+ years) future is uncertain. https://t.co/MpVRLN9yZb https://t.co/kXJTtazeOo
we've got a 1M context window open-source llm from a shanghai lab out in the wild and you're still stuck with your rag pipeline? @intern_lm https://t.co/sInjZaA8rL
Microsoft has introduced a new caching technique for language models called 'You Only Cache Once' (YOCO), which cuts memory usage by storing key-value pairs just once and reusing that single cache across the decoder layers, aiming to make large language models (LLMs) more efficient. MemLong, another approach, lets LLMs handle up to 80,000 tokens on a single GPU by storing past context in memory banks, roughly 20x more context than they could otherwise process. In China, Alibaba's Qwen models and the new F5-TTS model show significant progress in open-source AI, and a Shanghai lab has released a 1-million-token context window open-source LLM, with future work targeting context windows of up to 100 million or even 1 billion tokens. Another method, KV cache compression, retains 97% of performance by allocating cache budgets based on attention-head importance.
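To make the "allocate cache budgets based on head importance" idea concrete: instead of giving every attention head the same number of KV slots, you score each head, hand the important heads a bigger share of a global budget, and within each head evict the least-attended positions. Here is a minimal NumPy sketch under my own assumptions — the function name, tensor shapes, and the attention-mass scoring are illustrative, not the actual method from the paper.

```python
import numpy as np

def compress_kv_cache(keys, values, attn_weights, total_budget):
    """
    keys, values : (num_heads, seq_len, head_dim) cached tensors
    attn_weights : (num_heads, seq_len) attention mass each head placed on
                   each cached position (used here as an importance proxy)
    total_budget : total number of KV entries to keep across all heads
    Returns a list of per-head (keys, values) pairs after compression.
    """
    num_heads, seq_len, _ = keys.shape

    # 1. Score each head by the total attention mass it carries.
    head_importance = attn_weights.sum(axis=1)            # (num_heads,)
    head_importance = head_importance / head_importance.sum()

    # 2. Split the global budget proportionally to head importance,
    #    guaranteeing every head keeps at least one entry.
    budgets = np.maximum(1, (head_importance * total_budget).astype(int))

    compressed = []
    for h in range(num_heads):
        k = min(int(budgets[h]), seq_len)
        # 3. Within each head, keep the positions that received the most attention.
        keep = np.argsort(attn_weights[h])[-k:]
        keep.sort()  # preserve original token order
        compressed.append((keys[h, keep], values[h, keep]))
    return compressed

# Tiny usage example with random tensors.
rng = np.random.default_rng(0)
H, T, D = 8, 128, 64
keys = rng.standard_normal((H, T, D))
values = rng.standard_normal((H, T, D))
attn = rng.random((H, T))
out = compress_kv_cache(keys, values, attn, total_budget=256)
print([k.shape for k, _ in out])  # unequal per-head cache sizes
```

The point of the design is that heads which barely attend to long-range context get tiny caches, freeing budget for the heads that actually need it, which is how these methods keep most of the quality while shrinking KV memory.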