A series of 12 groundbreaking research papers on improving large language model (LLM) performance was published within the first 50 days of 2025, exploring new methods to extend context length and refine the core architecture of LLMs.

Key innovations include DarwinLM, which uses evolutionary structured pruning to achieve a 2x reduction in model size with only a 3% performance loss across various tasks. The approach combines an evolutionary search over candidate model substructures with a lightweight training procedure to identify the best-performing pruned configurations (a toy version of this loop is sketched below).

Another advance is the InfiniteHiP framework, which extends LLM context length to 3 million tokens on a single GPU. It achieves this through hierarchical token pruning, adaptive adjustments to Rotary Position Embeddings (RoPE), and efficient memory management within the SGLang serving system.

LongRoPE, a technique that modifies RoPE to handle context windows beyond 2 million tokens, has also been introduced. It scales down high-frequency components of the RoPE embeddings as context length increases, maintaining low perplexity across evaluation lengths and achieving over 90% accuracy on tasks requiring long contexts (a simplified frequency-rescaling sketch follows the pruning example below). This body of research also explores the concept of Intrinsic Space, scaling laws, optimal context length, and theoretical bounds on context-length scaling.
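To make the evolutionary-pruning idea concrete, here is a minimal sketch of that kind of search loop: keep a population of per-layer pruning configurations, score each one with a cheap fitness proxy (standing in for the paper's lightweight training step), and evolve the fittest. The fitness function, mutation scheme, and sizes below are invented for illustration and are not DarwinLM's actual procedure.

```python
# Toy evolutionary structured pruning: search over per-layer keep ratios.
import random

NUM_LAYERS = 12
TARGET_KEEP = 0.5            # aim for roughly a 2x size reduction

def random_config() -> list[float]:
    """A candidate: the fraction of units to keep in each layer."""
    return [random.uniform(0.3, 1.0) for _ in range(NUM_LAYERS)]

def mutate(cfg: list[float]) -> list[float]:
    """Small Gaussian perturbation of a parent's keep ratios."""
    return [min(1.0, max(0.1, k + random.gauss(0, 0.05))) for k in cfg]

def fitness(cfg: list[float]) -> float:
    """Stand-in for 'lightweight training + evaluation': reward configs that
    hit the target overall sparsity while keeping earlier layers denser."""
    size_penalty = abs(sum(cfg) / NUM_LAYERS - TARGET_KEEP)
    quality_proxy = sum(k * (NUM_LAYERS - i) for i, k in enumerate(cfg))
    return quality_proxy - 50.0 * size_penalty

population = [random_config() for _ in range(16)]
for generation in range(20):
    population.sort(key=fitness, reverse=True)
    survivors = population[:4]                          # selection
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(12)]       # mutation

best = max(population, key=fitness)
print("best per-layer keep ratios:", [round(k, 2) for k in best])
```

In a real system the fitness call would fine-tune and evaluate the pruned model briefly rather than score a synthetic proxy, which is where most of the compute goes.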
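And for the RoPE-scaling idea, the sketch below shows the general mechanism of rescaling rotation frequencies so that positions beyond the training length map back into the trained range. The uniform scale factor, `train_len`, `target_len`, and `head_dim` here are illustrative assumptions; LongRoPE's actual per-dimension scale factors are found by search and are not reproduced.

```python
# Simplified RoPE frequency rescaling for context extension (not LongRoPE's
# exact method): divide the inverse frequencies by a scale factor when the
# target context exceeds the training context, slowing the rotations.
import numpy as np

def rope_frequencies(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def scaled_rope_angles(positions: np.ndarray, head_dim: int,
                       train_len: int, target_len: int) -> np.ndarray:
    """Rotation angles with frequencies scaled down for longer contexts."""
    inv_freq = rope_frequencies(head_dim)
    scale = max(target_len / train_len, 1.0)    # > 1 when extending context
    inv_freq = inv_freq / scale                 # slower rotation => longer reach
    return np.outer(positions, inv_freq)        # shape: (seq_len, head_dim // 2)

angles = scaled_rope_angles(np.arange(8192), head_dim=128,
                            train_len=4096, target_len=8192)
cos, sin = np.cos(angles), np.sin(angles)       # applied to query/key pairs
```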
We’re not yet at the point where a single LLM call can solve many of the most valuable problems in production. As a consequence, practitioners frequently deploy compound AI systems composed of multiple prompts and sub-stages, often with multiple calls per stage. These systems'… https://t.co/WpBuTYkf3B
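As a rough illustration of what such a compound system looks like in code, the sketch below chains several prompt stages and makes multiple model calls inside one stage. `call_llm` is a hypothetical stand-in for whatever client a deployment actually uses, not a real API.

```python
# Minimal compound-AI-system sketch: plan -> multiple drafts -> final merge.
def call_llm(prompt: str) -> str:
    """Hypothetical model call; replace with a real client in practice."""
    return f"<model output for: {prompt[:40]}...>"

def answer_question(question: str, num_drafts: int = 3) -> str:
    # Stage 1: decompose the task into a plan.
    plan = call_llm(f"Break this question into steps: {question}")
    # Stage 2: multiple calls within one stage (draft several candidates).
    drafts = [call_llm(f"Answer step-by-step using this plan:\n{plan}\n\nQ: {question}")
              for _ in range(num_drafts)]
    # Stage 3: a final call selects and refines among the drafts.
    return call_llm("Pick the best answer and refine it:\n" + "\n---\n".join(drafts))

print(answer_question("Why does extending RoPE help long-context attention?"))
```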
The only benchmark that matters from now on. I love it. https://t.co/GhkVKxgv3R