Researchers from Tsinghua University and Shanghai AI Lab have introduced Test-Time Reinforcement Learning (TTRL), a method that enables large language models (LLMs) to learn and improve from unlabeled data during inference. TTRL leverages the priors of the pre-trained model and uses a majority-voting reward based on consensus among the model's own predictions, allowing LLMs to self-evolve without access to ground-truth labels. The approach has demonstrated notable gains, such as a 159% increase in Qwen2.5-Math-7B's pass@1 score on the AIME 2024 benchmark. Meanwhile, other research indicates that reinforcement learning (RL) fine-tuning does not necessarily expand an LLM's reasoning capacity beyond the base model but rather improves sampling efficiency, whereas distillation appears more effective at broadening reasoning capabilities. Additional studies from the University of Michigan, UC Berkeley, MIT, Carnegie Mellon University, New York University, Google Research, Duke University, Together AI, and Google DeepMind explore further aspects of LLM training, reasoning, safety, and efficiency, reflecting ongoing efforts to understand how RL shapes decision-making and reasoning in LLMs and to build safer, more reliable language models.
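To make the majority-voting reward concrete, here is a minimal Python sketch of the idea described above: sample several answers for the same prompt, treat the consensus answer as a pseudo-label, and reward the rollouts that match it. The function name, the binary reward scheme, and the toy answers are illustrative assumptions rather than the authors' implementation; in TTRL these rewards would then drive a standard RL update on the model.

```python
from collections import Counter

def majority_vote_reward(sampled_answers):
    """Sketch of a TTRL-style consensus reward (hypothetical helper).

    `sampled_answers` is a list of final answers extracted from N generated
    solutions to the same prompt. The majority answer serves as a pseudo
    ground truth, and each rollout gets a 0/1 reward for agreeing with it.
    """
    # Majority vote over sampled answers acts as the pseudo-label.
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    # Binary reward: 1.0 if a rollout's answer matches the consensus, else 0.0.
    rewards = [1.0 if ans == pseudo_label else 0.0 for ans in sampled_answers]
    return pseudo_label, rewards

# Hypothetical usage: answers extracted from 8 sampled solutions to one math prompt.
answers = ["42", "42", "41", "42", "7", "42", "42", "41"]
label, rewards = majority_vote_reward(answers)
print(label, rewards)  # "42", with reward 1.0 for the rollouts that agree
```

The key point of the scheme is that no ground-truth label is needed: the reward signal comes entirely from agreement among the model's own sampled predictions.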
Sharing my talk “Safe Reasoning in the Wild” at the Large Model Safety Workshop: 📎 https://t.co/p4DnYzJO1m We explored where large models fall short in real-world, safe reasoning. Our work identifies three blind spots of LLM safety and reliability: 🔍 Context — LLMs miss https://t.co/TwQA6GFosg
[LG] A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment https://t.co/12KxNRKmcp https://t.co/PbMd49tCzK
[LG] LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities T Schmied, J Bornschein, J Grau-Moya, M Wulfmeier... [Google DeepMind] (2025) https://t.co/SeDb5ois0c https://t.co/wWwamuEcaX