Researchers from Tsinghua University and Shanghai AI Lab have introduced Test-Time Reinforcement Learning (TTRL), a method that enables large language models (LLMs) to learn and improve from unlabeled data during inference. TTRL leverages the priors of the pre-trained model and uses a majority-voting reward based on consensus among the model's own predictions, allowing LLMs to self-evolve without access to ground-truth labels. The approach has demonstrated notable gains, such as a 159% increase in Qwen2.5-Math-7B's pass@1 score on the AIME 2024 benchmark. Meanwhile, other research indicates that reinforcement learning (RL) fine-tuning does not necessarily expand an LLM's reasoning capacity beyond the base model but rather improves sampling efficiency, whereas distillation appears more effective at broadening reasoning capabilities. Additional studies from the University of Michigan, UC Berkeley, MIT, Carnegie Mellon University, New York University, Google Research, Duke University, Together AI, and Google DeepMind explore further aspects of LLM training, reasoning, safety, and efficiency, reflecting ongoing efforts to understand how RL shapes decision-making and reasoning in LLMs and to build safer, more reliable language models.
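To make the majority-voting reward concrete, here is a minimal Python sketch of the idea described above: sample several answers for the same prompt, treat the consensus answer as a pseudo-label, and reward the rollouts that match it. The function name, the binary reward scheme, and the toy answers are illustrative assumptions rather than the authors' implementation; in TTRL these rewards would then drive a standard RL update on the model.

```python
from collections import Counter

def majority_vote_reward(sampled_answers):
    """Sketch of a TTRL-style consensus reward (hypothetical helper).

    `sampled_answers` is a list of final answers extracted from N generated
    solutions to the same prompt. The majority answer serves as a pseudo
    ground truth, and each rollout gets a 0/1 reward for agreeing with it.
    """
    # Majority vote over sampled answers acts as the pseudo-label.
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    # Binary reward: 1.0 if a rollout's answer matches the consensus, else 0.0.
    rewards = [1.0 if ans == pseudo_label else 0.0 for ans in sampled_answers]
    return pseudo_label, rewards

# Hypothetical usage: answers extracted from 8 sampled solutions to one math prompt.
answers = ["42", "42", "41", "42", "7", "42", "42", "41"]
label, rewards = majority_vote_reward(answers)
print(label, rewards)  # "42", with reward 1.0 for the rollouts that agree
```

The key point of the scheme is that no ground-truth label is needed: the reward signal comes entirely from agreement among the model's own sampled predictions.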
Sharing my talk “Safe Reasoning in the Wild” at the Large Model Safety Workshop: 📎 https://t.co/p4DnYzJO1m We explored where large models fall short in real-world, safe reasoning. Our work identifies three blind spots of LLM safety and reliability: 🔍 Context — LLMs miss https://t.co/TwQA6GFosg
[LG] A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment https://t.co/12KxNRKmcp https://t.co/PbMd49tCzK
[LG] LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities T Schmied, J Bornschein, J Grau-Moya, M Wulfmeier... [Google DeepMind] (2025) https://t.co/SeDb5ois0c https://t.co/wWwamuEcaX