Researchers from Tsinghua University, the Beijing Institute for General Artificial Intelligence, and Pennsylvania State University have introduced a new reinforcement learning with verifiable rewards (RLVR) paradigm called Absolute Zero, instantiated in a model named the Absolute Zero Reasoner (AZR). This approach enables large language models (LLMs) to teach themselves reasoning skills without relying on any human-labeled data. AZR operates through a self-play mechanism in which the model generates its own coding puzzles, then solves them and grades the solutions autonomously by executing Python code. Despite starting from an empty task set, AZR outperforms models trained on tens to hundreds of thousands of labeled examples, suggesting the potential to improve the accuracy and capabilities of generative AI over time. Experts, including researchers affiliated with OpenAI, anticipate that such reasoning models could lead to new scientific discoveries and mark a shift in AI training paradigms.
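The self-play loop described above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the function names (`execute`, `grade`), the single-function program convention `f`, and the example task are all assumptions; in practice the program would run in a proper sandbox rather than a bare `exec`.

```python
def execute(program_src: str, arg):
    """Run a proposed single-function program on an input and return its output.

    Hypothetical helper: AZR-style systems derive the gold answer by actually
    executing the proposed code, so no human label is needed. A real system
    would sandbox this call; plain exec() is used here only for illustration.
    """
    namespace = {}
    exec(program_src, namespace)
    return namespace["f"](arg)


def grade(predicted, gold) -> float:
    """Binary verifiable reward: 1.0 if the solver's answer matches the executed output."""
    return 1.0 if predicted == gold else 0.0


# --- propose phase: the model emits a program and an input (example task) ---
proposed_program = "def f(x):\n    return sorted(x)[::-1]"
proposed_input = [3, 1, 2]

# The environment computes the gold output by execution, making the task gradable.
gold_output = execute(proposed_program, proposed_input)

# --- solve phase: a model's predicted output is graded against the gold output ---
print(gold_output)                      # [3, 2, 1]
print(grade([3, 2, 1], gold_output))    # 1.0 (correct prediction)
print(grade([1, 2, 3], gold_output))    # 0.0 (incorrect prediction)
```

The key property is that the reward comes from code execution rather than from a human label, which is what lets the curriculum bootstrap from nothing.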
Tsinghua University’s Absolute Zero: Self-Training LLMs Without External Data #ArtificialIntelligence #MachineLearning #LargeLanguageModels #TsinghuaUniversity #SelfTrainingAI https://t.co/bhhmxfvq3E https://t.co/6HTkO9prcM
[ Meta‑Agentic α‑AGI 👁️✨ Demo v3 — AZR‑Powered “Alpha‑Factory v1” ] Absolute Zero Reasoner (AZR) self‑curriculum — a reinforced self‑play engine that perpetually invents and solves its own tasks, unlocking open‑ended cross‑domain reasoning. GitHub: https://t.co/bZDE01TS6h https://t.co/sMRgxiBl8u
AI That Teaches Itself: Tsinghua University’s ‘Absolute Zero’ Trains LLMs With Zero External Data Researchers from Tsinghua University, Beijing Institute for General Artificial Intelligence, and Pennsylvania State University have proposed an RLVR paradigm called Absolute Zero to https://t.co/ghgKHAQ8yh