ByteDance, in partnership with Tsinghua University and the University of Hong Kong, has unveiled DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), an open-source reinforcement learning system for large language models (LLMs). The system is available on Hugging Face and is built on the Qwen2.5-32B base model. DAPO introduces four key techniques: 'Clip-Higher' to prevent entropy collapse, dynamic sampling to improve training efficiency, a token-level policy gradient loss, and overlong reward shaping to encourage concise reasoning. These changes address reproducibility challenges and cut training cost, requiring 50% fewer training steps than previous methods. The system scored 50 points on the AIME 2024 benchmark, surpassing prior models that scored 47 points, and it strengthens chain-of-thought reasoning in LLMs while outperforming GRPO, an earlier reinforcement learning method. By open-sourcing all algorithmic details, training procedures, and datasets, including the DAPO-Math-17K dataset for mathematical reasoning tasks, ByteDance aims to promote reproducibility and collaboration within the reinforcement learning community.
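For readers who want the mechanics, here is a minimal PyTorch-style sketch of how the reported techniques fit together, assuming a GRPO-like setup in which G responses are sampled per prompt and scored by a rule-based verifier. Function names, tensor shapes, and the eps_low/eps_high values are illustrative assumptions, not the released implementation.

```python
# Sketch of a DAPO-style surrogate loss: asymmetric ("Clip-Higher") clipping
# plus token-level averaging across the whole group of sampled responses.
# Hyperparameters and names are illustrative, not ByteDance's released code.
import torch

def dapo_policy_loss(
    logprobs: torch.Tensor,       # (G, T) log pi_theta for each sampled token
    old_logprobs: torch.Tensor,   # (G, T) log-probs under the sampling policy
    advantages: torch.Tensor,     # (G,) group-normalized reward per response
    mask: torch.Tensor,           # (G, T) float mask, 1.0 on response tokens
    eps_low: float = 0.2,         # lower clip range, as in PPO/GRPO
    eps_high: float = 0.28,       # larger upper range = "Clip-Higher"
) -> torch.Tensor:
    ratio = torch.exp(logprobs - old_logprobs)           # per-token importance ratio
    adv = advantages.unsqueeze(-1).expand_as(ratio)      # broadcast to token level
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    per_token = torch.minimum(unclipped, clipped) * mask
    # Token-level loss: average over ALL tokens in the group, so long and short
    # responses contribute in proportion to their length instead of each sample
    # being averaged separately.
    return -per_token.sum() / mask.sum().clamp(min=1.0)

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Group-relative advantage (no value network): standardize rewards within
    # the group of G responses sampled for the same prompt. Overlong reward
    # shaping (a soft length penalty near the generation limit) would be
    # applied to `rewards` before this step.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def keep_prompt(rewards: torch.Tensor) -> bool:
    # Dynamic sampling: skip prompts whose sampled group is all-correct or
    # all-wrong (zero advantage everywhere) and keep sampling until the batch
    # contains only prompts that still carry a learning signal.
    return bool(rewards.max() != rewards.min())
```

Under these assumptions, the differences from a vanilla PPO/GRPO loss are the decoupled clip range, whose larger upper bound leaves room for low-probability tokens to gain mass and helps avoid entropy collapse, and the token-level normalization, which keeps long reasoning traces from being down-weighted.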
This is huge. They just allowed everyone to fine-tune LLMs with RL from your browser. You can even use GRPO, DeepSeek's RL method. A fine-tuned model "outperformed OpenAI o1 and DeepSeek-R1 with a dozen labeled data points." https://t.co/z6tgze2J6J
I'm very impressed by progress in LLM miniaturization. We have 8B finetunes that can manage interesting conversations. We have ~30B models that just get it. There's still room for improvement, and there will be competent LLMs for every hole you can find to fit them in.