DAPO: An Open-Source LLM Reinforcement Learning System at Scale From a joint ByteDance/Tsinghua team. Proposes the Decoupled Clip and Dynamic sAmpling Policy Optimization algorithm and fully open-sources a SOTA large-scale RL system. Both were used to achieve 50 points on AIME… https://t.co/wWwyjU8bSD
New RL method that's better than GRPO! 🤯 @ByteDanceOSS released a new open-source RL method that outperforms GRPO. DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) achieves 50 points on the AIME 2024 benchmark with 50% fewer training steps. TL;DR: 🏆 50%… https://t.co/CacY1tu9GU
ByteDance just dropped DAPO on Hugging Face: An Open-Source LLM Reinforcement Learning System at Scale. https://t.co/2hjJnAnkFw
Xiaomi has launched a state-of-the-art audio reasoning model that utilizes the GRPO reinforcement learning algorithm from DeepSeek-R1. The model achieved a score of 64% on the MMAU benchmark within just one week.

In a separate development, ByteDance, in collaboration with Tsinghua University and the University of Hong Kong, has introduced DAPO, an open-source large-scale reinforcement learning system. DAPO, or Decoupled Clip and Dynamic Sampling Policy Optimization, reportedly outperforms the GRPO method, achieving a score of 50 points on the AIME 2024 benchmark while requiring 50% fewer training steps. This advancement was made public through various channels, including Hugging Face.
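None of the posts above show what the two components in DAPO's name actually do. A minimal sketch of the ideas, assuming the standard PPO-style clipped surrogate: "decoupled clip" uses separate lower and upper clip margins so the upper bound can be widened to encourage exploration, and "dynamic sampling" drops prompt groups whose sampled rewards are all identical, since those contribute zero advantage signal. The function names and the epsilon values below are illustrative, not taken from the release.

```python
def dapo_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Decoupled-clip surrogate for one token.

    Unlike PPO's symmetric clip(ratio, 1-eps, 1+eps), the lower and upper
    margins are independent; the values here are illustrative placeholders.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Standard pessimistic (min) choice between unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)


def dynamic_sampling_filter(reward_groups):
    """Keep only prompt groups with mixed outcomes.

    A group where every sampled completion got the same reward (all correct
    or all wrong) yields zero group-relative advantage, so it is resampled
    rather than used for a gradient step.
    """
    return [g for g in reward_groups if len(set(g)) > 1]
```

For example, with a probability ratio of 1.5 and a positive advantage, the asymmetric upper margin caps the surrogate at `1.28 * advantage` instead of PPO's usual `1.2 * advantage`, while an all-correct reward group like `[1, 1, 1]` is filtered out before the update.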