ByteDance, in partnership with Tsinghua University and the University of Hong Kong, has unveiled DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), an open-source reinforcement learning system for large language models (LLMs). The system is available on Hugging Face and is built on the Qwen2.5-32B base model. DAPO introduces four key techniques: 'Clip-Higher' to prevent entropy collapse, dynamic sampling to improve training efficiency, a token-level policy gradient loss, and overlong reward shaping to encourage concise reasoning. These changes address reproducibility challenges and cut training cost, requiring 50% fewer training steps than previous methods. The system scored 50 points on the AIME 2024 benchmark, surpassing prior models that scored 47 points, and it strengthens chain-of-thought reasoning in LLMs while outperforming GRPO, an earlier reinforcement learning method. By open-sourcing all algorithmic details, training procedures, and datasets, including the DAPO-Math-17K dataset for mathematical reasoning tasks, ByteDance aims to promote reproducibility and collaboration within the reinforcement learning community.
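For readers who want the mechanics, here is a minimal PyTorch-style sketch of how the reported techniques fit together, assuming a GRPO-like setup in which G responses are sampled per prompt and scored by a rule-based verifier. Function names, tensor shapes, and the eps_low/eps_high values are illustrative assumptions, not the released implementation.

```python
# Sketch of a DAPO-style surrogate loss: asymmetric ("Clip-Higher") clipping
# plus token-level averaging across the whole group of sampled responses.
# Hyperparameters and names are illustrative, not ByteDance's released code.
import torch

def dapo_policy_loss(
    logprobs: torch.Tensor,       # (G, T) log pi_theta for each sampled token
    old_logprobs: torch.Tensor,   # (G, T) log-probs under the sampling policy
    advantages: torch.Tensor,     # (G,) group-normalized reward per response
    mask: torch.Tensor,           # (G, T) float mask, 1.0 on response tokens
    eps_low: float = 0.2,         # lower clip range, as in PPO/GRPO
    eps_high: float = 0.28,       # larger upper range = "Clip-Higher"
) -> torch.Tensor:
    ratio = torch.exp(logprobs - old_logprobs)           # per-token importance ratio
    adv = advantages.unsqueeze(-1).expand_as(ratio)      # broadcast to token level
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    per_token = torch.minimum(unclipped, clipped) * mask
    # Token-level loss: average over ALL tokens in the group, so long and short
    # responses contribute in proportion to their length instead of each sample
    # being averaged separately.
    return -per_token.sum() / mask.sum().clamp(min=1.0)

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Group-relative advantage (no value network): standardize rewards within
    # the group of G responses sampled for the same prompt. Overlong reward
    # shaping (a soft length penalty near the generation limit) would be
    # applied to `rewards` before this step.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def keep_prompt(rewards: torch.Tensor) -> bool:
    # Dynamic sampling: skip prompts whose sampled group is all-correct or
    # all-wrong (zero advantage everywhere) and keep sampling until the batch
    # contains only prompts that still carry a learning signal.
    return bool(rewards.max() != rewards.min())
```

Under these assumptions, the differences from a vanilla PPO/GRPO loss are the decoupled clip range, whose larger upper bound leaves room for low-probability tokens to gain mass and helps avoid entropy collapse, and the token-level normalization, which keeps long reasoning traces from being down-weighted.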
This is huge. They just allowed everyone to fine-tune LLMs with RL from your browser. You can even use GRPO, DeepSeek's RL method. A fine-tuned model "outperformed OpenAI o1 and DeepSeek-R1 with a dozen labeled data points." https://t.co/z6tgze2J6J
I'm very impressed by progress in LLM miniaturization. We have 8B finetunes that can manage interesting conversations. We have ~30B models that just get it. There's still room for improvement, and there will be competent LLMs for every hole you can find to fit them in.