It's too soon to tell—the following is *not* a prediction, just a question. We're starting to see signs that CISC-like "reasoning" models apply RL too early and too aggressively, before the system's downstream objectives are clear. It's long been known that optimizing one… https://t.co/muAMZb4C2g
It's too soon to tell—the following is *not* a prediction, just a question. We're starting to see signs that CISC-like "reasoning" models apply RL too early and too aggressively, before the system's downstream objectives are clear. You might be producing excellent test-takers,… https://t.co/muAMZb4C2g
turns out the recipe for reasoning models had already been established before o1 was even announced. it appears verifiable rewards were the missing lego piece to really set things off. i wonder if oai rushed to release o1-preview to front-run deepseek after seeing ds-coder-v2 https://t.co/sHVg8Z5Clz
Recent discussion in artificial intelligence has highlighted Group Relative Policy Optimization (GRPO) as a method for training reasoning models. According to various users, GRPO runs on the Qwen2-0.5B model showed promising results: the model began generating its own thinking tokens, and correctness rewards improved as completion length increased. The technique is also being applied to the Qwen-1.5B model, which reportedly supports longer context processing. Observers note that GRPO's effectiveness appears to correlate with the strength of the base model, suggesting that stronger base models yield more complex reasoning patterns. Some also characterize the approach as a form of recursive self-improvement, since it reportedly relies solely on the model's past iterations, without external data. Recent runs have reportedly remained stable over 450 steps, with improvement still ongoing. The conversation also touches on the broader role of reinforcement learning in model training, with some experts questioning whether it is being applied too early and too aggressively.
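To make the "group relative" part of GRPO concrete, here is a minimal sketch of the advantage computation at its core: several completions are sampled for one prompt, each is scored, and each score is normalized against the mean and standard deviation of its own group. The reward function, the "Answer:" output convention, and the sample completions are hypothetical and chosen only for illustration; they are not taken from the runs discussed above, and a real training loop would add the policy-gradient update and KL penalty on top.

```python
from statistics import mean, pstdev


def correctness_reward(completion: str, reference: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the text after the last
    'Answer:' marker matches the reference exactly, otherwise 0.0."""
    answer = completion.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if answer == reference else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own group's mean and std.

    This is the group-relative baseline that lets GRPO work without a
    separate learned value model. eps avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group of sampled completions for a single prompt ("2 + 2 = ?").
completions = [
    "Let me think. 2 + 2 is 4. Answer: 4",
    "Answer: 5",
    "Two plus two. Answer: 4",
    "I am not sure.",
]
rewards = [correctness_reward(c, "4") for c in completions]
advantages = group_relative_advantages(rewards)

for completion, adv in zip(completions, advantages):
    print(f"{adv:+.2f}  {completion}")
```

Correct completions end up with positive advantages and incorrect ones with negative advantages, so the update pushes probability toward the better samples within each group. If every completion in a group scores the same, all advantages are zero and that prompt contributes no learning signal, which is one plausible reason a stronger base model (one that sometimes gets the answer right) matters.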