Researchers from Northwestern University, Stanford University, and other institutions have identified key challenges in training large language model (LLM) agents with reinforcement learning (RL) for multi-turn interactive tasks. Their recent study, carried out in the RAGEN system, shows that multi-turn training often becomes unstable and collapses, with agents falling into repetitive behavior or producing hallucinated reasoning. To address these issues, the team developed the StarPO framework, which optimizes entire interaction trajectories rather than individual turns, using improved reward shaping and trajectory control. A variant, StarPO-S, further stabilizes training by refining the optimization process. The research also suggests that RL-finetuned reasoning language models can serve as better alternatives to regression-based critics during parallel trajectory search at test time, improving the robustness and reliability of LLM agents in complex, multi-step scenarios.
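The summary above describes trajectory-level optimization only at a high level. Below is a minimal, illustrative sketch (assuming PyTorch and a generic policy model that exposes per-token log-probabilities; the function name and data layout are hypothetical, not the official StarPO implementation) of what it means to optimize an entire multi-turn rollout: compute one return for the whole interaction and apply a policy-gradient update across every generated token in it, rather than treating each turn as an isolated episode.

```python
import torch

def trajectory_policy_loss(logprobs_per_turn, rewards_per_turn, gamma=1.0):
    """Illustrative trajectory-level policy-gradient loss (a sketch, not the
    paper's algorithm): one return per rollout, applied to every action
    token across all turns of the interaction.

    logprobs_per_turn: list of 1-D tensors, log-probs of the tokens the
        agent generated at each turn.
    rewards_per_turn: list of floats, reward received after each turn.
    """
    # Discounted return over the *entire* multi-turn trajectory.
    ret = 0.0
    for r in reversed(rewards_per_turn):
        ret = r + gamma * ret

    # Sum log-probs of all generated tokens, across all turns.
    total_logprob = torch.cat(logprobs_per_turn).sum()

    # REINFORCE-style objective: push the whole trajectory up if its
    # return is high, down otherwise.
    return -(ret * total_logprob)


# Toy usage with fake data: a 3-turn rollout with a sparse final reward.
fake_logprobs = [torch.randn(12, requires_grad=True) for _ in range(3)]
fake_rewards = [0.0, 0.0, 1.0]
loss = trajectory_policy_loss(fake_logprobs, fake_rewards)
loss.backward()
```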
Training LLM Agents Just Got More Stable: Researchers Introduce StarPO-S and RAGEN to Tackle Multi-Turn Reasoning and Collapse in Reinforcement Learning. Researchers have approached agent learning through StarPO (State-Thinking-Actions-Reward Policy Optimisation), a unified https://t.co/CVQ7182sTO
Why does multi-step learning of agents fail, and how can it be fixed? Here are the issues that researchers from @NorthwesternU, @Stanford and others found in their recent RAGEN study: ▪️ Training stability problem: In multi-turn scenarios like games, models often get stuck repeating https://t.co/6xRSK8tDsw
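The tweet is cut off, but the failure mode it points to (an agent collapsing into repeating the same move turn after turn) can be made concrete with a small check. A hedged sketch, not taken from the paper: score each rollout by how much of it is occupied by its single most frequent action, so collapsed rollouts can be flagged or down-weighted.

```python
from collections import Counter

def repetition_score(actions):
    """Fraction of the rollout occupied by the single most frequent action.
    A score near 1.0 suggests the agent has collapsed into repeating itself.
    `actions` is the list of actions the agent emitted, one per turn.
    """
    if not actions:
        return 0.0
    most_common_count = Counter(actions).most_common(1)[0][1]
    return most_common_count / len(actions)


# Toy usage: a varied rollout vs. a collapsed one.
print(repetition_score(["left", "up", "grab", "right", "up"]))    # 0.4
print(repetition_score(["left", "left", "left", "left", "left"]))  # 1.0
```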
Training LLM agents using Reinforcement Learning (RL) for multi-turn, interactive tasks often causes instability and performance collapse. This paper introduces the StarPO framework within the RAGEN system, optimizing entire interaction trajectories. It proposes StarPO-S with https://t.co/8ws9sjmGHu
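The exact stabilization mechanisms behind StarPO-S sit behind the truncated link, so the following is only an assumption of what stabilizing rollout selection could look like in practice (not necessarily what StarPO-S does): for each prompt, keep only groups of sampled rollouts whose rewards actually vary, since groups where every rollout scores the same carry little gradient signal and can reinforce collapse.

```python
import statistics

def filter_rollout_groups(groups, min_reward_std=1e-3):
    """Illustrative stability filter (an assumption, not the paper's stated
    method): drop groups of rollouts whose rewards barely vary, since they
    provide little learning signal and can amplify repetitive behavior.

    `groups` maps a prompt id to the list of rewards of its sampled rollouts.
    """
    kept = {}
    for prompt_id, rewards in groups.items():
        if len(rewards) > 1 and statistics.pstdev(rewards) >= min_reward_std:
            kept[prompt_id] = rewards
    return kept


# Toy usage: the second prompt's rollouts all scored the same and are dropped.
groups = {"p0": [1.0, 0.0, 0.5, 0.0], "p1": [0.0, 0.0, 0.0, 0.0]}
print(filter_rollout_groups(groups))  # {'p0': [1.0, 0.0, 0.5, 0.0]}
```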