
In 2024, researchers E. Choi, A. Ahmadian, M. Geist, O. Pietquin, and M. G. Azar from Cohere introduced Self-Improving Robust Preference Optimization (SRPO). SRPO trains a separate large language model, referred to as the in-context LLM, on a preference dataset to generate better responses conditioned on a given prompt and the current LLM's output. In other words, it learns a self-improvement policy that revises suboptimal samples toward more preferred ones, thereby optimizing alignment. The approach aims to address the limitations of existing alignment methods, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), which rely heavily on extensive human annotation and lack transparency in how they enforce behaviors, limiting their scalability and adaptability.
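To make the idea concrete, here is a minimal sketch of the workflow described above: preference pairs are turned into training examples for a self-improvement model that rewrites a suboptimal draft into the preferred response, and at inference time the base LLM's output is passed through that model for revision. All names (PreferenceExample, build_improvement_dataset, revise, and the prompt template) are illustrative assumptions, not the authors' implementation.

```python
# Sketch of an SRPO-style self-improvement pipeline (illustrative only).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferenceExample:
    prompt: str
    chosen: str    # human-preferred response
    rejected: str  # less-preferred response


def build_improvement_dataset(prefs: List[PreferenceExample]) -> List[dict]:
    """Turn preference pairs into (input, target) examples for the
    self-improvement (in-context) model: given the prompt and a
    suboptimal draft, the target is the preferred response."""
    return [
        {
            "input": (f"Prompt: {ex.prompt}\nDraft response: {ex.rejected}\n"
                      "Revise the draft into a better response:"),
            "target": ex.chosen,
        }
        for ex in prefs
    ]


def revise(base_generate: Callable[[str], str],
           improver_generate: Callable[[str], str],
           prompt: str) -> str:
    """Inference-time pipeline: sample a draft from the base LLM,
    then let the self-improvement model rewrite it."""
    draft = base_generate(prompt)
    return improver_generate(
        f"Prompt: {prompt}\nDraft response: {draft}\n"
        "Revise the draft into a better response:"
    )


if __name__ == "__main__":
    data = [PreferenceExample(
        prompt="Explain RLHF briefly.",
        chosen="RLHF fine-tunes a model with a reward learned from "
               "human preference labels.",
        rejected="RLHF is a thing people do.")]
    for row in build_improvement_dataset(data):
        print(row["input"], "->", row["target"])
```

In this sketch, the revision step is what the article calls the self-improvement policy: it conditions on both the prompt and the current output rather than generating from the prompt alone.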


