
In 2024, researchers E. Choi, A. Ahmadian, M. Geist, O. Pietquin, and M. G. Azar from Cohere introduced Self-Improving Robust Preference Optimization (SRPO). SRPO trains a separate large language model, referred to as the in-context LLM, on a preference dataset to generate better responses conditioned on a given prompt and the current LLM's output. In other words, it learns a self-improvement policy that revises suboptimal samples toward more preferred ones, thereby optimizing alignment. The approach aims to address the limitations of existing alignment methods, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), which rely heavily on extensive human annotation and lack transparency in how they enforce behaviors, limiting their scalability and adaptability.
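To make the idea concrete, here is a minimal sketch of the workflow described above: preference pairs are turned into training examples for a self-improvement model that rewrites a suboptimal draft into the preferred response, and at inference time the base LLM's output is passed through that model for revision. All names (PreferenceExample, build_improvement_dataset, revise, and the prompt template) are illustrative assumptions, not the authors' implementation.

```python
# Sketch of an SRPO-style self-improvement pipeline (illustrative only).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferenceExample:
    prompt: str
    chosen: str    # human-preferred response
    rejected: str  # less-preferred response


def build_improvement_dataset(prefs: List[PreferenceExample]) -> List[dict]:
    """Turn preference pairs into (input, target) examples for the
    self-improvement (in-context) model: given the prompt and a
    suboptimal draft, the target is the preferred response."""
    return [
        {
            "input": (f"Prompt: {ex.prompt}\nDraft response: {ex.rejected}\n"
                      "Revise the draft into a better response:"),
            "target": ex.chosen,
        }
        for ex in prefs
    ]


def revise(base_generate: Callable[[str], str],
           improver_generate: Callable[[str], str],
           prompt: str) -> str:
    """Inference-time pipeline: sample a draft from the base LLM,
    then let the self-improvement model rewrite it."""
    draft = base_generate(prompt)
    return improver_generate(
        f"Prompt: {prompt}\nDraft response: {draft}\n"
        "Revise the draft into a better response:"
    )


if __name__ == "__main__":
    data = [PreferenceExample(
        prompt="Explain RLHF briefly.",
        chosen="RLHF fine-tunes a model with a reward learned from "
               "human preference labels.",
        rejected="RLHF is a thing people do.")]
    for row in build_improvement_dataset(data):
        print(row["input"], "->", row["target"])
```

In this sketch, the revision step is what the article calls the self-improvement policy: it conditions on both the prompt and the current output rather than generating from the prompt alone.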


