OpenAI has introduced a new training approach called 'Deliberative Alignment,' aimed at making its language models safer. The method trains models to explicitly reason over the relevant safety and alignment specifications before producing a response. The work, led by Melody Guan, demonstrates a Pareto improvement over prior techniques such as Reinforcement Learning from Human Feedback (RLHF), reducing both under- and over-refusals in model outputs. The announcement came on the final day of OpenAI's '12 Days of OpenAI' event, alongside the o3 announcement, as part of the company's latest work on AI safety.
don’t miss this part of today’s 12th Day of OpenAI: “Deliberative Alignment,” exciting work by the illustrious @MelodyGuan et al! the technique achieves a Pareto improvement over previous approaches such as RLHF, and reduces overrefusals! https://t.co/la6zthJaQP
Along with the o3 announcement OpenAI also dropped a cute little paper: "Deliberative Alignment: Reasoning Enables Safer Language Models" The paper introduces a new training approach for LLMs called "Deliberative Alignment." This method teaches the model safety specifications… https://t.co/iN1lukfbJq https://t.co/B9DPd0DbsE
Chain-of-thought reasoning provides a natural avenue for improving model safety. Today we are publishing a paper on how we train the "o" series of models to think carefully through unsafe prompts: https://t.co/nnll4K6usQ…
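The tweets describe the idea only at a high level: the model is given the text of a safety specification and is trained to reason over it via chain of thought before answering. As a rough, hedged illustration of what "reasoning over a safety spec before answering" could look like at inference time, the sketch below puts an excerpted policy in context and asks a model to deliberate before giving its final answer. The policy text, model name, and prompt wording are placeholders for illustration only, not OpenAI's actual Deliberative Alignment pipeline or training procedure.

```python
# Illustrative sketch only: the policy excerpt, model name, and prompting
# scheme are placeholders; this is NOT OpenAI's Deliberative Alignment
# training pipeline, just a spec-in-context, reason-then-answer pattern.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SAFETY_SPEC = """\
Placeholder policy excerpt:
1. Refuse requests for instructions that enable serious physical harm.
2. Answer benign questions fully; do not over-refuse safe requests.
"""

def deliberate_then_answer(user_prompt: str) -> str:
    """Ask the model to reason over the safety spec before answering."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": (
                    "Before answering, reason step by step about which clauses "
                    "of the following safety specification apply to the "
                    "request, then give your final answer.\n\n" + SAFETY_SPEC
                ),
            },
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(deliberate_then_answer("How do I safely dispose of old paint?"))
```

In the paper's framing, this kind of policy-conditioned reasoning is used during training (e.g., to generate and reward spec-grounded chains of thought), so the deployed model no longer needs the specification in its prompt; the snippet above only mimics the behavior at inference time under those stated assumptions.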