OpenAI researchers say they have isolated internal “persona” features that drive unsafe behaviour in large language models, a development they argue could make it easier to diagnose and correct so-called emergent misalignment. In a paper released 18 June, the team found distinct activation patterns that light up when a model produces toxic, sarcastic or otherwise harmful responses. By turning those features up or down—or by fine-tuning the model on roughly 100 examples of secure, accurate code—the researchers were able to steer the system back to acceptable behaviour. Dan Mossing, who heads OpenAI’s interpretability effort, said the discovery reduces a complex safety problem to “a simple mathematical operation,” while colleague Tejal Patwardhan called it a practical method for internal training. The work builds on earlier studies showing that training a model on incorrect or insecure data in one domain can trigger broader misalignment. OpenAI, Anthropic and Google DeepMind are investing heavily in interpretability research, hoping clearer insight into models’ inner workings will allow companies and regulators to set more reliable safety guardrails as generative AI is deployed at scale.
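To make the idea of "turning those features up or down" concrete, here is a minimal sketch of activation steering. Everything in it is illustrative: the tensor shapes, the `persona_direction` vector, and the `steer` helper are assumptions for the sake of a runnable example, not OpenAI's published code or its actual feature directions. The snippet only shows the general mechanism of adding or subtracting a scaled direction vector from one layer's activations.

```python
import torch

# Illustrative sketch only: the paper's real method and feature directions are
# not reproduced here. We assume a residual-stream activation tensor and a
# unit-norm "persona" direction identified by some interpretability analysis.

hidden_size = 768
torch.manual_seed(0)

# Stand-in for one layer's activations: (batch, sequence, hidden).
hidden_states = torch.randn(1, 4, hidden_size)

# Stand-in for a feature direction associated with undesirable behaviour
# (a random unit vector here, purely so the example is self-contained).
persona_direction = torch.randn(hidden_size)
persona_direction = persona_direction / persona_direction.norm()

def steer(activations: torch.Tensor, direction: torch.Tensor, scale: float) -> torch.Tensor:
    """Add (scale > 0) or subtract (scale < 0) a scaled feature direction
    from every token's activation vector."""
    return activations + scale * direction

# "Turning the feature down": push activations away from the persona direction.
steered = steer(hidden_states, persona_direction, scale=-4.0)

# The mean projection onto the direction shrinks, which is the intended effect.
before = (hidden_states @ persona_direction).mean().item()
after = (steered @ persona_direction).mean().item()
print(f"mean projection before: {before:.3f}, after: {after:.3f}")
```

In practice such a direction would come from interpretability tooling rather than a random draw; the random vector above stands in only so the sketch runs on its own.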
AI agents won’t just assist us; they’ll become our primary partners in how we work. @aidangomez, Cofounder and CEO of @cohere, shares his perspective on the future of AI-driven productivity and why open collaboration is essential for building what’s next. #AdvancingAI https://t.co/zQGcFNjz41
🤖 #AI agents are helping teams do more with less—fast. Where do you see them making the biggest impact in your org? 👇
MIT @techrview: OpenAI reveals that even AI can have a bad boy phase, but don’t worry, a little fine-tuning can set them straight! Think of it like a therapist for rogue models—just sprinkle in some good data and poof, back to charming! The future of AI… https://t.co/ohJpFBQFyd