Anthropic has released research detailing how specific patterns of neural activity—dubbed “persona vectors”—govern persistent behavioural traits in large language models, including tendencies toward sycophancy, hallucination and overtly “evil” conduct. By isolating the neural signatures associated with each trait, the study shows that developers can predict when problematic personas will emerge and, in a reversal of conventional practice, inoculate models during training by deliberately activating the unwanted vector and then deleting it at deployment. Early tests suggest the approach preserves overall performance while reducing the risk of toxic or misleading outputs. The paper, produced under the company’s six-month Anthropic Fellows programme and led by interpretability researcher Jack Lindsey, underpins a planned “AI psychiatry” initiative aimed at monitoring and guiding model behaviour at scale. Although the experiments used smaller systems, Anthropic says the technique could one day help commercial chatbots avoid the unpredictable personality shifts that have plagued rival offerings.
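For readers who want a concrete picture of the mechanism described above, the following is a minimal sketch of the general difference-of-means activation-steering idea behind work like this, written in Python with Hugging Face Transformers. The model name, layer index, contrastive prompts, and steering coefficient are illustrative assumptions for a toy example, not Anthropic's actual setup or code.

```python
# Toy sketch: extract a "persona vector" as a difference of mean activations
# between prompts that exhibit a trait and prompts that do not, then use it
# to (a) monitor the trait and (b) steer it via a forward hook.
# Model, layer, prompts, and coefficient below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder small model, not the systems used in the paper
LAYER = 6            # hidden layer to probe (assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def mean_hidden_state(prompts):
    """Average the chosen layer's activation at the final token of each prompt."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        states.append(out.hidden_states[LAYER][0, -1, :])
    return torch.stack(states).mean(dim=0)

# Contrastive prompt sets that do / do not exhibit the trait (toy examples).
sycophantic = ["You're absolutely right, what a brilliant idea!",
               "I completely agree with everything you said."]
neutral     = ["I think there are some problems with that plan.",
               "Here is an objective summary of the evidence."]

# The "persona vector": a direction in activation space separating the two sets.
persona_vector = mean_hidden_state(sycophantic) - mean_hidden_state(neutral)

# Monitoring: score how strongly a new input activates that direction.
def trait_score(prompt):
    h = mean_hidden_state([prompt])
    return torch.cosine_similarity(h, persona_vector, dim=0).item()

# Steering: add or subtract the vector at that layer via a forward hook.
def steer(module, inputs, output, coeff=-4.0):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coeff * persona_vector
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

hook = model.transformer.h[LAYER].register_forward_hook(steer)
# ... generate text with the trait suppressed, then remove the hook ...
hook.remove()
```

The same projection used for monitoring is what would let a developer flag a rising trait score before it shows up in outputs, and the removable hook mirrors, in spirit, the "activate during training, switch off at deployment" inoculation described above.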
New Paper by Anthropic - Persona Vectors: Monitoring and Controlling Character Traits in Language Models. In a new paper, Anthropic researchers identify patterns of activity within an AI model's neural network, called "persona vectors," that control its character traits and can be used to monitor and control them.
Anthropic studied what gives an AI system its ‘personality’ — and what makes it ‘evil’ https://t.co/zPDgEtBHm3
This is more neat research from Anthropic, providing a lot of ways for careful organizations to shape the personality and guardrails of AI in deeper ways than prompts, including measuring and reducing sycophancy. Also the idea of an "evil vector" is interesting in and of itself. https://t.co/kkgFrxj46a