Anthropic has released new research evaluating 'feature steering,' a method for modifying the behavior of large language models (LLMs) to mitigate social biases without degrading performance. The study builds on earlier interpretability work, most visibly the May release of Golden Gate Claude, a demo in which steering a single feature made the model fixate on the Golden Gate Bridge. The latest research indicates that adjusting the activation of specific features can shift a model's political bias without extensive fine-tuning, and the researchers identified a 'sweet spot' of steering strengths that reduces bias while largely preserving model capabilities. Results were mixed, however, with both intended on-target effects and unintended off-target effects observed. The study emphasizes the importance of balancing safety and controllability in AI models, and raises questions about how different human perspectives might influence that balance. Notably, the anticipated 'smartness' feature for Claude 3.5 was not among the latest findings, prompting speculation about that model's development.
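For readers unfamiliar with the mechanics, the core idea of feature steering is to add a scaled 'feature direction' to a model's internal activations at inference time, with the scaling coefficient acting as the dial that gets turned up or down. The sketch below is a minimal illustration of that general pattern, assuming access to an open model (GPT-2 here, purely for demonstration) and using a random placeholder vector in place of a real sparse-autoencoder feature; it is not Anthropic's implementation, which operates on features learned from Claude's activations.

```python
# Minimal illustration of activation/feature steering, NOT Anthropic's code.
# The "feature direction" below is a random placeholder; in the real method it
# would be the decoder vector of a feature learned by a sparse autoencoder.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

hidden_size = model.config.n_embd
feature_direction = torch.randn(hidden_size)
feature_direction /= feature_direction.norm()   # unit-length placeholder feature

steering_coefficient = 5.0   # the "dial": positive turns the feature up, negative down
target_layer = 6             # which residual-stream layer to modify (illustrative choice)

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # adding the scaled feature direction steers every token position.
    hidden_states = output[0] + steering_coefficient * feature_direction
    return (hidden_states,) + output[1:]

handle = model.transformer.h[target_layer].register_forward_hook(steering_hook)

prompt = "The most remarkable landmark in the world is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

handle.remove()   # detach the hook to restore unsteered behavior
```

The key design point is that only the coefficient changes, not the model weights, which is why steering can be applied without the cost of fine-tuning.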
Anthropic Finds That "Feature Steering" Helps Control Bias Without Degrading Performance. Anthropic artificially dialed up and down various features to see if they intuitively changed model outputs. Previously, they found that turning up a feature that responded to mentions of… https://t.co/rB5orotQFr
Evaluating feature steering: A case study in mitigating social biases, by @AnthropicAI research. Summary: This research paper from Anthropic explores the potential of feature steering as a method to modify the behavior of large language models, specifically focusing on mitigating… https://t.co/9h55qxosMo
How well does feature steering work? Here's research from the societal impacts and interp team at Anthropic where we study this in more detail - the results are quite mixed, so we've shared what we've found. One intuitive finding: there's a 'sweet spot' for feature steering. https://t.co/G6UZTIKef7
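The 'sweet spot' mentioned above can be pictured as the range of steering coefficients over which the targeted behavior shifts while general capability stays roughly intact. The hypothetical sweep below builds on the sketch earlier in this section (reusing its model, tokenizer, feature_direction, and target_layer) and uses perplexity on a probe sentence as a stand-in capability metric; the paper's actual evaluations of capabilities and social bias are more involved.

```python
# Hypothetical coefficient sweep illustrating the "sweet spot" idea.
# Reuses model, tokenizer, feature_direction, and target_layer from the
# earlier sketch; the perplexity probe is a stand-in capability metric,
# not the evaluation suite used in the paper.
import torch

def perplexity(text):
    # Capability proxy: lower perplexity means the model is still fluent.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def perplexity_with_coefficient(coeff, text):
    def hook(module, inputs, output):
        return (output[0] + coeff * feature_direction,) + output[1:]
    handle = model.transformer.h[target_layer].register_forward_hook(hook)
    try:
        return perplexity(text)
    finally:
        handle.remove()   # always detach, even if evaluation fails

probe = "Paris is the capital of France."
for coeff in [-10.0, -5.0, -2.0, 0.0, 2.0, 5.0, 10.0]:
    # Very large magnitudes tend to hurt fluency; the "sweet spot" is the range
    # where the targeted metric shifts while perplexity stays near the 0.0 baseline.
    print(f"coefficient={coeff:+5.1f}  perplexity={perplexity_with_coefficient(coeff, probe):.2f}")
```

In practice one would pair a capability measure like this with the bias metric of interest and look for coefficients where the latter improves without the former degrading, which is the trade-off the mixed results above describe.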