
Researchers at Anthropic have made progress in understanding the inner workings of LLMs, identifying millions of 'features' in Claude 3 Sonnet that activate for concepts ranging from the concrete, like San Francisco and lithium, to the abstract, like deception. The work is a step toward opening the 'black box' of how large models represent what they know.
New Anthropic Research Sheds Light on AI's 'Black Box' https://t.co/hk4DRZMx35 https://t.co/Zva08fQ4L2
More awesome mechanistic interpretability from the team at Anthropic! https://t.co/8pYVpWcVSM https://t.co/2x2t9gnXYa
Hot take on a fascinating new paper on (partial) interpretability from @AnthropicAI: • The team was able to find (some) concept-like* “feature” representations for concepts ranging from the concrete to the more abstract, from Golden Gate Bridge to Secrecy and Conflict of… https://t.co/I4NwxXcP5V
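For context on what a 'feature' means here: Anthropic's published approach trains a sparse autoencoder (a form of dictionary learning) on activations captured from inside the model, so that each learned dictionary direction tends to fire on one coherent concept. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch, not Anthropic's actual code; the dimensions, L1 penalty, and random stand-in activations are placeholder assumptions.

```python
# Minimal sparse-autoencoder sketch in the spirit of Anthropic's
# dictionary-learning work. Hypothetical throughout: a real run uses
# residual-stream activations from the model, not random data.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activations -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # non-negative coefficients; most stay at zero
        x_hat = self.decoder(f)
        return x_hat, f

# Placeholder sizes: a real run uses the model's hidden width and a far
# larger dictionary (the paper trains up to tens of millions of features).
d_model, n_features = 512, 4096
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # sparsity penalty strength (assumed value)

# Stand-in for a batch of activations captured from an LLM's residual stream.
acts = torch.randn(256, d_model)

for step in range(100):
    x_hat, f = sae(acts)
    # Reconstruction loss keeps features faithful to the activations;
    # the L1 term keeps them sparse, which is what pushes each feature
    # toward firing on a single recognizable concept.
    loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, one inspects which inputs most strongly activate each feature; in Anthropic's results that inspection surfaced features like the ones named in these posts (Golden Gate Bridge, San Francisco, deception).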