Anthropic, in collaboration with the U.S. Department of Energy’s National Nuclear Security Administration (NNSA), has developed a classifier designed to identify risky prompts related to nuclear technology, achieving a preliminary accuracy rate of 96%. This initiative is part of Anthropic’s broader research efforts to enhance AI safety by filtering harmful information about chemical, biological, radiological, and nuclear (CBRN) weapons out of the training data before model pretraining. Their approach has demonstrated a 33% relative reduction in harmful capability without compromising performance on benign tasks such as science and coding. The method aims to reduce potential harm by preventing AI models from learning dangerous knowledge in the first place. The emphasis on safety as a foundational principle reflects a commitment to developing well-aligned AI systems that not only perform effectively but also protect against misuse. Related research in AI safety includes parameter-efficient methods for persistent concept unlearning, such as CRISP, which aims to ensure unwanted knowledge is thoroughly removed from large language models (LLMs).
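Conceptually, this kind of pretraining data filtering is a score-and-drop pass over the corpus: each document is scored by a risk classifier and discarded if it exceeds a threshold. The sketch below is illustrative only; the `cbrn_risk_score` stub, its keyword heuristic, and the 0.5 cutoff are assumptions standing in for Anthropic's actual (unpublished) classifier and thresholds.

```python
from typing import Iterable, Iterator

RISK_THRESHOLD = 0.5  # illustrative cutoff, not a published value


def cbrn_risk_score(text: str) -> float:
    """Stand-in scorer: a real pipeline would call a trained classifier
    (e.g. the Anthropic/NNSA nuclear-risk model) and return P(harmful).
    Here a trivial keyword heuristic keeps the example self-contained."""
    risky_terms = ("enrichment cascade", "weapons-grade", "gas centrifuge")
    hits = sum(term in text.lower() for term in risky_terms)
    return min(1.0, hits / len(risky_terms))


def filter_pretraining_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Yield only documents whose risk score falls below the threshold."""
    for doc in docs:
        if cbrn_risk_score(doc) < RISK_THRESHOLD:
            yield doc


if __name__ == "__main__":
    corpus = [
        "Photosynthesis converts light energy into chemical energy.",
        "Gas centrifuge enrichment cascade steps for weapons-grade material.",
    ]
    for kept in filter_pretraining_corpus(corpus):
        print(kept)
```

In a production pipeline the scorer would be a fine-tuned classifier applied at corpus scale, with the threshold chosen to balance removal of CBRN-related material against retention of benign scientific text.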
MCP-Universe: Benchmarking LLMs with Real-World Model Context Protocol Servers - Evaluates 10 state-of-the-art LLMs on 133 real MCP tools across 6 domains; top model (GPT-5) achieves 43.72% success, with most models below 35% in complex domains - Reveals core limitations: long https://t.co/ClGt2CSEYT
Introducing CRISP: A new parameter-efficient method for persistent concept unlearning in LLMs, ensuring unwanted knowledge is truly removed. Safer, more reliable AI is here. https://t.co/TPzk5Sc4Xt
[LG] Do What? Teaching Vision-Language-Action Models to Reject the Impossible W Hsieh, E Hsieh, D Niu, T Darrell... [UC Berkeley] (2025) https://t.co/J57liC9C37 https://t.co/RAboWSrhji