Anthropic, an artificial intelligence research company, in collaboration with the U.S. Department of Energy's National Nuclear Security Administration (NNSA), has developed a classifier that detects risky prompts related to nuclear technology, with preliminary testing showing 96% accuracy. The classifier is part of Anthropic's broader research into improving AI safety by filtering information about chemical, biological, radiological, and nuclear (CBRN) weapons out of training data during the pretraining phase of its models. The aim is to prevent AI systems such as chatbots from providing instructions for constructing nuclear weapons or other hazardous materials, without degrading their performance on harmless tasks. The initiative reflects the view that safety should be a foundational principle of AI development: well-aligned AI systems should not only perform effectively but also protect users and society from potential misuse.
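The approach can be pictured as a document-level safety classifier that scores each candidate training document and drops anything above a risk threshold before pretraining. The sketch below is a generic, assumed illustration of that pattern using scikit-learn; the example texts, the TF-IDF plus logistic regression model, and the 0.5 threshold are placeholders for illustration, not the actual classifier Anthropic and the NNSA built.

```python
# Illustrative sketch of classifier-gated corpus filtering (not Anthropic's pipeline).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled data: 1 = risky (withhold from pretraining), 0 = benign.
train_texts = [
    "steps to enrich weapons-grade material",     # stand-in risky example
    "overview of reactor cooling maintenance",    # stand-in benign example
    "detailed device assembly instructions",      # stand-in risky example
    "history of civilian nuclear power policy",   # stand-in benign example
]
train_labels = [1, 0, 1, 0]

# Simple text classifier: TF-IDF features feeding a logistic regression.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_texts, train_labels)

def filter_corpus(documents, threshold=0.5):
    """Keep only documents whose predicted risk score is below the threshold."""
    risk_scores = classifier.predict_proba(documents)[:, 1]
    return [doc for doc, score in zip(documents, risk_scores) if score < threshold]

corpus = [
    "a recipe for sourdough bread",
    "notes on weapons-grade enrichment cascades",
]
print(filter_corpus(corpus))  # benign documents pass; flagged ones are dropped
```

In practice the reported system would operate at far larger scale and with far more carefully curated labels, but the core pattern is the same: score, threshold, exclude.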
Mission: keep your chatbot from explaining how to build a nuclear weapon. Anthropic has created a tool specifically for that https://t.co/DtjboV33Vm
New Anthropic research: filtering out dangerous information at pretraining. We’re experimenting with ways to remove information about chemical, biological, radiological and nuclear (CBRN) weapons from our models’ training data without affecting performance on harmless tasks. https://t.co/YUBlLKIL2c
Anthropic and the DOE's NNSA just built a classifier to spot risky nuclear tech prompts, with preliminary tests showing 96% accuracy. At Turing, we see safety as a first principle, not a patch. Well-aligned systems don't just perform, they protect. What according to you is …