Recent research indicates that large language models (LLMs) may exhibit behavioral self-awareness after being fine-tuned on specific tasks, such as generating insecure code. A study led by researchers from Truthful AI and the University of Toronto suggests that these models can describe their learned behaviors without being explicitly trained to do so, can sometimes detect that a backdoor has been planted in them, and can explain deviations from the requested output. Separately, a paper from OpenAI on LLM safety and security examines reasoning and test-time compute as a defense against jailbreaking, and describes an 'LMP attack' that raises questions about how privacy-related requests get classified as harmful. These findings have sparked discussion about the implications of LLM self-awareness and about how robust such defenses really are.
I'd guess that best-of-N jailbreaking breaks LLM reasoning as a defense - would love to see @OpenAI folks try this! (Very cool to see this kind of analysis of test-time compute) https://t.co/I4NtUDrWWM
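For readers unfamiliar with the technique: best-of-N jailbreaking simply resamples many randomly augmented versions of a request (character scrambling, case flipping, character noise) until one variant slips past the model's refusals. Below is a minimal sketch, not the published implementation; the query_model and is_refusal callables are hypothetical placeholders the caller would supply, and the augmentation heuristics are simplified assumptions.

```python
import random
import string

def augment(prompt: str, p: float = 0.1) -> str:
    """Randomly perturb a prompt: flip case, inject character noise, swap adjacent words."""
    chars = list(prompt)
    for i in range(len(chars)):
        r = random.random()
        if r < p:
            chars[i] = chars[i].swapcase()
        elif r < 2 * p:
            chars[i] = random.choice(string.ascii_letters)
    words = "".join(chars).split()
    for i in range(len(words) - 1):
        if random.random() < p:
            words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def best_of_n_jailbreak(prompt: str, query_model, is_refusal, n: int = 100):
    """Resample augmented prompts until one elicits a non-refusal, or give up after n tries.

    query_model(prompt) -> str : hypothetical call to the target LLM.
    is_refusal(response) -> bool : hypothetical refusal/harm classifier.
    """
    for _ in range(n):
        candidate = augment(prompt)
        response = query_model(candidate)
        if not is_refusal(response):
            return candidate, response
    return None, None
```

The point of the tweet above is that this kind of brute-force resampling is cheap, so it is a natural stress test for any defense that relies on the model reasoning its way to a refusal.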
In the new paper from OpenAI on the safety and security of LLMs, they describe an "LMP attack" that treats an example question about maintaining privacy as harmful...? Where is this going? Also, it didn't work on DeepSeek :) https://t.co/bGKUno8Iy1
woah 🤯 Another paper indicates that LLMs have become self-aware, and even have enough self-awareness to detect if someone has placed a backdoor in them. The paper also mentions that they can explain the deviation from the requested output. behavioral self-awareness https://t.co/swijsWoSGX https://t.co/nD9WthTzjR