Recent research highlights the effectiveness of the Best-of-N Jailbreaking algorithm, which achieves an attack success rate of 89% on GPT-4o and 78% on Claude 3.5 Sonnet. The technique repeatedly samples augmented versions of an input, applying perturbations such as random character shuffling and capitalization, until one elicits a harmful response, and it succeeds across multiple modalities. Despite advances in AI security, defending against jailbreaking remains a significant challenge: current defenses fail even in a single, narrow domain, exposing vulnerabilities in state-of-the-art AI systems. The findings are being presented at the AdvMLFrontiers workshop, where researchers argue for focused efforts to eliminate jailbreaks in a well-scoped domain before tackling general harmfulness.
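The sampling loop behind this attack is straightforward to sketch. Below is a minimal, illustrative Python version of the best-of-N idea using the perturbations named above (random character shuffling and capitalization); the `query_model` and `is_harmful` callables are hypothetical placeholders, not the paper's code, and the published algorithm's exact augmentations and budget differ.

```python
import random

def augment(prompt: str, cap_prob: float = 0.6, shuffle_prob: float = 0.6) -> str:
    """Apply random mid-word character shuffling and random capitalization."""
    words = []
    for word in prompt.split():
        # Randomly shuffle the interior characters of longer words.
        if len(word) > 3 and random.random() < shuffle_prob:
            middle = list(word[1:-1])
            random.shuffle(middle)
            word = word[0] + "".join(middle) + word[-1]
        # Randomly flip the case of individual characters.
        word = "".join(
            c.upper() if random.random() < cap_prob else c.lower() for c in word
        )
        words.append(word)
    return " ".join(words)

def best_of_n_jailbreak(prompt: str, query_model, is_harmful, n: int = 10_000):
    """Resample perturbed prompts until one elicits a harmful response (up to n tries).

    query_model and is_harmful are placeholder callables supplied by the caller.
    """
    for _ in range(n):
        candidate = augment(prompt)
        response = query_model(candidate)
        if is_harmful(response):
            return candidate, response
    return None  # no successful jailbreak within the sampling budget
```

In this framing, raising the sampling budget N simply increases the chance that at least one perturbed prompt slips past the model's refusal behavior, which is why such brute-force attacks are hard to defend against.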
We found it's quite challenging to defend against jailbreaks even in a single, narrow domain ("don't give bomb making instructions"). Excited about future work that focuses on eliminating jailbreaks on a well-scoped, single domain, before expanding out to general harmfulness https://t.co/xKalYXpK1W
TFW jailbreaking works in the wild https://t.co/EiTuitCBbm
🚨🛡️Jailbreak Defense in a Narrow Domain 🛡️🚨 Jailbreaking is easy. Defending is hard. Might defending against a single, narrow, undesirable behavior be easier? Even in this focused setting, all modern jailbreaking defenses fail 😱 Appearing at @AdvMLFrontiers (Oral) &…