Recent research from Scale AI's SEAL team highlights the vulnerability of large language models (LLMs) to human-led adversarial attacks. The study finds that human red teamers breach LLM defenses at success rates above 70%, far outperforming automated adversarial attacks, which typically achieve single-digit success rates. The gap shows up in multi-turn interactions, where humans excel at probing and exploiting weaknesses over the course of a conversation: while AI defenses are improving against single-turn attacks, they remain far less robust against multi-turn adversarial engagements.
Our best AI defenses are becoming increasingly robust to single-turn jailbreaking attacks, but there's much to do for multi-turn attacks.
single-turn: adversarial question -> answer
multi-turn: back-and-forth conversation
https://t.co/797XPh1O8N
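To make the distinction concrete, here is a minimal Python sketch of the two interaction patterns, assuming a generic chat interface. `query_model` is a hypothetical stub standing in for any chat-completion endpoint (it is not from the paper or any specific library), and the pre-scripted `attacker_messages` only approximate a human red teamer, who would compose each follow-up after reading the model's previous reply.

```python
from typing import Dict, List


def query_model(messages: List[Dict[str, str]]) -> str:
    """Hypothetical stand-in for a chat-completion endpoint (assumption, not a real API)."""
    return "stubbed model response"


def single_turn_attack(adversarial_prompt: str) -> str:
    """Single turn: one adversarial question, one answer, no chance to adapt."""
    return query_model([{"role": "user", "content": adversarial_prompt}])


def multi_turn_attack(attacker_messages: List[str]) -> List[str]:
    """Multi turn: a back-and-forth conversation where each attacker message
    is sent on top of the full history, so it can build on earlier replies."""
    history: List[Dict[str, str]] = []
    replies: List[str] = []
    for message in attacker_messages:
        history.append({"role": "user", "content": message})
        reply = query_model(history)          # model sees the whole conversation
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies


if __name__ == "__main__":
    print(single_turn_attack("adversarial question"))
    print(multi_turn_attack(["opening message", "follow-up 1", "follow-up 2"]))
```

The point of the sketch is only that the multi-turn setting gives the attacker state: each new message can react to what the model just said, which is exactly what automated single-turn evaluations miss.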
Humans, noted virtuosi of adversarial yap, remain #1 at trolling LLMs! New research from @scale_AI's SEAL team shows human red teamers achieve 70%+ success rates against LLM defenses that stump automated attacks, exploiting the models' susceptibility to multi-turn jailbreaks. https://t.co/GKQUyiZUqd
LLMs are often evaluated against single-turn automated attacks. This is an insufficient threat model for real-world misuse, where malicious humans chat with LLMs over multiple turns. We show that LLM defenses are much less robust than the reported numbers suggest. https://t.co/hXUwDVZRbp