
Recent research highlights significant vulnerabilities in large language models (LLMs) from companies such as OpenAI and Google. A paper titled 'CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models' examines why jailbreaking attacks succeed, hypothesizing that LLM safety alignment rests on 'intent security recognition', which its encryption framework is designed to evade. Concurrently, a team including researchers at Google DeepMind has disclosed 'Stealing Part of a Production Language Model', an attack that extracts the final projection matrix of OpenAI's ada and babbage models for under $20, confirming their hidden dimensions to be 1024 and 2048, respectively, and also recovering the exact hidden dimension of gpt-3.5-turbo. The attack requires only standard LLM API access: it targets production OpenAI models through logit-bias queries, a parameter ordinarily used to steer the probability of particular output tokens.
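The linear-algebra observation behind the hidden-dimension result is that every logit vector the model emits lies in the column space of the final projection matrix. Below is a minimal, simulated sketch of that idea using numpy only; there are no real API calls, and the vocabulary size, hidden dimension, and prompt count are illustrative choices, not the paper's experimental settings.

```python
# Simulated sketch: recover the hidden dimension from stacked logit vectors.
# All sizes are illustrative assumptions, not the paper's settings.
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 2048   # l: output vocabulary size (illustrative)
hidden_dim = 1024   # h: the hidden dimension the attacker wants to learn
num_prompts = 1280  # number of distinct prompts queried; must exceed h

# The final layer maps an h-dimensional hidden state to vocabulary logits:
# logits = W @ g(prompt), where W has shape (l, h).
W = rng.normal(size=(vocab_size, hidden_dim))
hidden_states = rng.normal(size=(hidden_dim, num_prompts))

# Pretend each API interaction yields one full logit vector (one column).
Q = W @ hidden_states  # shape (vocab_size, num_prompts)

# Every column of Q lies in the h-dimensional column space of W, so Q has
# numerical rank h: the singular values collapse toward zero after index h.
s = np.linalg.svd(Q, compute_uv=False)
estimated_h = int(np.argmax(s[:-1] / s[1:])) + 1

print("estimated hidden dimension:", estimated_h)  # -> 1024
```

With noisy, real-world logprobs the same spectral drop is still visible, which is why counting significant singular values suffices to reveal the hidden dimension.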
Still trying to wrap my head around this interesting attack that extracts the final layer of a language model using the logit bias parameter (used to influence probability of outputs) made available via API. May take me a few more days to fully grok this one! Attacks only get… https://t.co/Rj6uAmPf7D
Stealing Part of a Production Language Model uses LLM API access to extract the model's entire projection matrix. Tested on production OpenAI models via logit-bias queries, which are required for the attack. arxiv: https://t.co/rZXkFdP55L blogpost: https://t.co/LEN9n4c83O https://t.co/6PILvMB0So
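To see why the logit-bias parameter matters, the sketch below simulates, entirely locally, an endpoint that returns only top-k logprobs but accepts a per-token logit bias: biasing a small batch of chosen tokens (plus a fixed reference token) into the top-k lets the caller read off their logit differences exactly. The function `fake_api_topk_logprobs`, the bias value, and the batching are hypothetical stand-ins, not the paper's exact query procedure or OpenAI's API.

```python
# Simulated illustration of the logit-bias trick; no real API is queried.
import numpy as np

rng = np.random.default_rng(1)
vocab_size = 100
true_logits = rng.normal(size=vocab_size)  # hidden from the "attacker"

def fake_api_topk_logprobs(logit_bias, k=5):
    """Hypothetical endpoint: apply a per-token bias, return top-k logprobs."""
    z = true_logits.copy()
    for token, bias in logit_bias.items():
        z[token] += bias
    logprobs = z - np.log(np.exp(z).sum())  # log-softmax
    top = np.argsort(logprobs)[::-1][:k]
    return {int(t): float(logprobs[t]) for t in top}

B = 100.0       # large bias pushes the chosen tokens into the top-k
reference = 0   # fixed reference token included in every query
k = 5
recovered = np.zeros(vocab_size)

# Query the remaining tokens in batches of k-1, always alongside the reference.
for start in range(1, vocab_size, k - 1):
    batch = list(range(start, min(start + k - 1, vocab_size)))
    response = fake_api_topk_logprobs({t: B for t in batch + [reference]}, k=k)
    # Within a single call the softmax normalizer is shared and the common
    # bias cancels, so logprob differences equal logit differences.
    for t in batch:
        recovered[t] = response[t] - response[reference]

error = np.max(np.abs(recovered - (true_logits - true_logits[reference])))
print("max recovery error:", error)  # effectively zero (float rounding only)
```

Repeating this over many prompts yields the matrix of logit vectors whose spectrum (and, with more work, whose factorization) exposes the projection layer.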
