
Google presents a model-stealing attack that extracts information from production language models such as ChatGPT and PaLM-2 using only API queries. The attack recovers precise details, including the hidden dimension and the final projection matrix, exposing a concrete vulnerability in deployed models. Separately, researchers uncover how unfamiliar finetuning examples shape the way large language models generate plausible but factually incorrect responses.





On Preventing Hallucinations And Creating More Robust LLM Systems: As we put more LLM apps into production, we find that preventing hallucinations is the biggest problem to overcome. Unlike humans, LLMs don't know what they don't know. There are multiple ways to prevent… https://t.co/HjYVJZ8nDZ
How much can you steal from an LLM API that returns logprobs? 🧵 In our new paper, collaborators noticed that the LLM vocab size is always bigger than the hidden dimension, so the logprob vectors lie inside a subspace of size equal to the hidden dimension, which means we can steal that dimension. https://t.co/nVW5HaGEIj
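To make the subspace observation concrete, here is a minimal sketch (not the paper's full attack) that simulates an API whose final layer maps a hidden state of size H to logits over a vocabulary of size V > H, then recovers H from the rank of a matrix of returned logit vectors. The vocab size, hidden dimension, and the `query_logits` stand-in are all assumptions for illustration, and it uses raw logits rather than logprobs (logprobs only differ by a per-query constant, which the paper handles).

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 4096, 256               # assumed vocab size and hidden dimension (hypothetical)
W = rng.normal(size=(V, H))    # hypothetical output projection matrix of the "API" model

def query_logits(prompt_seed: int) -> np.ndarray:
    """Stand-in for one API call: return the full logit vector for one prompt."""
    hidden = rng.normal(size=H)  # final hidden state for this prompt
    return W @ hidden            # logits always lie in an H-dimensional subspace of R^V

# Collect logit vectors for more prompts than the (unknown) hidden dimension.
n_queries = 512
Q = np.stack([query_logits(i) for i in range(n_queries)])  # shape (n_queries, V)

# Beyond index H the singular values collapse to ~0, revealing the hidden size.
s = np.linalg.svd(Q, compute_uv=False)
estimated_H = int(np.sum(s > s[0] * 1e-8))
print(f"estimated hidden dimension: {estimated_H} (true: {H})")
```

Under these assumptions the printed estimate matches the true hidden dimension, because every logit vector is a linear image of an H-dimensional hidden state, so the stacked query matrix can never exceed rank H.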
[LG] Unfamiliar Finetuning Examples Control How Language Models Hallucinate https://t.co/hsiGDxwjwa This study reveals the mechanisms by which large language models (LLMs) generate plausible but factually incorrect responses when queried on unfamiliar concepts. It was… https://t.co/i53y1oNRuS