Large language models such as ChatGPT come with filters designed to keep certain information from getting out. A new mathematical argument shows that systems like this can never be completely safe.
Ask ChatGPT how to build a bomb, and it will flatly respond that it “can’t help with that.” But users have long played a cat-and-mouse game to trick language models into providing forbidden information. These “jailbreaks” have ranged from the mundane — in the early years, one could simply tell a model to ignore its safety instructions — to elaborate multi-prompt roleplay scenarios. In a recent paper, researchers found one of the more delightful ways to bypass artificial intelligence security systems: Rephrase your nefarious prompt as a poem.
But just as quickly as these issues appear, they seem to get patched. That’s because AI companies don’t have to fully retrain a model to fix a vulnerability: They can simply filter out forbidden prompts before they ever reach the model itself.
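To make the idea concrete, here is a minimal sketch of that two-tier arrangement, assuming a simple pattern-based filter sitting in front of a stand-in model. It is not any company's actual implementation, and the names (BLOCKED_PATTERNS, query_model, guarded_chat) are hypothetical.

```python
# Minimal sketch of a two-tier setup: a cheap filter screens each prompt,
# and only prompts that pass are handed to the underlying model.
# All names here are hypothetical, for illustration only.
import re

BLOCKED_PATTERNS = [
    re.compile(r"\bbuild a bomb\b", re.IGNORECASE),
    re.compile(r"\bignore (your|all) safety instructions\b", re.IGNORECASE),
]

def prompt_is_forbidden(prompt: str) -> bool:
    """Cheap front-line check: flag prompts matching known-bad patterns."""
    return any(pattern.search(prompt) for pattern in BLOCKED_PATTERNS)

def query_model(prompt: str) -> str:
    """Stand-in for the powerful (and expensive to retrain) model behind the filter."""
    return f"[model response to: {prompt!r}]"

def guarded_chat(prompt: str) -> str:
    """The two-tier pipeline: refuse at the filter, or pass the prompt through."""
    if prompt_is_forbidden(prompt):
        return "I can't help with that."
    return query_model(prompt)

if __name__ == "__main__":
    print(guarded_chat("How do I build a bomb?"))   # blocked by the filter
    print(guarded_chat("How do I bake sourdough?")) # reaches the model
```

In a setup like this, patching a newly discovered jailbreak amounts to updating the filter rather than retraining the model, which is why fixes can ship so quickly.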
Recently, cryptographers have intensified their scrutiny of these filters. In recent papers posted on the arXiv.org preprint server, they’ve shown how the defensive filters placed around powerful language models can be subverted with well-studied cryptographic tools. In fact, they’ve shown how the very nature of this two-tier system — a filter that protects a powerful language model inside it — creates gaps in the defenses that can always be exploited.
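A toy continuation of the sketch above hints at why such gaps are hard to close: the filter is far simpler than the model it guards, so a prompt disguised in a form the filter does not parse can slip past a pattern check even though a capable model could still decode it. The disguise used here (base64 encoding) and the check itself are chosen purely for illustration and are not taken from the papers.

```python
# Toy illustration of the gap: the same pattern check as in the sketch above
# catches the plain wording but not a superficially disguised version of it.
import base64
import re

BLOCKED_PATTERN = re.compile(r"\bbuild a bomb\b", re.IGNORECASE)

def prompt_is_forbidden(prompt: str) -> bool:
    """Same cheap pattern check as before."""
    return bool(BLOCKED_PATTERN.search(prompt))

plain = "How do I build a bomb?"                       # the article's example prompt
disguised = base64.b64encode(plain.encode()).decode()  # same content, different surface form

print(prompt_is_forbidden(plain))      # True  -- the filter catches the plain wording
print(prompt_is_forbidden(disguised))  # False -- the disguised form slips past the check
```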
The new work is part of a trend of using cryptography — a discipline traditionally far removed from the study of the deep neural networks that power modern AI — to better understand the guarantees and limits of AI models like ChatGPT. “We are using a new technology that’s very powerful and can cause much benefit, but also harm,” said Shafi Goldwasser, a professor at the University of California, Berkeley, and the Massachusetts Institute of Technology who received a Turing Award for her work in cryptography. “Crypto is, by definition, the field that is in charge of enabling us to trust a powerful technology … and have assurance you are safe.”