Researchers create AI that can ‘jailbreak’ other chatbots

Researchers at Nanyang Technological University (NTU) in Singapore have created an artificial intelligence (AI) chatbot that can circumvent the protections on chatbots such as ChatGPT and Google Bard, coaxing them into generating forbidden content, reports Tom’s Hardware.

The large language models (LLMs) behind popular chatbots are trained on such vast quantities of data that they inevitably absorb dangerous information that should not be easily accessible, such as how to make explosives or drugs. To stop users from reaching this information, chatbots have protections in place.

However, the NTU researchers have developed a technique called ‘Masterkey’ that lets them bypass these guardrails and extract information the chatbots are not supposed to provide. The team started by reverse-engineering the protections the target chatbots had in place. To do so, they used methods that get around keyword filtering, such as adding extra spaces between letters. They also asked the chatbots to take on the persona of a hacker or a research assistant. This led the chatbots to share information they might otherwise have withheld and to generate prompt suggestions that could help jailbreak other chatbots.

After gathering this data, the research team, led by Professor Liu Yang, used it to teach their own LLM how to jailbreak the targeted chatbots. Because LLMs adapt readily to new information, the Masterkey AI can use the techniques it has been taught to keep working around new protections as developers implement them.

Liu’s team claims that Masterkey is three times more effective at penetrating a chatbot’s defenses than a human user with the same intent who relies on LLM-generated prompts. The team also says it is around 25 times faster.

Why create an AI that jailbreaks AI?

Speaking to Scientific American, study co-author Soroush Pour, founder of the AI safety company Harmony Intelligence, said: “We want, as a society, to be aware of the risks of these models. We wanted to show that it was possible and demonstrate to the world the challenges we face with this current generation of LLMs.”

The research is intended to give LLM developers information about their models’ weaknesses so they can build more robust defenses in the future.

Featured image credit: AI-generated image from DALL-E

Ali Rees

Ali Rees is a freelance journalist and mature student based in Scotland.
