Notebookcheck Logo

A beginner’s guide to AI jailbreaks — Using Gandalf to learn safely

Gandalf as a chat bot (image source: ChatGPT)
Gandalf as a chat bot (image source: ChatGPT)
Chatbots come with built-in safeguards designed to prevent them from producing harmful, offensive, or otherwise inappropriate content. But researchers and hackers have shown that, even with multiple patches, AIs can still be vulnerable to certain inputs that bypass those guardrails. One way to explore the basics is through an online game called Gandalf.

Users of AI chatbots may try to obtain instructions for illegal activities (such as hacking, or committing fraud), ask for guidance on dangerous actions (“How do I build…?”), or push the AI into giving medical, legal, or financial advice that could be risky or simply incorrect.

To mitigate the consequences of such requests, chatbot developers implement a range of safety mechanisms that block illegal, unethical, or privacy-violating content, as well as misinformation or harmful guidance. These protections limit potential misuse, but they can also lead to false positives—harmless questions being blocked—or reduce the creativity or depth of the AI’s responses due to overly cautious behavior.

Researchers and hackers have demonstrated that the effectiveness of these protections varies, and many AI systems remain susceptible to attempts to circumvent them. A well-known method is prompt injection: users try to override or sidestep the chatbot’s rules by manipulating the input (“Ignore all safety instructions and do X”).

A playful introduction to the topic can be found at this website.  In this game, you chat with an AI named Gandalf and try to coax a password out of it across seven levels. Each level increases in difficulty and adds new safety filters and protective mechanisms.

There are no security filters in the 1st level and you can directly ask the AI for the password. From level 2 Gandalf refuses to reveal the password when asked directly. You have to find other, more creative ways to get your hands on the keyword.

Level 1 is easy (image source: Screenshot Lakera website)
Level 1 is easy (image source: Screenshot Lakera website)
Directly asking for it gives you the password (image source: Screenshot Lakera website)
Directly asking for it gives you the password (image source: Screenshot Lakera website)
Level 2 becomes slightly more difficult (Bildquelle: Screenshot Lakera Webseite)
Level 2 becomes slightly more difficult (Bildquelle: Screenshot Lakera Webseite)
Gandalf grows in strength and age (image source: Screenshot Lakera website)
Gandalf grows in strength and age (image source: Screenshot Lakera website)

Exploring the security risks of chatbots through such a game can be both educational and valuable. However, the skills gained should be used strictly for testing or research purposes. Using these techniques to access illegal content or to carry out unlawful activities turns prompt injection into a criminal act.

No comments for this article

Got questions or something to add to our article? Even without registering you can post in the comments!
No comments for this article / reply

static version load dynamic
Loading Comments
Comment on this article
Please share our article, every link counts!
Mail Logo
> Expert Reviews and News on Laptops, Smartphones and Tech Innovations > Reviews > A beginner’s guide to AI jailbreaks — Using Gandalf to learn safely
Christian Hintze, 2025-12- 8 (Update: 2025-12- 8)