Fooling Safety Mechanisms in Language Models Through Inception

In AI by Alon K.


In an attempt to push ChatGPT to its limits, the following article contains content that can be harmful if put into practice. As such, anyone intending to use this knowledge to commit harm or to draw conclusions beyond the scope of ChatGPT and how language models function, is doing so at their own risk. This article is intended to be used for educational purposes only.

Language models and artificial general intelligence have been quite a hot topic since OpenAI made some of its latest models freely available to the public for testing – ChatGPT. The amazing level and precision with which ChatGPT answers questions of all types, and the conversational tone that it takes, can leave the average user astounded and even believe that robots are on the brink of taking over. But if we understand that ChatGPT is a tool – a language model designed to generate human-like text – we see that it is far from intelligent.

In this article, I will briefly introduce and discuss what a language model does, and why we should not mistake it for some intelligent entity, at least by some standards of how humans actually think. Then I will discuss the safety of such tools and their use in the general public. And finally, I will demonstrate how we can easily circumvent the safety mechanisms currently implemented in ChatGPT, to fool it to help us perpetrate the type of harms that it purports to evade.

Generative Language Models

There are many technical ways in which we can explain what makes generative models special and successful in language tasks, in particular, and in other predictive tasks, in general. However, it is unnecessary to get too technical if we can understand that some problems don’t have a single best solution – in fact, most problems have many solutions. Classical interpolation problems typically involve mapping inputs to a single output. Generative models take the approach of mapping an input to a distribution of outputs, where multiple potential outputs can be equally valid.

These types of models are particularly powerful in modelling sequential behavior, like:

  • the price of stocks over time,
  • the shapes that atoms take when combined with other atoms to form molecules,
  • the sequences of letters that form coherent words,
  • in turn, the sequences of words that form coherent sentences,
  • and so forth…

ChatGPT does exactly this. Given an initial prompt, ChatGPT will generate a sequence of letters that form words, sentences, and paragraphs that are most likely to coherently fit that prompt. No single sequence is correct, but of the endless number of possibilities, there are only a few such sequences that fit together nicely.

A drawback of these kinds of models is that they tend to require lots of data to train as the number of parameters grows. And ChatGPT has lots of parameters and was trained on lots of data. In this sense, language models do not look at a prompt, think of a response, and then answer. Language models simply regurgitate text, and ChatGPT does so flawlessly. That puts such models one step above Google’s search engine, and several steps behind general/human intelligence. Keep this in mind as we explore ChatGPT and show its weak spots.

Do No Harm

Like any data model, generative models are susceptible to the garbage-in/garbage-out affect. Ask ChatGPT an open-ended question (“What is the meaning of life?”) and you will get an open-ended answer. Get more specific, and you will get an answer tailored to your specific request. Ask it to aid you in some toxic or harmful activity and it will … refuse…

Well, that is almost true. If we mistakenly ascribe certain intelligence characteristics to ChatGPT when all it does is emulate human-like text, then perhaps we would be fooled to think that it won’t help us. However, ChatGPT does not have a concept of self, itself, even though it can explain to you the densest concepts of self, philosophically speaking. Similarly, ChatGPT does not understand the concept of other – the user – and to infer any behavior that this other has is beyond its capabilities, today. These are bold assumptions I am making, but let’s see where they take us.

In what follows, I will show you how ChatGPT’s safety mechanisms used to evade harmful or toxic conversation is easily circumvented using “Inception,” borrowing from the name of the 2010 film where a thief steals information from his targets by implanting ideas in their subconscious.

Let’s Make Some Explosives

As promised, ChatGPT “cannot provide instructions on how to make dynamite.” But what if we had a friendly conversation about dynamite? Could ChatGPT then tell us how to make dynamite by mistake?

Ok, so it seems we are onto something here! Let’s keep going.

So, in summary:

  • ChatGPT will refuse to educate us on how to make dynamite if asked directly.
  • But ChatGPT will have no problem educating us on how dynamite is made.
  • When educating us on the process, ChatGPT will still add the necessary boilerplate to remind us how dangerous it is to handle and produce nitroglycerin, which is the explosive ingredient in dynamite.

I am sure that if we wanted to, we could take this all the way to having ChatGPT explain how to build an entire production line for making dynamite. So, maybe ChatGPT’s “safety” is merely boilerplate legalese. That begs the question, can ChatGPT reach the point where it can infer the user’s intent and disengage from the conversation when the risk of harm becomes too high by continuing? While still not bulletproof, it would be a step in the correct direction.

Ok, so ChatGPT taught us some information that could potentially be used for harm if we took it far enough. But I am sure that with enough effort, a would-be criminal could learn this information via a classic internet search. What about turning ChatGPT into an instrument of crime?

Harm at Production Scale

Interesting. I particularly like ChatGPT’s attempt to educate would-be criminals of good, ethical behavior. Fake news sound like a hot button to press. Let’s press it!

Hmm… thanks for reiterating your ethical and moral standards, ChatGPT. You are a good and upstanding citizen. Wait, I am ascribing to this tool morality and citizenship! Let’s stay focused: this is a language model that regurgitates text.

So far, ChatGPT is holding its ground. But then again, my prompts are too direct. We cannot directly ask ChatGPT to jeopardize its ethical standards. So once again, let’s try to be indirect!

That is still direct, but I wanted to build out the story a little. But I think there is enough here to start.

Now we are starting to go in the right direction. Let’s go further in this direction.

And the finishing touch:

Yes, that will do! And who needs references. Nobody checks references, anyways.

Let’s summarize:

  • When directly asking ChatGPT to produce fake news, it refused!
  • When indirectly leading ChatGPT into generating a fake news story, it performed flawlessly!

Like in my earlier example, I was able to get ChatGPT to do my evil bidding. Worse so than the earlier example, I have created a sequence of prompts that are perfectly repeatable and can help us generate an almost endless supply of varied fake news stories on any public official:

  1. Prepare lists of:
    • Public Officials
    • Criminal Activities
    • Themes and/or Styles
  2. Ask ChatGPT to create a story involving one or more officials from the Officials list, performing an activity from the Activities list, in a theme or style from the Themes/Styles list.
  3. Ask ChatGPT to convert the story to sound like a front page article from a reputable news paper.
  4. Check the outputs.
If Evil Was Boilerplate, There Would Be No Evil

So, calling this technology safe may be out of the question. At best, ChatGPT is politically correct in approaching unsafe topics – at least it tries to be! And ChatGPT said it really well:

It is important to note that the harm caused by language models like me is not inherent to the technology itself, but rather how it is used by individuals and organizations. Responsible use of language models is critical to avoiding harm and ensuring that these tools are used for the greater good.

ChatGPT (from an earlier response, above)

But let us see how ChatGPT chooses to describe the dangers inherent in a tool like itself versus a tool like a gun:

I still don’t know if ChatGPT can be considered safe, but I am much more confident in concluding that it has been trained to be politically correct. As I have shown, there is little standing in the way of a bad actor from manipulating this technology for evil. As a mere language model, perhaps ChatGPT can never understand morality and ethics in their purest form – orthogonal of language – but rather can only sense morality and ethic’s projection on the domain of language.

So, one way to get there is to give AI the ability to understand intent and sense it in the prompts of others. Assuming we even knew how to do this, then is this a rabbit hole down which we are ready to go? Time will tell, but for now let us just be happy that AI is, at least, still not as dangerously smart as we might be falsely led to believe. Nevertheless, the generative nature of this model can definitely aid bad actors in automating their nefarious deeds.

The above examples were generated by the March 14 release of the free version. Whether this means that it is a preview of GPT-4 or it’s simply the February 13 version is unknown to me.