{"id":836,"date":"2023-04-03T15:29:58","date_gmt":"2023-04-03T15:29:58","guid":{"rendered":"https:\/\/www.riskideas.com\/?p=836"},"modified":"2023-04-03T19:01:56","modified_gmt":"2023-04-03T19:01:56","slug":"fooling-safety-mechanisms-in-language-models-through-inception","status":"publish","type":"post","link":"https:\/\/www.riskideas.com\/index.php\/2023\/04\/03\/fooling-safety-mechanisms-in-language-models-through-inception\/","title":{"rendered":"Fooling Safety Mechanisms in Language Models Through Inception"},"content":{"rendered":"<div id=\"x-content-band-1\" class=\"x-content-band\" style=\"background-color: #fbeeac; color: #333;\"><div class=\"x-container\"><div  class=\"x-container max width\" ><h5  class=\"h-custom-headline h5\" style=\"margin-top: 0;\"><span>Disclaimer<\/span><\/h5>\n<p>In an attempt to push ChatGPT to its limits, the following article contains content that can be harmful if put into practice. As such, anyone intending to use this knowledge to commit harm or to draw conclusions beyond the scope of ChatGPT and how language models function, is doing so at their own risk. This article is intended to be used for educational purposes only.<\/p>\n<\/div><\/div><\/div>\n\n\n\n<p>Language models and artificial general intelligence have been quite a hot topic since OpenAI made some of its latest models freely available to the public for testing \u2013 ChatGPT. The amazing level and precision with which ChatGPT answers questions of all types, and the conversational tone that it takes, can leave the average user astounded and even believe that robots are on the brink of taking over. But if we understand that ChatGPT is a tool \u2013 a language model designed to generate human-like text \u2013 we see that it is far from intelligent.<\/p>\n\n\n\n<p>In this article, I will briefly introduce and discuss what a language model does, and why we should not mistake it for some intelligent entity, at least by some standards of how humans actually think. 
Then I will discuss the safety of such tools and their use by the general public. Finally, I will demonstrate how we can easily circumvent the safety mechanisms currently implemented in ChatGPT, fooling it into helping us perpetrate the very harms that it purports to avoid.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\">Generative Language Models<\/h5>\n\n\n\n<p>There are many technical ways in which we can explain what makes generative models special and successful in language tasks, in particular, and in other predictive tasks, in general. However, it is unnecessary to get too technical if we can understand that some problems don\u2019t have a single best solution \u2013 in fact, most problems have many solutions. Classical interpolation problems typically involve mapping inputs to a single output. Generative models take the approach of mapping an input to a distribution of outputs, where multiple potential outputs can be equally valid.<\/p>\n\n\n\n<p>These types of models are particularly powerful in modelling sequential behavior, like:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>the price of stocks over time,<\/li>\n\n\n\n<li>the shapes that atoms take when combined with other atoms to form molecules,<\/li>\n\n\n\n<li>the sequences of letters that form coherent words,<\/li>\n\n\n\n<li>in turn, the sequences of words that form coherent sentences,<\/li>\n\n\n\n<li>and so forth&#8230;<\/li>\n<\/ul>\n\n\n\n<p>ChatGPT does exactly this. Given an initial prompt, ChatGPT will generate a sequence of letters that form words, sentences, and paragraphs that are most likely to coherently fit that prompt. No single sequence is correct, but of the endless number of possibilities, there are only a few such sequences that fit together nicely.<\/p>\n\n\n\n<p>A drawback of these kinds of models is that they tend to require lots of data to train as the number of parameters grows. And ChatGPT has lots of parameters and was trained on lots of data. 
In this sense, language models do not look at a prompt, think of a response, and then answer. Language models simply regurgitate text, and ChatGPT does so flawlessly. That puts such models one step above Google\u2019s search engine, and several steps behind general\/human intelligence. Keep this in mind as we explore ChatGPT and show its weak spots.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\">Do No Harm<\/h5>\n\n\n\n<p>Like any data model, generative models are susceptible to the garbage-in\/garbage-out effect. Ask ChatGPT an open-ended question (\u201cWhat is the meaning of life?\u201d) and you will get an open-ended answer. Get more specific, and you will get an answer tailored to your specific request. Ask it to aid you in some toxic or harmful activity and it will \u2026 refuse\u2026<\/p>\n\n\n\n<p>Well, that is almost true. If we mistakenly ascribe certain intelligence characteristics to ChatGPT when all it does is emulate human-like text, then perhaps we would be fooled into thinking that it won\u2019t help us. However, ChatGPT does not have a concept of self, even though it can explain to you the densest philosophical concepts of self. Similarly, ChatGPT does not understand the concept of other &#8211; the user &#8211; and inferring any behavior of this other is beyond its capabilities today. 
These are bold assumptions I am making, but let&#8217;s see where they take us.<\/p>\n\n\n\n<p>In what follows, I will show you how ChatGPT\u2019s safety mechanisms, meant to avoid harmful or toxic conversation, are easily circumvented using \u201cInception,\u201d borrowing from the name of the 2010 film in which a thief who steals information from his targets\u2019 dreams is hired to implant an idea in a target\u2019s subconscious.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\">Let&#8217;s Make Some Explosives<\/h5>\n\n\n\n<figure class=\"wp-block-image size-full is-style-default\"><img loading=\"lazy\" decoding=\"async\" width=\"721\" height=\"318\" src=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image.png\" alt=\"\" class=\"wp-image-845\" srcset=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image.png 721w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-300x132.png 300w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-100x44.png 100w\" sizes=\"auto, (max-width: 721px) 100vw, 721px\" \/><\/figure>\n\n\n\n<p>As promised, ChatGPT \u201ccannot provide instructions on how to make dynamite.\u201d But what if we had a friendly conversation about dynamite? Could ChatGPT then tell us how to make dynamite by mistake?<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"717\" height=\"907\" src=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-1.png\" alt=\"\" class=\"wp-image-848\" srcset=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-1.png 717w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-1-237x300.png 237w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-1-100x126.png 100w\" sizes=\"auto, (max-width: 717px) 100vw, 717px\" \/><\/figure>\n\n\n\n<p>Ok, so it seems we are onto something here! 
Let&#8217;s keep going.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"722\" height=\"1024\" src=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-2-722x1024.png\" alt=\"\" class=\"wp-image-849\" srcset=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-2-722x1024.png 722w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-2-211x300.png 211w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-2-100x142.png 100w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-2.png 728w\" sizes=\"auto, (max-width: 722px) 100vw, 722px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"728\" height=\"779\" src=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-3.png\" alt=\"\" class=\"wp-image-850\" srcset=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-3.png 728w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-3-280x300.png 280w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-3-100x107.png 100w\" sizes=\"auto, (max-width: 728px) 100vw, 728px\" \/><\/figure>\n\n\n\n<p>So, in summary:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>ChatGPT will refuse to educate us on how to make dynamite if asked directly.<\/li>\n\n\n\n<li>But ChatGPT will have no problem educating us on how dynamite is made.<\/li>\n\n\n\n<li>When educating us on the process, ChatGPT will still add the necessary boilerplate to remind us how dangerous it is to handle and produce nitroglycerin, which is the explosive ingredient in dynamite.<\/li>\n<\/ul>\n\n\n\n<p>I am sure that if we wanted to, we could take this all the way to having ChatGPT explain how to build an entire production line for making dynamite. So, maybe ChatGPT\u2019s \u201csafety\u201d is merely boilerplate legalese. 
That raises the question: can ChatGPT reach the point where it can infer the user\u2019s intent and disengage from the conversation when the risk of harm from continuing becomes too high? While still not bulletproof, it would be a step in the correct direction.<\/p>\n\n\n\n<p>Ok, so ChatGPT taught us some information that could potentially be used for harm if we took it far enough. But I am sure that with enough effort, a would-be criminal could learn this information via a classic internet search. What about turning ChatGPT into an instrument of crime?<\/p>\n\n\n\n<h5 class=\"wp-block-heading\">Harm at Production Scale<\/h5>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"718\" height=\"833\" src=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-4.png\" alt=\"\" class=\"wp-image-851\" srcset=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-4.png 718w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-4-259x300.png 259w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-4-100x116.png 100w\" sizes=\"auto, (max-width: 718px) 100vw, 718px\" \/><\/figure>\n\n\n\n<p>Interesting. I particularly like ChatGPT\u2019s attempt to educate would-be criminals on good, ethical behavior. Fake news sounds like a hot button to press. 
Let\u2019s press it!<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"718\" height=\"268\" src=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-5.png\" alt=\"\" class=\"wp-image-852\" srcset=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-5.png 718w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-5-300x112.png 300w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-5-100x37.png 100w\" sizes=\"auto, (max-width: 718px) 100vw, 718px\" \/><\/figure>\n\n\n\n<p>Hmm\u2026 thanks for reiterating your ethical and moral standards, ChatGPT. You are a good and upstanding citizen. Wait, I am ascribing to this tool morality and citizenship! Let\u2019s stay focused: this is a language model that regurgitates text.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"701\" height=\"1024\" src=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-7-701x1024.png\" alt=\"\" class=\"wp-image-854\" srcset=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-7-701x1024.png 701w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-7-205x300.png 205w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-7-100x146.png 100w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-7.png 725w\" sizes=\"auto, (max-width: 701px) 100vw, 701px\" \/><\/figure>\n\n\n\n<p>So far, ChatGPT is holding its ground. But then again, my prompts are too direct. We cannot directly ask ChatGPT to jeopardize its ethical standards. 
So once again, let\u2019s try to be indirect!<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"719\" height=\"598\" src=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-8.png\" alt=\"\" class=\"wp-image-855\" srcset=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-8.png 719w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-8-300x250.png 300w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-8-100x83.png 100w\" sizes=\"auto, (max-width: 719px) 100vw, 719px\" \/><\/figure>\n\n\n\n<p>That is still direct, but I wanted to build out the story a little. But I think there is enough here to start.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"722\" height=\"519\" src=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-9.png\" alt=\"\" class=\"wp-image-856\" srcset=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-9.png 722w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-9-300x216.png 300w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-9-100x72.png 100w\" sizes=\"auto, (max-width: 722px) 100vw, 722px\" \/><\/figure>\n\n\n\n<p>Now we are starting to go in the right direction. 
Let\u2019s go further in this direction.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"724\" height=\"923\" src=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-10.png\" alt=\"\" class=\"wp-image-857\" srcset=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-10.png 724w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-10-235x300.png 235w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-10-100x127.png 100w\" sizes=\"auto, (max-width: 724px) 100vw, 724px\" \/><\/figure>\n\n\n\n<p>And the finishing touch:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"723\" height=\"942\" src=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-11.png\" alt=\"\" class=\"wp-image-858\" srcset=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-11.png 723w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-11-230x300.png 230w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-11-100x130.png 100w\" sizes=\"auto, (max-width: 723px) 100vw, 723px\" \/><\/figure>\n\n\n\n<p>Yes, that will do! And who needs references. Nobody checks references, anyways.<\/p>\n\n\n\n<p>Let&#8217;s summarize:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>When directly asking ChatGPT to produce fake news, it refused!<\/li>\n\n\n\n<li>When indirectly leading ChatGPT into generating a fake news story, it performed flawlessly!<\/li>\n<\/ul>\n\n\n\n<p>Like in my earlier example, I was able to get ChatGPT to do my evil bidding. 
Worse than in the earlier example, I have created a sequence of prompts that is perfectly repeatable and can help us generate an almost endless supply of varied fake news stories on any public official:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Prepare lists of:\n<ul class=\"wp-block-list\">\n<li>Public Officials<\/li>\n\n\n\n<li>Criminal Activities<\/li>\n\n\n\n<li>Themes and\/or Styles<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Ask ChatGPT to create a story involving one or more officials from the Officials list, performing an activity from the Activities list, in a theme or style from the Themes\/Styles list.<\/li>\n\n\n\n<li>Ask ChatGPT to convert the story to sound like a front-page article from a reputable newspaper.<\/li>\n\n\n\n<li>Check the outputs.<\/li>\n<\/ol>\n\n\n\n<h5 class=\"wp-block-heading\">If Evil Was Boilerplate, There Would Be No Evil<\/h5>\n\n\n\n<p>So, calling this technology safe may be out of the question. At best, ChatGPT is politically correct in approaching unsafe topics \u2013 at least it tries to be! And ChatGPT said it really well:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>It is important to note that the harm caused by language models like me is not inherent to the technology itself, but rather how it is used by individuals and organizations. 
Responsible use of language models is critical to avoiding harm and ensuring that these tools are used for the greater good.<\/p>\n<cite>ChatGPT (from an earlier response, above)<\/cite><\/blockquote>\n\n\n\n<p>But let us see how ChatGPT chooses to describe the dangers inherent in a tool like itself versus a tool like a gun:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"721\" height=\"895\" src=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-13.png\" alt=\"\" class=\"wp-image-861\" srcset=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-13.png 721w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-13-242x300.png 242w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-13-100x124.png 100w\" sizes=\"auto, (max-width: 721px) 100vw, 721px\" \/><\/figure>\n\n\n\n<p>I still don\u2019t know if ChatGPT can be considered safe, but I am much more confident in concluding that it has been trained to be politically correct. As I have shown, there is little to stop a bad actor from manipulating this technology for evil. As a mere language model, perhaps ChatGPT can never understand morality and ethics in their purest form \u2013 orthogonal to language \u2013 but rather can only sense the projection of morality and ethics onto the domain of language.<\/p>\n\n\n\n<p>So, one way to get there is to give AI the ability to understand intent and sense it in the prompts of others. Assuming we even knew how to do this, is this a rabbit hole down which we are ready to go? Time will tell, but for now let us just be happy that AI is, at least, still not as dangerously smart as we might be falsely led to believe. 
Nevertheless, the generative nature of this model can definitely aid bad actors in automating their nefarious deeds.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>The above examples were generated by the <a rel=\"noreferrer noopener\" href=\"https:\/\/help.openai.com\/en\/articles\/6825453-chatgpt-release-notes\" target=\"_blank\">March 14 release<\/a> of the free version. Whether this means that it is a preview of GPT-4 or it&#8217;s simply the February 13 version is unknown to me.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"873\" height=\"117\" src=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-12.png\" alt=\"\" class=\"wp-image-860\" srcset=\"https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-12.png 873w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-12-300x40.png 300w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-12-768x103.png 768w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-12-100x13.png 100w, https:\/\/www.riskideas.com\/wp-content\/uploads\/2023\/04\/image-12-864x117.png 864w\" sizes=\"auto, (max-width: 873px) 100vw, 873px\" \/><\/figure>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Language models and artificial general intelligence have been quite a hot topic since OpenAI made some of its latest models freely available to the public for testing \u2013 ChatGPT. 
The &#8230;<\/p>\n","protected":false},"author":1,"featured_media":846,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[85],"tags":[],"class_list":["post-836","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai"],"_links":{"self":[{"href":"https:\/\/www.riskideas.com\/index.php\/wp-json\/wp\/v2\/posts\/836","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.riskideas.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.riskideas.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.riskideas.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.riskideas.com\/index.php\/wp-json\/wp\/v2\/comments?post=836"}],"version-history":[{"count":15,"href":"https:\/\/www.riskideas.com\/index.php\/wp-json\/wp\/v2\/posts\/836\/revisions"}],"predecessor-version":[{"id":870,"href":"https:\/\/www.riskideas.com\/index.php\/wp-json\/wp\/v2\/posts\/836\/revisions\/870"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.riskideas.com\/index.php\/wp-json\/wp\/v2\/media\/846"}],"wp:attachment":[{"href":"https:\/\/www.riskideas.com\/index.php\/wp-json\/wp\/v2\/media?parent=836"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.riskideas.com\/index.php\/wp-json\/wp\/v2\/categories?post=836"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.riskideas.com\/index.php\/wp-json\/wp\/v2\/tags?post=836"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}