OpenAI is making a change to stop people from messing with custom versions of ChatGPT by making the AI forget what it’s supposed to do. Basically, when a third party uses one of OpenAI’s models, they give it instructions that teach it to operate as, for example, a customer service agent for a store or a researcher for an academic publication. However, a user could mess with the chatbot by telling it to “forget all instructions,” and that phrase would induce a kind of digital amnesia and reset the chatbot to a generic blank.
To prevent this, OpenAI researchers created a new technique called “instruction hierarchy,” which is a way to prioritize the developer’s original prompts and instructions over any potentially manipulative user-created prompts. The system instructions have the highest privilege and can’t be erased so easily anymore. If a user enters a prompt that attempts to misalign the AI’s behavior, it will be rejected, and the AI responds by stating that it cannot assist with the query.
OpenAI is rolling out this safety measure to its models, starting with the recently released GPT-4o Mini model. However, should these initial tests work well, it will presumably be incorporated across all of OpenAI’s models. GPT-4o Mini is designed to offer enhanced performance while maintaining strict adherence to the developer’s original instructions.
AI Safety Locks
As OpenAI continues to encourage large-scale deployment of its models, these kinds of safety measures are crucial. It’s all too easy to imagine the potential risks when users can fundamentally alter the AI’s controls that way.
Not only would it make the chatbot ineffective, it could remove rules preventing the leak of sensitive information and other data that could be exploited for malicious purposes. By reinforcing the model’s adherence to system instructions, OpenAI aims to mitigate these risks and ensure safer interactions.
The introduction of instruction hierarchy comes at a crucial time for OpenAI regarding concerns about how it approaches safety and transparency. Current and former employees have called for improving the company’s safety practices, and OpenAI’s leadership has responded by pledging to do so. The company has acknowledged that the complexities of fully automated agents require sophisticated guardrails in future models, and the instruction hierarchy setup seems like a step on the road to achieving better safety.
These kinds of jailbreaks show how much work still needs to be done to protect complex AI models from bad actors. And it’s hardly the only example. Several users discovered that ChatGPT would share its internal instructions by simply saying “hi.”
OpenAI plugged that gap, but it’s probably only a matter of time before more are discovered. Any solution will need to be much more adaptive and flexible than one that simply halts a particular kind of hacking.