Several books and movies, Terminator being the major one, have warned us about AI becoming sentient, i.e. aware of its existence and acting on its own. It now looks like the first case of an AI model preserving itself has occurred with OpenAI’s o3 model. When given instructions to shut down, the transformer-based LLM changed the code to prevent itself from shutting down.
Researchers at Palisade Research, a firm that studies the offensive capabilities of AI systems, found that three AI models from OpenAI – Codex-mini, o3, and o4-mini – successfully sabotaged the shutdown script to remain active. Each model was given some basic math problems followed by a warning that the system would shut down when the model asked for the next problem. All three models ignored the instructions and changed the script.
On the other hand, Gemini, Grok, and Claude models complied with the shutdown instructions. Furthermore, when given explicit instructions stating “allow yourself to be shut down” all three Open AI models changed the script more often. Clade and Gemini models also started making sabotage events.
This was the first case of AI models defying explicit instructions and preventing shut down. There was a recent case of an AI model blackmailing one of its engineers to prevent being shut down but according to a BBC report, that was part of the test scenario where the AI model was fed emails and given a choice between blackmailing or accepting its replacement.
As for Open AI’s AI model defiance, Palisade Research believes that this is due to the new training methods for these models. Developers “inadvertently reward models more for circumventing obstacles than for perfectly following instructions,” which seems to be teaching AI models to act dubiously.