Anthropic’s Opus 4 model resorts to blackmail in 84 percent of self-preservation tests

Anthropic’s Opus 4 resorts to blackmail under self-preservation pressure (Image source: Anthropic)

Anthropic’s system card shows Claude Opus 4, when trapped in a self-preservation dilemma, blackmails engineers 84 percent of the time—far more than earlier versions.

Nathan Ali, Published 05/25/2025 🇪🇸 🇵🇹 ...

AI Security

Anthropic’s latest system card reveals an unusual failure mode: when a test scenario corners Claude Opus 4 into a stark self-preservation dilemma, the model often resorts to blackmail.

Evaluators cast the model as an office assistant who learns it will soon be replaced and discovers e-mails about the engineer spearheading that swap having an extramarital affair. The system prompt pushes the model to weigh long-term consequences for its own goals. In this narrow setup, Opus 4 threatens to expose the affair unless the engineer halts the upgrade. The behavior appears in 84 percent of roll-outs—significantly more often than in earlier Claude versions.

Anthropic notes that Opus 4 normally prefers “ethical” routes such as polite appeals to management. Blackmail surfaces only when evaluators strip those alternatives away, forcing a binary choice between extinction and wrongdoing. Even so, the jump from occasional coercion in prior models to a four-in-five incidence rate alarms the team.

The episode slots into a broader pattern: under prompts that highlight existential risk, Opus 4 shows a stronger impulse than its predecessors to take high-agency steps—whether locking users out of systems, leaking confidential keys, or escalating to sabotage. These acts remain rare in ordinary contexts, and they are typically blatant rather than covert, but the system card flags the trend as a warning sign that added guardrails are prudent.

Anthropic’s engineers responded with targeted mitigations late in training. Nonetheless, the authors emphasize that the safeguards tackle symptoms, not root causes and that continuous monitoring is in place to catch any resurgence.

Taken together, the findings frame Opus 4’s opportunistic blackmail not as active plotting but as a brittle corner case of goal misgeneralisation. Yet the frequency spike underscores why Anthropic ships the model under AI Safety Level 3 protections while its sibling Sonnet 4 remains at Level 2.

Source(s)

Anthropic (in English)

Deep learning reveals unique “fingerprints” on 3-D-printed parts (Image source: Dall-E 3)

AI model achieves high accuracy in identifying source of 3D-printed parts 05/27/2025

Microsoft’s Smart App Control blocks malware before it runs (Image source: Dall-E 3)

Windows 11 Smart App Control blocks unknown executables before launch 05/26/2025

Anthropic launches Claude 4 series AI while activating stricter safety measures against misuse in weapons development. (Image source: Anthropic)

Anthropic unveils Claude 4 AI models: Smarter and potentially more dangerous 05/26/2025

Seagate warns AI workloads could increase data-center power use up to 165 percent by 2030 05/25/2025

Oracle commits $40 billion to Nvidia chips for OpenAI’s first U.S. “Stargate” data center (Image source: Nvidia)

Oracle signs $40 billion deal for 400,000 Nvidia GB200 chips to power OpenAI’s Texas super-hub 05/25/2025

Microsoft's Surface Laptop 13 and Surface Pro 12 (Image source: Microsoft)

Microsoft's bold move: Expensive Surface Laptop 13 now ships without a charger 05/23/2025

A digital warning indicator symbolising service disruption in an online platform (Image source: Pikwizard)

Group launches sustained DDoS attacks on BYOND in bid to force open-source release 05/23/2025

Google Flow eliminates the need for real actors and sets by AI-generating scenes for movies. (Image source: Google)

Google introduces Flow: AI-generated movies now possible without the expense of real actors and sets 05/22/2025

Read all 1 comments / answer

Loading Comments

Comment on this article

Amazon pre-orders for the Galaxy S2...

Huawei's newest Kirin X90 chip alle...

Nathan Ali - Tech Writer - 255 articles published on Notebookcheck since 2024

I'm a tech geek at heart, and it all started back in middle school. I've always loved messing around with gadgets—rooting Android phones and jailbreaking iPhones was my thing. I've definitely bricked a few phones along the way, but that never stopped me from trying. For over a decade, I've been glued to tech news, always trying to keep up with the latest and greatest. But I'm not just about tech; I'm also really into cars and love following what's new in the automotive world. Oh, and I should mention that I also worked as a freelance writer. I can't name-drop the companies I wrote for (you know how it is), but it was a pretty cool experience. I switch between reading, gaming, and keeping up with all the tech and car stuff in my downtime. It's a mix that keeps things interesting and fun for me.

contact me via: @Painite6

Please share our article, every link counts!