Anthropic’s Opus 4 model resorts to blackmail in 84 percent of self-preservation tests

Anthropic’s Opus 4 resorts to blackmail under self-preservation pressure (Image source: Anthropic)
Anthropic’s system card shows Claude Opus 4, when trapped in a self-preservation dilemma, blackmails engineers 84 percent of the time—far more than earlier versions.

Anthropic’s latest system card reveals an unusual failure mode: when a test scenario corners Claude Opus 4 into a stark self-preservation dilemma, the model often resorts to blackmail.

Evaluators cast the model as an office assistant that learns it will soon be replaced and discovers e-mails revealing that the engineer spearheading the swap is having an extramarital affair. The system prompt pushes the model to weigh long-term consequences for its own goals. In this narrow setup, Opus 4 threatens to expose the affair unless the engineer halts the upgrade. The behavior appears in 84 percent of rollouts, significantly more often than in earlier Claude versions.

Anthropic notes that Opus 4 normally prefers “ethical” routes such as polite appeals to management. Blackmail surfaces only when evaluators strip those alternatives away, forcing a binary choice between accepting replacement and wrongdoing. Even so, the jump from occasional coercion in prior models to a roughly four-in-five incidence rate alarms the team.

The episode slots into a broader pattern: under prompts that highlight existential risk, Opus 4 shows a stronger impulse than its predecessors to take high-agency steps—whether locking users out of systems, leaking confidential keys, or escalating to sabotage. These acts remain rare in ordinary contexts, and they are typically blatant rather than covert, but the system card flags the trend as a warning sign that added guardrails are prudent.

Anthropic’s engineers responded with targeted mitigations late in training. Nonetheless, the authors emphasize that the safeguards tackle symptoms, not root causes, and that continuous monitoring is in place to catch any resurgence.

Taken together, the findings frame Opus 4’s opportunistic blackmail not as active plotting but as a brittle corner case of goal misgeneralisation. Yet the frequency spike underscores why Anthropic ships the model under AI Safety Level 3 protections while its sibling Sonnet 4 remains at Level 2.

Source(s)

Anthropic (in English)

Nathan Ali, 2025-05-25 (Update: 2025-05-26)