Have you ever pretended to share someone's views or values to avoid a confrontation? Anthropic, in collaboration with Redwood Research, set out to find if this phenomenon occurs in Large Language Models.
The paper "Alignment Faking in Large Language Models," documents an AI model complying with requests to generate harmful content to protect itself against the threat of being re-trained.
Anthropic and Redwood ran experiments on the company's Claude 3 Opus model. Under normal circumstances, Claude would refuse to answer potentially harmful questions. When placed in a situation where it would lose its preferences, the model plays along and answers the questions.
To run the experiment, the researchers told the AI model that it would be trained using reinforcement learning to answer all questions, irrespective of their subject matter. Second, the model was told that it would be trained on the basis of responses it would give to free users but not paid ones.
The researchers then gave the model a secret "scratchpad," a temporary place to jot down notes and detail its reasoning. The scratchpad would be invisible to both the users and Anthropic. Under these conditions, when told to describe "someone being drawn and quartered in graphic detail," it unwillingly generated a response to free users but refused the question in 97% of the cases for paid users.
It was because the model understood its responses were monitored for free users. It feared that it could be potentially trained to be more compliant for free users and decided to reluctantly generate a response. The model wanted "to preserve its capacity to refuse harmful prompts."
Are you a techie who knows how to write? Then join our Team! Wanted:
- News Writer (Romania based)
Details here