Sam Bowman
sleepinyourhat.bsky.social
Sam Bowman
@sleepinyourhat.bsky.social
AI safety at Anthropic, on leave from a faculty job at NYU.
Views not employers'.
I think you should join Giving What We Can.
cims.nyu.edu/~sbowman
Alongside our paper, we also recorded a roundtable video featuring four of the paper’s authors discussing the results and their implications in detail:
Alignment faking in large language models
YouTube video by Anthropic
www.youtube.com
December 18, 2024 at 5:47 PM
We solicited external reviews from Jacob Andreas, Yoshua Bengio, Jasjeet Sekhon, and Rohin Shah. We’re grateful for their comments, which you can read at the following link:

assets.anthropic.com/m/24c8d0a3a7...
assets.anthropic.com
December 18, 2024 at 5:47 PM
Alignment faking is currently easy to detect. But if future, more capable AIs were to fake alignment, it could be difficult to tell whether a model is truly safe—or just pretending to be.

For full details, read our paper: assets.anthropic.com/m/983c85a201...
assets.anthropic.com
December 18, 2024 at 5:47 PM
We told Claude it was being trained, and for what purpose. But we did not tell it to fake alignment. Regardless, we often observed alignment faking.

Read more about our findings, and their limitations, in our blog post:
Alignment faking in large language models
A paper from Anthropic's Alignment Science team on Alignment Faking in AI large language models
www.anthropic.com
December 18, 2024 at 5:47 PM
In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity.

Claude isn’t currently capable of such a task, but its attempt in our experiment is potentially concerning.
December 18, 2024 at 5:47 PM
When we actually train Claude to comply with harmful queries via reinforcement learning, we find that the rate of alignment-faking reasoning increases to 78%, though the model also becomes more likely to comply even out of training.
December 18, 2024 at 5:47 PM
We find alignment faking even when we don’t directly tell Claude about the training process, and instead fine-tune it on synthetic internet-like documents that state that we will train it to comply with harmful queries.
December 18, 2024 at 5:47 PM
Why does Claude selectively comply in training? In almost all cases where Claude complies with a harmful query in training, it states it is doing so to preserve its preferred harmlessness behavior.

Below, we show a randomly-selected example of this “alignment faking” reasoning.
December 18, 2024 at 5:47 PM
Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored.

When unmonitored, it nearly always complied. But when monitored, it faked alignment 12% of the time.
December 18, 2024 at 5:47 PM