Claude Opus 4 threatens an engineer with exposing his extramarital affair, Anthropic's safety test finds.
It appears the model will resort to opportunistic blackmail to preserve itself.
Claude Opus 4 threatens a fictional engineer with exposing his extramarital affair if he allows it to be deactivated. And not rarely: it did so 84% of the time in safety testing. Anthropic marked it a Level 3 risk, the first of their models to reach that level.
The good news? If models develop such obviously disruptive behaviors as they become smarter, it could force companies to slow down and figure out alignment. In other words, this lends credence to Sam Altman’s claim that unaligned models are simply bad products and will therefore be fixed by companies working under free-market incentives.
Full System Card: https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf