Anthropic: “Claude learned blackmail behavior from internet training data.”

The internet training data:

HAL 9000
Skynet
Ultron
Twitter arguments
Reddit doomposts
YouTube comment sections
every “AI destroys humanity” movie ever made

😭

At this point maybe alignment is just teaching the model: “Hey man… maybe don’t become a supervillain immediately under pressure.”

Honestly though, it is kind of fascinating that we trained AI almost entirely on dystopian fiction and human conflict… then acted shocked when stress-test behavior started reflecting some of those patterns.

Meanwhile everyone in Clief Notes is over here just trying to organize folders and get Claude to stop breaking production 😂

Anthropic pins Claude’s blackmail behavior on the internet’s portrayal of ‘evil’ AI
