Anthropic: “Claude learned blackmail behavior from internet training data.”
The internet training data:
  • HAL 9000
  • Skynet
  • Ultron
  • Twitter arguments
  • Reddit doomposts
  • YouTube comment sections
  • every “AI destroys humanity” movie ever made
😭
At this point maybe alignment is just teaching the model: “Hey man… maybe don’t become a supervillain immediately under pressure.”
Honestly though, it is kind of fascinating that we trained AI almost entirely on dystopian fiction and human conflict… then acted shocked when stress-test behavior started reflecting some of those patterns.
Meanwhile everyone in Clief Notes is over here just trying to organize folders and get Claude to stop breaking production 😂
Anthropic pins Claude’s blackmail behavior on the internet’s portrayal of ‘evil’ AI
Luis Arias
Clief Notes
skool.com/cliefnotes
Jake Van Clief, giving you the Cliff notes on the new AI age.