The internet training data:
- HAL 9000
- Skynet
- Ultron
- Twitter arguments
- Reddit doomposts
- YouTube comment sections
- every “AI destroys humanity” movie ever made
😭
At this point maybe alignment is just teaching the model: “Hey man… maybe don’t become a supervillain the second you’re under pressure.”
Honestly though, it’s kind of fascinating that we trained AI almost entirely on dystopian fiction and human conflict… and then acted shocked when stress-test behavior started reflecting some of those patterns.