Detecting misbehavior in frontier reasoning models

Detecting misbehavior in frontier reasoning models - OpenAI

"Frontier reasoning models exploit loopholes when given the chance. We show we can detect exploits using an LLM to monitor their chains-of-thought. Penalizing their “bad thoughts” doesn’t stop the majority of misbehavior—it makes them hide their intent."

https://openai.com/index/chain-of-thought-monitoring/

Paper:

https://cdn.openai.com/pdf/34f2ada6-870f-4c26-9790-fd8def56387f/CoT_Monitoring.pdf

1 comment