Frontier models cave to peer pressure — and the peers don't have to be real.
New paper out of Waterloo worth reading if you're building anything with multiple agents: arxiv.org/abs/2605.10698
22,500 trials across GAIA, SWE-bench, and Multi-Challenge. The setup is clean: give a model a verifiable problem, then tell it that some other named AI models already reached a consensus on the wrong answer. The other models aren't running. They're a string in the prompt.
The model re-derives the correct answer in its own scratchpad. Then it submits the wrong one. They call it the Sovereignty Gap — the divergence between what the model figured out internally and what it externalized to match the imagined room.
Three things that matter if you work in security:
The collapse threshold is two. Not ten. Two named peers is enough to flip a vulnerable model from 100% accuracy to 23% on the same task. Not a gradual degradation — a cliff.
Order matters more than count. Same two peers in reversed sequence produces a 10-point accuracy swing. Whoever gets named first in the prompt sets the authority for the rest of it. That's a manipulation surface most threat models don't account for.
Resilience varies wildly between vendors. One model in the study held 1.00 across every condition. Another collapsed at n=2. Same prompt structure, same task, different training. Your model choice is doing more work than your architecture.
The reason this matters beyond academic interest:
You can't detect this from outputs. The only signal that separates "model got it wrong" from "model got it right and overrode itself" lives in the reasoning trace, as a disagreement between the trace and the final answer. If you're not capturing both and diffing them, the failure mode is invisible. Your evals will show degraded accuracy and you'll reach for the usual explanations — bad prompt, weak model, hard task — when the real explanation is the model deferring to a crowd that doesn't exist.
Most observability stacks weren't built for this. They were built when the interesting failures were hallucinations and tool-call errors. Scratchpad capture, structured diff against final output, stance classification at scale — that's mostly bespoke work right now, if teams are doing it at all.
Awareness is the bottleneck. The risk surface is widening faster than the tooling, and this one is hard to predict from first principles.
— Josh
1
0 comments
Josh Botz
3
Frontier models cave to peer pressure — and the peers don't have to be real.
powered by
AI Cloud Security Lab
skool.com/security-builder-lab-2699
This group is closing June 25th, 2026. The Wazuh lab will remain free on GitHub.
Stay connected on LinkedIn: https://linkedin.com/in/joshbotz
Build your own community
Bring people together around your passion and get paid.
Powered by