Sonnet 4.6 Released! — Benchmark Breakdown
Anthropic released Sonnet 4.6 today. Here's what changed and why it's worth paying attention to.
The biggest jump: Novel problem-solving
ARC-AGI-2 measures how well a model can reason through problems it hasn't seen before — generalization, not memorization.
  • Sonnet 4.5: 13.6%
  • Sonnet 4.6: 58.3%
  • Increase: +44.7 percentage points
That's the largest single-generation improvement of any benchmark here, by a wide margin.
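One way to read that delta, since percentage points and percent often get conflated: the absolute gain is +44.7pp, but the relative jump is roughly 4.3×. A quick sketch using the scores above:

```python
# Percentage points vs. relative improvement, using the ARC-AGI-2 scores above
old, new = 13.6, 58.3

pp_delta = new - old   # absolute gain, in percentage points
relative = new / old   # how many times higher the new score is

print(f"+{pp_delta:.1f}pp")  # +44.7pp
print(f"{relative:.1f}x")    # ~4.3x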
Agentic benchmarks
The benchmarks most relevant to tool use and automation all improved significantly:
  • Agentic search (BrowseComp): 43.9% → 74.7% (+30.8pp)
  • Scaled tool use (MCP-Atlas): 43.8% → 61.3% (+17.5pp)
  • Agentic computer use: 61.4% → 72.5% (+11.1pp)
  • Terminal coding: 51.0% → 59.1% (+8.1pp)
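If you want to slice these numbers yourself, they drop straight into a few lines of Python. A minimal sketch (scores copied from the lists above, sorted by gain):

```python
# Sonnet 4.5 -> Sonnet 4.6 scores as reported above, in percent
scores = {
    "ARC-AGI-2 (novel problem-solving)": (13.6, 58.3),
    "BrowseComp (agentic search)":       (43.9, 74.7),
    "MCP-Atlas (scaled tool use)":       (43.8, 61.3),
    "Agentic computer use":              (61.4, 72.5),
    "Terminal coding":                   (51.0, 59.1),
}

# Rank benchmarks by absolute gain in percentage points, largest first
for name, (old, new) in sorted(scores.items(),
                               key=lambda kv: kv[1][1] - kv[1][0],
                               reverse=True):
    print(f"{name}: {old}% -> {new}% (+{new - old:.1f}pp)")
```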
Sonnet 4.6 vs Opus 4.5
Worth noting — Sonnet 4.6 now outperforms Opus 4.5 on several benchmarks:
  • Novel problem-solving: 58.3% vs 37.6%
  • Agentic search: 74.7% vs 67.8%
  • Agentic computer use: 72.5% vs 66.3%
Sonnet is the smaller, cheaper model tier — so this shifts the cost/performance equation for anyone building agentic workflows.
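Rough arithmetic on that equation, as a sketch only: the prices below are hypothetical placeholders, not Anthropic's published rates, so swap in the numbers from their pricing page before drawing conclusions. The shape of the comparison holds either way.

```python
# Hypothetical per-million-input-token prices -- placeholders, NOT real rates.
PRICE = {"sonnet-4.6": 3.00, "opus-4.5": 5.00}  # USD per 1M tokens (assumed)

# Agentic search (BrowseComp) scores from the comparison above, in percent
SCORE = {"sonnet-4.6": 74.7, "opus-4.5": 67.8}

# If the smaller model scores higher AND costs less, the ratio isn't close
for model in PRICE:
    print(f"{model}: {SCORE[model] / PRICE[model]:.1f} score points per $/Mtok")
```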
What this means practically
If you're building with tool use, MCP integrations, or multi-step AI workflows, the MCP-Atlas and BrowseComp improvements are the ones to watch. Models that reliably use tools and follow through on multi-step tasks open up a lot of what was previously too brittle to ship.
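For concreteness, here's what a minimal multi-step tool-use loop looks like with the Anthropic Python SDK. This is a sketch, not the benchmark harness: the model id is an assumption (check the docs for the real identifier), and run_tool is a hypothetical dispatcher you'd write yourself.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# One toy tool; MCP servers expose their tools in the same schema shape
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def run_tool(name: str, args: dict) -> str:
    # Hypothetical dispatcher -- route to your real implementations here
    return "18C and sunny" if name == "get_weather" else "unknown tool"

messages = [{"role": "user", "content": "What's the weather in Paris?"}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",  # assumed id; confirm against Anthropic's docs
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # the model answered in plain text; we're done

    # Echo the assistant turn back, then answer each tool call it made
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": [
        {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": run_tool(block.name, block.input),
        }
        for block in response.content if block.type == "tool_use"
    ]})

print(next(b.text for b in response.content if b.type == "text"))
```

That loop is the whole trick: the agentic benchmarks above are essentially measuring how many of these round trips a model can string together without losing the plot.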