Anthropic released Sonnet 4.6 today. Here's what changed and why it's worth paying attention to.
The biggest jump: Novel problem-solving
ARC-AGI-2 measures how well a model can reason through problems it hasn't seen before — generalization, not memorization.
- Sonnet 4.5: 13.6%
- Sonnet 4.6: 58.3%
- Increase: +44.7 percentage points
That's the largest single-generation improvement of any benchmark listed here, by a wide margin.
Agentic benchmarks
The benchmarks most relevant to tool use and automation all improved significantly:
- Agentic search (BrowseComp): 43.9% → 74.7% (+30.8pp)
- Scaled tool use (MCP-Atlas): 43.8% → 61.3% (+17.5pp)
- Agentic computer use: 61.4% → 72.5% (+11.1pp)
- Terminal coding: 51.0% → 59.1% (+8.1pp)
Sonnet 4.6 vs Opus 4.5
Worth noting — Sonnet 4.6 now outperforms Opus 4.5 on several benchmarks:
- Novel problem-solving (ARC-AGI-2): 58.3% vs 37.6%
- Agentic search: 74.7% vs 67.8%
- Agentic computer use: 72.5% vs 66.3%
Sonnet is the smaller, cheaper model tier — so this shifts the cost/performance equation for anyone building agentic workflows.
What this means practically
If you're building with tool use, MCP integrations, or multi-step AI workflows, the MCP-Atlas and BrowseComp improvements are the ones to watch. Models that reliably use tools and follow through on multi-step tasks open up a lot of what was previously too brittle to ship.
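To make that concrete, here's a minimal sketch of the kind of tool-use call these benchmarks exercise, written against Anthropic's Python SDK. The model identifier, tool name, and schema below are placeholder assumptions for illustration, not values taken from the release.

```python
import anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

# Hypothetical tool definition, for illustration only.
tools = [
    {
        "name": "get_order_status",
        "description": "Look up the shipping status of an order by its ID.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The order identifier, e.g. 'A1234'",
                },
            },
            "required": ["order_id"],
        },
    }
]

response = client.messages.create(
    model="claude-sonnet-4-6",  # assumed model id; check Anthropic's docs for the exact string
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Where is order A1234?"}],
)

# When the model decides to call a tool, the response contains a tool_use
# block with the arguments it chose. An agent loop would run the tool and
# send the result back as a tool_result block so the model can continue.
for block in response.content:
    if block.type == "tool_use":
        print(f"Model requested tool {block.name} with input {block.input}")
```

The agentic benchmark gains are essentially about how reliably a model handles that loop: picking the right tool, passing well-formed arguments, and carrying a task across many such calls without drifting.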