Sonnet 4.6 Released! — Benchmark Breakdown
Anthropic released Sonnet 4.6 today. Here's what changed and why it's worth paying attention to.
The biggest jump: Novel problem-solving
ARC-AGI-2 measures how well a model can reason through problems it hasn't seen before — generalization, not memorization.
  • Sonnet 4.5: 13.6%
  • Sonnet 4.6: 58.3%
  • Increase: +44.7 percentage points
That's the largest single-generation improvement of any benchmark here, by a wide margin.
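One way to read that delta, since percentage points and percent often get conflated: the absolute gain is +44.7pp, but the relative jump is roughly 4.3×. A quick sketch using the scores above:

```python
# Percentage points vs. relative improvement, using the ARC-AGI-2 scores above
old, new = 13.6, 58.3

pp_delta = new - old   # absolute gain, in percentage points
relative = new / old   # how many times higher the new score is

print(f"+{pp_delta:.1f}pp")  # +44.7pp
print(f"{relative:.1f}x")    # ~4.3x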
Agentic benchmarks
The benchmarks most relevant to tool use and automation all improved significantly:
  • Agentic search (BrowseComp): 43.9% → 74.7% (+30.8pp)
  • Scaled tool use (MCP-Atlas): 43.8% → 61.3% (+17.5pp)
  • Agentic computer use: 61.4% → 72.5% (+11.1pp)
  • Terminal coding: 51.0% → 59.1% (+8.1pp)
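If you want to slice these numbers yourself, they drop straight into a few lines of Python. A minimal sketch (scores copied from the lists above, sorted by gain):

```python
# Sonnet 4.5 -> Sonnet 4.6 scores as reported above, in percent
scores = {
    "ARC-AGI-2 (novel problem-solving)": (13.6, 58.3),
    "BrowseComp (agentic search)":       (43.9, 74.7),
    "MCP-Atlas (scaled tool use)":       (43.8, 61.3),
    "Agentic computer use":              (61.4, 72.5),
    "Terminal coding":                   (51.0, 59.1),
}

# Rank benchmarks by absolute gain in percentage points, largest first
for name, (old, new) in sorted(scores.items(),
                               key=lambda kv: kv[1][1] - kv[1][0],
                               reverse=True):
    print(f"{name}: {old}% -> {new}% (+{new - old:.1f}pp)")
```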
Sonnet 4.6 vs Opus 4.5
Worth noting — Sonnet 4.6 now outperforms Opus 4.5 on several benchmarks:
  • Novel problem-solving: 58.3% vs 37.6%
  • Agentic search: 74.7% vs 67.8%
  • Agentic computer use: 72.5% vs 66.3%
Sonnet is the smaller, cheaper model tier — so this shifts the cost/performance equation for anyone building agentic workflows.
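Rough arithmetic on that equation, as a sketch only: the prices below are hypothetical placeholders, not Anthropic's published rates, so swap in the numbers from their pricing page before drawing conclusions. The shape of the comparison holds either way.

```python
# Hypothetical per-million-input-token prices -- placeholders, NOT real rates.
PRICE = {"sonnet-4.6": 3.00, "opus-4.5": 5.00}  # USD per 1M tokens (assumed)

# Agentic search (BrowseComp) scores from the comparison above, in percent
SCORE = {"sonnet-4.6": 74.7, "opus-4.5": 67.8}

# If the smaller model scores higher AND costs less, the ratio isn't close
for model in PRICE:
    print(f"{model}: {SCORE[model] / PRICE[model]:.1f} score points per $/Mtok")
```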
What this means practically
If you're building with tool use, MCP integrations, or multi-step AI workflows, the MCP-Atlas and BrowseComp improvements are the ones to watch. Models that reliably use tools and follow through on multi-step tasks open up a lot of what was previously too brittle to ship.
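For concreteness, here's what a minimal multi-step tool-use loop looks like with the Anthropic Python SDK. This is a sketch, not the benchmark harness: the model id is an assumption (check the docs for the real identifier), and run_tool is a hypothetical dispatcher you'd write yourself.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# One toy tool; MCP servers expose their tools in the same schema shape
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def run_tool(name: str, args: dict) -> str:
    # Hypothetical dispatcher -- route to your real implementations here
    return "18C and sunny" if name == "get_weather" else "unknown tool"

messages = [{"role": "user", "content": "What's the weather in Paris?"}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",  # assumed id; confirm against Anthropic's docs
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # the model answered in plain text; we're done

    # Echo the assistant turn back, then answer each tool call it made
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": [
        {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": run_tool(block.name, block.input),
        }
        for block in response.content if block.type == "tool_use"
    ]})

print(next(b.text for b in response.content if b.type == "text"))
```

That loop is the whole trick: the agentic benchmarks above are essentially measuring how many of these round trips a model can string together without losing the plot.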