Opus 4.7: 10 things that actually matter
A practitioner's read on the April 16, 2026 release. Numbers cited come from Anthropic's system card or named partner benchmarks.
## 1. Coding is the real jump
SWE-bench Verified 80.8% → 87.6%. SWE-bench Pro 53.4% → 64.3%. CursorBench 58% → 70%. Anthropic’s internal 93-task benchmark reports a 13% lift across the suite. Rakuten’s partner eval claims 3x more production tasks resolved vs 4.6. On multi-file work, fewer back-and-forth loops and more one-shot fixes.
## 2. Agents run shorter and cleaner
Long-running loops reason more before acting. Notion AI reports ~14% improvement on multi-step workflows with one-third the tool errors. Box's figure: average tool calls per workflow dropped from 16.3 (on 4.6) to 7.1 (on 4.7). Fewer, more decisive steps instead of noisy chatter.
## 3. Vision is finally usable for screenshots
Long-edge resolution 1,568px (1.15MP) → 2,576px (3.75MP), roughly 3x the pixels. XBOW visual-acuity 54.5% → 98.5%. OSWorld-Verified computer use 72.7% → 78.0%. This is the change that actually unlocks dense-UI automation, diagram parsing, and screenshot-based QA.
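If you pre-resize screenshots before upload, the only number that changes is the long-edge cap. A minimal sketch (the function name and rounding choice are mine; the 2,576px limit comes from the figures above):

```python
def fit_long_edge(width: int, height: int, max_edge: int = 2576) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_edge.

    Returns the size unchanged if it already fits.
    """
    long_side = max(width, height)
    if long_side <= max_edge:
        return width, height
    scale = max_edge / long_side
    # Round to whole pixels, but never below 1.
    return max(1, round(width * scale)), max(1, round(height * scale))

# A 4K screenshot (3840x2160) now only needs a mild downscale:
print(fit_long_edge(3840, 2160))  # (2576, 1449)
```

Under the old 1,568px cap the same screenshot lost well over half its linear resolution, which is why dense UI text used to blur out.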
## 4. Still 1M context
Context window and output limits match 4.6. Pipelines built around long documents or extended chains don’t need architectural changes. Self-verification is better, so coherence over long multi-step runs holds up longer.
## 5. Honesty and safety moved in the right direction
Reduced hallucinations and sycophancy, tougher against prompt injection. Good for client-facing systems. Note: 4.7 is also more conservative around offensive security work. Anthropic launched a Cyber Verification Program for approved red-team use cases.
## 6. Sharper codebase understanding
CodeRabbit reports more real bugs found, more actionable reviews, and better cross-file reasoning than any model they’ve evaluated. The model builds a more persistent internal map of a repo instead of brute-forcing every file. Claude Code also shipped a new `/ultrareview` command for dedicated review passes.
## 7. New xhigh effort tier
Sits between high and max. Claude Code now defaults to xhigh. Hex’s early-access finding: low-effort 4.7 matches medium-effort 4.6. If you were leaning on high in 4.6, xhigh is the new sweet spot. More reasoning headroom without jumping to max costs.
## 8. Same sticker price, higher real cost
Pricing unchanged at $5 / $25 per million tokens. The new tokenizer maps the same input to 1.0-1.35x more tokens, and higher effort levels produce longer outputs on reasoning-heavy turns. A drop-in 4.6 → 4.7 swap will often cost more per job. Run `/v1/messages/count_tokens` against production prompts before flipping the flag at scale.
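For budgeting before you have real `count_tokens` numbers, a back-of-envelope worst case is enough. A sketch (the function name and defaults are mine; only the $5/MTok input price and the 1.0-1.35x tokenizer range come from the section above):

```python
def projected_input_cost(tokens_46: int, price_per_mtok: float = 5.00,
                         tokenizer_factor: float = 1.35) -> float:
    """Worst-case projected 4.7 input cost for a prompt that measured
    tokens_46 tokens under the 4.6 tokenizer, assuming the new tokenizer
    inflates counts by up to tokenizer_factor (1.0-1.35x)."""
    return tokens_46 * tokenizer_factor * price_per_mtok / 1_000_000

# A 200k-token context that cost $1.00 of input on 4.6 can reach $1.35 on 4.7:
print(round(projected_input_cost(200_000), 2))  # 1.35
```

This bounds input cost only; reasoning-heavy turns at higher effort also lengthen outputs, so measure those separately.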
## 9. Instruction following is more literal
4.7 treats “consider,” “you might,” and soft-suggestion bullets closer to hard requirements than 4.6 did. Prompts tuned for looser interpretation can break or turn rigid. This is the change most likely to require a prompt audit before flipping production over.
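One cheap first pass on that audit is mechanical: scan system prompts for soft phrasing before a human reviews the hits. A minimal sketch (the phrase list is mine and illustrative, not exhaustive):

```python
import re

# Soft-suggestion phrasings that 4.7 tends to treat as hard requirements.
SOFT_PATTERNS = [
    r"\bconsider\b",
    r"\byou might\b",
    r"\bif possible\b",
    r"\bideally\b",
    r"\bfeel free to\b",
]

def audit_soft_language(prompt: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for lines containing soft phrasing."""
    hits = []
    for i, line in enumerate(prompt.splitlines(), start=1):
        if any(re.search(p, line, re.IGNORECASE) for p in SOFT_PATTERNS):
            hits.append((i, line.strip()))
    return hits

system_prompt = """You are a support agent.
Consider keeping replies under 100 words.
Always include the ticket ID.
You might add a friendly sign-off."""

for lineno, text in audit_soft_language(system_prompt):
    print(f"line {lineno}: {text}")  # flags lines 2 and 4
```

Each hit is a decision point: promote it to an explicit requirement, or delete it if it was never meant to bind.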
## 10. Benchmarks are great; community sentiment is mixed
The benchmark wins are real and consistently replicated across partners. At the same time, vocal power users on r/ClaudeAI and r/ClaudeCode describe regressions in long-context chat and certain instruction-tuned workflows. Some of that is the literal-instruction change biting existing prompts. Some is ordinary benchmark-vs-production divergence. Regression-test the flows you actually run in production before making 4.7 your default.
-----
## Short migration checklist
1. Audit system prompts for soft language (“consider,” “you might,” suggestion bullets). These now read as harder requirements.
2. Run `count_tokens` on production prompts. Expect up to 35% more tokens.
3. Regression-test your top 3-5 workflows before flipping the default model.
4. For agent loops, try one effort tier lower first. 4.7 at lower effort often matches 4.6 at higher effort.
5. If you do visual or screenshot work, this is the biggest capability unlock in 4.7. Prioritize testing there.
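The regression step in the checklist can start smaller than a full eval suite: a harness that pairs baseline and candidate outputs per workflow so you can diff or eyeball them. Everything here is a placeholder sketch (`call_model` stands in for your real client code, and the model ID strings and workflow prompts are illustrative):

```python
def call_model(model: str, prompt: str) -> str:
    # Placeholder: swap in your actual API call.
    return f"[{model}] response to: {prompt}"

# The 3-5 workflows you actually run in production.
WORKFLOWS = {
    "ticket_summary": "Summarize ticket #123 in two sentences.",
    "code_review": "Review this diff for null-pointer risks.",
}

def regression_run(baseline: str, candidate: str) -> dict[str, dict[str, str]]:
    """Collect paired outputs per workflow for side-by-side comparison."""
    return {
        name: {
            baseline: call_model(baseline, prompt),
            candidate: call_model(candidate, prompt),
        }
        for name, prompt in WORKFLOWS.items()
    }

results = regression_run("opus-4.6", "opus-4.7")
for name, pair in results.items():
    print(name, "->", list(pair))
```

From here, replace the print loop with whatever checks matter to you: exact-match assertions for structured outputs, or an LLM-as-judge pass for free-form ones.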
Matthew Sutherland