A consolidated reference built from Anthropic’s live documentation (fetched May 28, 2026). Covers the model’s API surface, the effort/thinking system, every section of the official Prompting Best Practices page, and the migration deltas that bite existing pipelines. Strategic notes for the NovCog stack are at the end.
Everything here is sourced from the canonical docs listed in Section 1. Where a figure came from a third-party source rather than Anthropic’s own docs, it is explicitly flagged as unverified.
1. Source URLs
Models & specs
Prompting & parameters
Mid-conversation system messages — referenced under the migration guide’s 4.8 section
Claude Code
Product page
Third-party (context only, not authoritative)
2. Opus 4.8 at a glance
|Attribute |Value |
|------------------------------|-------------------------------------------------------------------------|
|Claude API model ID |claude-opus-4-8 |
|Claude API alias |claude-opus-4-8 (dateless, but still a pinned snapshot — not evergreen)|
|AWS Bedrock ID |anthropic.claude-opus-4-8 |
|Vertex AI ID |claude-opus-4-8 |
|Input pricing |$5 / MTok |
|Output pricing |$25 / MTok |
|Context window |1M tokens (200k on Microsoft Foundry) |
|Max output (sync Messages API)|128k tokens |
|Max output (Batches API) |300k via output-300k-2026-03-24 beta header |
|Extended thinking |No (adaptive thinking only) |
|Adaptive thinking |Yes |
|Effort default |high on all surfaces (API and Claude Code) |
|Reliable knowledge cutoff |Jan 2026 |
|Training data cutoff |Jan 2026 |
|Priority Tier |Yes |
A note on model identity: starting with the 4.6 generation, model IDs use a dateless format that is still a pinned snapshot, not an evergreen pointer. So claude-opus-4-8 is a fixed release, not an alias that will silently roll forward.
Pricing caveat. Anthropic’s official models table lists $5 in / $25 out per MTok and nothing else. A third-party blog (Coursiv) cited a “fast mode” tier at $10 in / $50 out. That figure does not appear in Anthropic’s pricing table or models overview, so treat it as unverified until confirmed against the official pricing page. Batch API discounts and prompt-caching rates live on the dedicated pricing page.
You can query capabilities and token limits programmatically via the Models API; the response carries max_input_tokens, max_tokens, and a capabilities object per model.
3. API behavior & migration essentials (4.7 → 4.8)
There are no breaking API changes. Code running on Opus 4.7 runs unchanged on 4.8 — the migration is a one-line model-ID swap (claude-opus-4-7 → claude-opus-4-8). The items below are behavior deltas worth checking after the swap.
Sampling parameters still rejected. Setting temperature, top_p, or top_k to a non-default value returns a 400 error, same as 4.7. The SDK request types still define these fields for backward compatibility, so code that sets them type-checks — but the API rejects the request server-side. If any proxy or shim layer in front of the model injects sampling params, strip them.
Effort defaults to high. On 4.8 the effort parameter defaults to high on every surface including the Claude API and Claude Code. This is a change from 4.7, where Claude Code defaulted to xhigh. If you want the “best for coding/agentic” setting on 4.8 you must pass xhigh explicitly — it will not default there.
1M context by default. 1M-token context window applies on the Claude API, Amazon Bedrock, and Vertex AI. Microsoft Foundry is 200k at launch.
Mid-conversation system messages (new). You can place a role: "system" message immediately after a user turn in the messages array (subject to placement rules). This lets you append updated instructions later in a long-running conversation without restating the full system prompt — which preserves prompt-cache hits on the earlier turns and reduces input cost on agentic loops. No beta header required.
Refusal stop details (now public). The stop_details object on refusal responses (present since 4.7) is now publicly documented. When the model declines, this object describes the refusal category alongside the existing refusal stop reason, making programmatic handling easier.
Prefilled responses are gone (since 4.6). Prefilled assistant content on the last assistant turn is no longer supported and returns a 400 error. Adding assistant messages elsewhere in the conversation is unaffected. Migration paths are in Section 9.
4. Effort & thinking depth
The effort parameter trades intelligence against token spend. This is the single most important lever on 4.8 — the docs explicitly say effort is likely to matter more on this model than on any prior Opus, so test it actively on upgrade.
The ladder (top to bottom)
max — can deliver performance gains in some cases but shows diminishing returns from token usage and is sometimes prone to overthinking. Test it for intelligence-demanding tasks.
xhigh — best setting for most coding and agentic use cases.
high — balances token usage and intelligence; recommended minimum for intelligence-sensitive work. This is the default.
medium — cost-sensitive cases that need to reduce token usage while trading off intelligence.
low — short, scoped, latency-sensitive workloads that are not intelligence-sensitive.
Defaults by surface
Claude API: high. Pass xhigh/max/medium/low explicitly to override.
Claude Code: high on Opus 4.8, Opus 4.6, and Sonnet 4.6; xhigh on Opus 4.7. If you set a level the active model doesn’t support, Claude Code falls back to the highest supported level at or below it (e.g. xhigh runs as high on Opus 4.6).
Strict calibration
4.8 respects effort levels strictly, especially at the low end. At low and medium, the model scopes its work to exactly what was asked rather than going above and beyond. Good for latency and cost, but on moderately complex tasks at low there is real risk of under-thinking.
If you see shallow reasoning on complex problems, raise effort — do not prompt around it. If you must keep effort low for latency, add targeted guidance:
This task involves multi-step reasoning. Think carefully through the problem before responding.
Adaptive thinking
Thinking is OFF unless you explicitly set thinking: {type: "adaptive"}. With adaptive thinking enabled, 4.8 triggers reasoning only when it judges the turn needs it — direct answers on simple lookups, reasoning on complex multi-step problems. This cuts wasted thinking tokens on bimodal workloads versus 4.7 at the same effort level.
Triggering is steerable. If the model thinks more than you want (common with large/complex system prompts), steer it down:
Thinking adds latency and should only be used when it will meaningfully improve answer quality — typically for problems that require multi-step reasoning. When in doubt, respond directly.
When running at max or xhigh, set a large max-output budget so the model has room to think and act across subagents and tool calls. Start at 64k and tune.
Extended-thinking → adaptive migration (code)
Before (extended thinking, older models):
client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=64000,
thinking={"type": "enabled", "budget_tokens": 32000},
messages=[{"role": "user", "content": "..."}],
)
After (adaptive thinking):
client.messages.create(
model="claude-opus-4-8",
max_tokens=64000,
thinking={"type": "adaptive"},
output_config={"effort": "high"}, # or "max", "xhigh", "medium", "low"
messages=[{"role": "user", "content": "..."}],
)
budget_tokens is deprecated on the 4.6 generation and will be removed in a future model release. Prefer lowering effort or using max_tokens as a hard ceiling.
5. Opus 4.8-specific prompting behaviors
These are the behaviors that most often need tuning when you move onto 4.8. Most are net-new versus 4.7.
Response length / verbosity. 4.8 calibrates length to perceived task complexity rather than a fixed verbosity — shorter on lookups, much longer on open-ended analysis. To rein it in: “Provide concise, focused responses. Skip non-essential context, and keep examples minimal.” Positive examples of good concision beat negative “don’t do X” instructions.
Tool-use triggering. 4.8 favors reasoning over tool calls. This is usually better, but if you want more tool usage, the primary lever is raising effort — high/xhigh show substantially more tool usage in agentic search and coding. You can also describe explicitly when and how to use a given tool (e.g. tell it why and when to use a web-search tool if it isn’t calling it).
More literal instruction following. 4.8 interprets prompts literally and explicitly, especially at lower effort. It does not silently generalize an instruction from one item to the rest, and it does not infer requests you didn’t make. Upside: precision, less thrash, better for tuned API prompts / structured extraction / predictable pipelines. Cost: if you want broad application, state the scope (“Apply this formatting to every section, not just the first one”).
User-facing progress updates. 4.8 gives more regular, higher-quality interim updates on long agentic traces. If you have scaffolding forcing status messages (“after every 3 tool calls, summarize progress”), try removing it. If updates aren’t calibrated to your use case, describe what they should look like and give examples.
Subagent spawning. 4.8 spawns fewer subagents by default. Steerable — give explicit guidance on when subagents are desirable (e.g. “don’t spawn a subagent for work you can do directly in a single response; do spawn multiple in the same turn when fanning out across items or reading multiple files”).
Design / frontend defaults. 4.8 has a persistent house style: warm cream/off-white backgrounds (~#F4F1EA), serif display type (Georgia, Fraunces, Playfair), italic word-accents, terracotta/amber accent. Reads well for editorial / hospitality / portfolio; wrong for dashboards, dev tools, fintech, healthcare, enterprise. It shows up in slide decks too. Generic negations (“don’t use cream,” “make it clean and minimal”) tend to shift it to a different fixed palette rather than produce variety. Two reliable fixes: (1) specify a concrete alternative palette/spec — it follows explicit specs precisely; or (2) have it propose ~4 distinct directions (bg hex / accent hex / typeface + one-line rationale) before building, let the user pick, then implement only that one. This second approach replaces the old temperature-for-variety trick.
Interactive coding products. 4.8 uses more tokens in interactive (multi-turn) settings than in autonomous single-turn settings, because it reasons more after user turns. This improves long-horizon coherence and instruction following but costs tokens. To get both performance and efficiency: use xhigh or high, add autonomy (e.g. an auto mode), reduce required human interactions, and front-load task/intent/constraints in the first turn. Ambiguous prompts dribbled out over many turns reduce efficiency and sometimes performance.
Code review harnesses. 4.8 is meaningfully better at finding bugs (higher recall and precision in internal evals). But if your harness was tuned for an older model and says “only report high-severity,” “be conservative,” or “don’t nitpick,” 4.8 follows that bar more faithfully — it investigates just as deeply but reports fewer low-severity findings. Net effect: precision rises, measured recall can fall, even though underlying ability improved. Fix: separate finding from filtering. Tell it the finding stage is for coverage and a downstream step will filter:
Report every issue you find, including ones you are uncertain about or consider low-severity. Do not filter for importance or confidence at this stage - a separate verification step will do that. Your goal here is coverage: it is better to surface a finding that later gets filtered out than to silently drop a real bug. For each finding, include your confidence level and an estimated severity so a downstream filter can rank them.
If you want single-pass self-filtering, be concrete about the bar instead of using words like “important” (e.g. “report any bug that could cause incorrect behavior, a test failure, or a misleading result; only omit pure style/naming nits”).
Computer use. Works across resolutions up to 2576px / 3.75MP. 1080p is the recommended performance/cost balance; 720p or 1366×768 are lower-cost options with strong performance. Experimenting with effort also tunes behavior here.
6. General prompting principles & output control
These are largely stable craft, unchanged in spirit but still the backbone.
Be clear and direct. Treat Claude like a brilliant new hire with no context on your norms. Be specific about output format and constraints; use numbered/bulleted steps when order or completeness matters. If you want “above and beyond,” request it explicitly. Golden rule: if a colleague with minimal context would be confused by your prompt, so will Claude.
Add context / motivation. Explaining why an instruction matters generalizes better than a bare command. (“Never use ellipses” → “This will be read by a text-to-speech engine, so never use ellipses since it won’t know how to pronounce them.”)
Use examples (few-shot). 3–5 examples is the sweet spot. Make them relevant (mirror the real use case), diverse (cover edge cases, vary enough that the model doesn’t latch onto an unintended pattern), and structured (wrap each in tags, the set in ).
Structure with XML tags. Use consistent descriptive tags (<instructions>, , ) to keep mixed content unambiguous; nest where there’s natural hierarchy.
Give a role. A single system-prompt sentence (“You are a helpful coding assistant specializing in Python.”) meaningfully focuses behavior and tone.
Long-context prompting (20k+ tokens). Put longform data at the TOP, above your query/instructions/examples — queries at the end can improve response quality by up to ~30% in tests, especially with multi-document inputs. Wrap each document in with and subtags. For long-doc tasks, ask the model to extract relevant quotes first (into tags) before doing the task — this cuts through noise. (This quote-first pattern maps directly onto a first-pass source layer like Atlas.)
Output / formatting controls.
Tell it what to do, not what not to do (“compose smoothly flowing prose paragraphs” beats “don’t use markdown”).
Use XML format indicators (<smoothly_flowing_prose_paragraphs>).
Match your prompt style to the desired output style (remove markdown from the prompt to reduce markdown in the output).
For hard formatting control, a detailed block is provided in the docs (full snippet in the appendix concept below).
LaTeX is the default for math; add an explicit plain-text instruction if you want plain text.
7. Tool use & parallel calling
Be explicit when you want action. “Can you suggest some changes” often yields suggestions, not edits. “Change this function to improve performance” / “Make these edits to the auth flow” yields action.
To bias toward action by default, a block tells the model to implement rather than only suggest and to infer the most useful action when intent is ambiguous. To bias the other way, a block keeps it to research/recommendations until explicitly told to change things.
Tune down old anti-laziness language. 4.5/4.6 (and by extension 4.8) are more responsive to the system prompt than older models, so prompts written to fix under-triggering now risk over-triggering. Replace “CRITICAL: You MUST use this tool when…” with plain “Use this tool when…”.
Parallel tool calling. The latest models excel at parallel execution (speculative searches, reading several files at once, parallel bash — which can even bottleneck the host). Success rate is high without prompting; a block pushes it toward ~100% (make independent calls in parallel, keep dependent calls sequential, never guess missing params). To reduce parallelism: “Execute operations sequentially with brief pauses between each step to ensure stability.”
8. Agentic systems (the dense, high-value section)
Long-horizon reasoning & state tracking. The latest models maintain orientation across extended sessions by making incremental progress on a few things at a time. This shines across multiple context windows: work, save state, continue fresh.
Context awareness / token-budget tracking. 4.6/4.5 models track their remaining context window. If your harness compacts context or can save state to files (like Claude Code), tell the model so it behaves accordingly — otherwise it may try to wrap up early as it nears the limit. Sample guidance: tell it the window auto-compacts so it can work indefinitely, to save progress to memory before refresh, and never to stop early over budget concerns. The memory tool pairs naturally with this.
Multi-context-window workflows.
Use a different prompt for the first window — set up the framework (write tests, setup scripts) there, then iterate on a todo-list in later windows.
Have the model write tests in a structured format (tests.json) and tell it tests are not to be removed/edited (“this could lead to missing or buggy functionality”).
Set up quality-of-life scripts (init.sh) to start servers, run suites and linters — prevents repeated work on fresh windows.
Prefer starting a brand-new window over compaction; the models are very good at rediscovering state from the local filesystem. Be prescriptive on startup (“call pwd; review progress.txt, tests.json, and git logs; run a fundamental integration test before new features”).
Provide verification tools (Playwright MCP, computer use) so it can verify correctness without a human in the loop.
Encourage full use of context (“it’s encouraged to spend your entire output context… just don’t run out with significant uncommitted work”).
State management. Structured formats (JSON) for structured state like test results; freeform text for progress notes; git for checkpoints and history (the models are especially strong at git-based state tracking); emphasize incremental progress.
Autonomy vs. safety. Without guidance, 4.6 may take hard-to-reverse actions (deleting files, force-push, posting to external services). A confirmation-gating block lists what warrants a check first: destructive ops (rm -rf, dropping tables, deleting branches), hard-to-reverse ops (git push --force, git reset --hard, amending published commits), and externally-visible ops (pushing code, commenting on PRs, sending messages, modifying shared infra) — and tells it not to use destructive shortcuts (no --no-verify, no discarding unfamiliar in-progress files) when blocked.
Research / information gathering. Strong agentic search. For best results: provide clear success criteria, encourage cross-source verification, and for complex work use a structured hypothesis-tree approach:
Search for this information in a structured way. As you gather data, develop several competing hypotheses. Track your confidence levels in your progress notes to improve calibration. Regularly self-critique your approach and plan. Update a hypothesis tree or research notes file to persist information and provide transparency. Break down this complex research task systematically.
Subagent orchestration. The latest models recognize when to delegate and do so without explicit instruction. Provide well-defined subagent tools and let it orchestrate. Watch for overuse — 4.6 has a strong predilection for subagents and may spawn them where a direct grep would be faster. If overused, give explicit when/when-not guidance.
Prompt chaining. With adaptive thinking + subagent orchestration, most multi-step reasoning is handled internally. Explicit chaining (sequential API calls) is still useful when you need to inspect intermediate outputs or enforce a pipeline. The most common pattern is self-correction: draft → review against criteria → refine, each as its own call so you can log/evaluate/branch.
Reduce file creation. The models may create temp files/scripts as a scratchpad (often helpful for agentic coding). To minimize net-new files: tell it to clean up temporary files at the end of the task.
Overengineering / overeagerness. 4.5/4.6 tend to over-engineer (extra files, unnecessary abstractions, unrequested flexibility). A “minimize overengineering” block constrains scope (only requested/necessary changes), documentation (no docstrings/comments on untouched code), defensive coding (validate only at system boundaries), and abstractions (no helpers for one-time ops, no designing for hypothetical futures).
Anti-test-gaming / no hard-coding. A prompt to force general solutions: write a high-quality general-purpose solution with standard tools, no helper-script workarounds, works for all valid inputs not just the test cases, no hard-coded values; tests verify correctness, they don’t define the solution; if the task is infeasible or a test is wrong, say so rather than working around it.
Anti-hallucination in agentic coding. An block: never speculate about code you haven’t opened; if the user references a file, you MUST read it before answering; investigate relevant files before making claims; give grounded, hallucination-free answers.
9. Migration specifics
Migrating away from prefilled responses (since 4.6). Prefills on the last assistant turn return a 400. Paths:
Output formatting (JSON/YAML/classification): use the Structured Outputs feature, or just ask for the schema (newer models match complex schemas reliably, especially with retries); for classification use tools with an enum field or structured outputs.
Eliminating preambles: system-prompt instruction (“Respond directly without preamble. Do not start with ‘Here is…’, ‘Based on…’”), or output within XML tags / structured outputs / tool calls; strip stragglers in post-processing.
Avoiding bad refusals: mostly unnecessary now — clear prompting in the user message suffices.
Continuations: move the continuation into the user message and include the interrupted text (“Your previous response was interrupted and ended with […]. Continue from where you left off.”), or just retry if there’s no UX penalty.
Context hydration / role consistency: inject what were previously prefilled reminders into the user turn; in complex agentic systems, hydrate via tools or during compaction.
General 4.6-generation migration. Be specific about desired behavior; frame instructions with quality modifiers; request features (animations, interactivity) explicitly; move to adaptive thinking + effort; migrate off prefills; and tune down anti-laziness prompting (these models are more proactive and over-trigger on old aggressive instructions).
Sonnet 4.5 → 4.6. Sonnet 4.6 defaults to high effort (4.5 had no effort param), so set effort explicitly on migration or you may see higher latency. Recommended: medium for most apps, low for high-volume/latency-sensitive; set a large max-output budget (64k) at medium/high. For the hardest, longest-horizon problems, Opus remains the right call. Detailed before/after code (thinking disabled at low effort; adaptive at high; and a transitional budget_tokens config around 16k) is in the docs.
The operational implications worth acting on.
Sampling-param trap. If your DeepSeek shim or any LiteLLM/OpenRouter proxy in the path sets temperature/top_p/top_k, a 400 will surface only when you point that path at a real Opus 4.8 endpoint. Worth checking before the external red-team call.
OSINT / intelligence reports. Lift the structured-research prompt (competing hypotheses + confidence tracking + hypothesis-tree file) directly into the report workflow. And note the code-review insight generalizes: if your fact-check/red-team prompts say “only flag significant issues,” 4.8 will faithfully drop low-severity findings — separate the finding stage (coverage) from the filtering stage.
OpenClaw SKILL.md. The agentic-systems snippets (multi-window setup, init.sh QoL scripts, anti-hallucination , confirmation-gating for destructive ops) are the obvious candidates to encode as skill conventions. The “start fresh over compaction, rediscover state from filesystem” guidance aligns with a file-driven skill layer.
Effort defaults changed. If muscle memory from 4.7 has you assuming Claude Code runs xhigh, note 4.8 defaults to high everywhere — set xhigh explicitly for coding/agentic runs.
Compiled from Anthropic documentation fetched May 28, 2026. Verify pricing and any “fast mode” tier against the official pricing page before relying on it.