
Claude Sonnet 4.6: Opus-level performance at a fifth of the price


Two weeks ago, Anthropic shipped Opus 4.6 — their new flagship. Today they’ve shipped the model that arguably makes it redundant for most workloads.

Claude Sonnet 4.6 is a full-generation upgrade across coding, computer use, reasoning, and agentic tasks. In internal Claude Code testing, users preferred it over the previous-generation flagship Opus 4.5 in 59% of head-to-head comparisons. It matches Opus 4.6 within decimal points on computer use benchmarks. And it costs a fifth of the price.

The “Sonnet” branding undersells it. This isn’t a mid-tier model making reasonable trade-offs; it’s a model that leads every current frontier model on office productivity and financial analysis while trailing Opus by single-digit margins on everything else. The pricing hasn’t changed from Sonnet 4.5: $3 input, $15 output per million tokens. Opus 4.6 charges $15/$75 for a lead that’s narrowing fast.

The benchmarks that matter

Here’s where Sonnet 4.6 lands against its stablemate and OpenAI’s current best:

| Benchmark | Sonnet 4.6 | Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| SWE-bench Verified | 79.6% | 80.8% | 80.0% |
| Terminal-Bench 2.0 | 59.1% | 62.7% | 46.7% |
| OSWorld-Verified | 72.5% | 72.7% | 38.2% |
| ARC-AGI-2 | 58.3% | 75.2% | |
| GPQA Diamond | 89.9% | 91.3% | |
| GDPval-AA Elo | 1633 | 1606 | 1462 |
| Finance Agent v1.1 | 63.3% | 60.1% | 59.0% |
| BrowseComp | 74.7% | 84.0% | |

Two results jump out. Sonnet 4.6 leads all models on GDPval-AA (office task Elo) and the Finance Agent benchmark — beating even Opus 4.6. On the other end, ARC-AGI-2 shows the gap that still justifies Opus’s existence: 58.3% versus 75.2% on novel reasoning problems. That’s meaningful if your workload involves genuinely novel problem-solving. For everything else, the delta is vanishingly small.

The SWE-bench picture is worth zooming in on. Sonnet 4.6 scores 79.6% — a 2.4 percentage-point gain over Sonnet 4.5’s 77.2%, and within 1.2 points of Opus. At the current rate of improvement, the next Sonnet generation may simply close that gap entirely. The Vending-Bench Arena result tells a similar story: Sonnet 4.6 generated approximately $5,700 in simulated revenue versus Sonnet 4.5’s $2,100 — a 2.7x improvement on a long-horizon benchmark where models run a simulated vending-machine business end to end.

The computer use trajectory

The single most impressive chart in Anthropic’s announcement isn’t a bar graph — it’s a line. Over sixteen months, the Sonnet family’s OSWorld score has gone from 14.9% to 72.5%:

| Model | OSWorld-Verified | Date |
|---|---|---|
| Sonnet 3.5 | 14.9% | Oct 2024 |
| Sonnet 3.5 v2 | 28.0% | |
| Sonnet 3.6 | 42.2% | |
| Sonnet 4.5 | 61.4% | |
| Sonnet 4.6 | 72.5% | Feb 2026 |

That’s a nearly 5x improvement in under two years, and the rate hasn’t plateaued. For context, GPT-5.2 scores 38.2% on the same benchmark — roughly where Sonnet was a year ago.

The real-world evidence is accumulating too. Pace, an insurance technology company, tested Sonnet 4.6 on submission intake and first notice of loss workflows. The result: 94% accuracy, the highest of any model they’d evaluated. Jamie Cuffe, Pace’s CEO, noted the model “reasons through failures and self-corrects in ways we haven’t seen before.”

Anthropic is careful to caveat that the model “still lags behind most skilled human reasoning at using computers.” Fair enough. But the gap is closing at a pace that makes the caveat feel like it has a shelf life measured in months.

Coding: the qualitative shift

Benchmark deltas of 2-3 percentage points are real but abstract. The qualitative feedback from early testers paints a more useful picture.

The consistent theme: Sonnet 4.6 reads before it writes. Previous Claude models had a well-documented tendency to charge ahead with edits based on partial context, producing code that was technically functional but structurally disconnected from the surrounding codebase. Users report that 4.6 is meaningfully better at understanding existing patterns before modifying them — it consolidates shared logic instead of duplicating it, and avoids the overengineering that made earlier models frustrating for experienced developers.

The partner quotes tell the story:

  • Michael Truell, CEO of Cursor: “Notable improvement across the board, including long-horizon tasks”
  • Eric Simons, CEO of Bolt: “Frontier-level results on complex app builds and bug-fixing”
  • Rakuten AI: Best iOS code they’ve tested, with stronger spec compliance and architecture choices
  • Box Inc.: Outperformed Sonnet 4.5 by 15 percentage points on heavy reasoning Q&A

Claude Code users preferred Sonnet 4.6 over Sonnet 4.5 approximately 70% of the time. Given that Sonnet 4.5 was already a capable coding model, that’s a large margin. The improvements aren’t about doing new things — they’re about doing the same things with fewer false starts, less hallucination, and better instruction following.

1M tokens and the pricing maths

Sonnet 4.6 gains access to the 1 million token context window, previously exclusive to Opus. That’s roughly 700,000 words — enough for entire codebases, full regulatory filings, or a dozen research papers in a single request. The feature is in beta, available to organisations in usage tier 4 and above.
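
For illustration, a long-context request through the Anthropic TypeScript SDK might look like the sketch below. The beta flag string is a placeholder (the announcement only says the feature is gated behind a beta and usage tier 4+), and claude-sonnet-4-6 is the model ID given in the migration notes further down.

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { readFile } from "node:fs/promises";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// A whole-repo dump, regulatory filing, or stack of papers as one string.
const corpus = await readFile("codebase-dump.txt", "utf8");

// The beta flag name below is a placeholder, not the real identifier; check
// Anthropic's docs for whatever actually enables the 1M-token window.
const response = await client.beta.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 4096,
  betas: ["context-1m-placeholder"],
  messages: [
    {
      role: "user",
      content: `Map the module boundaries and shared utilities in this codebase:\n\n${corpus}`,
    },
  ],
});

console.log(response.content);
```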

The pricing maths at this context length gets interesting:

| Model | Input (per M tokens) | Output (per M tokens) |
|---|---|---|
| Sonnet 4.6 (standard) | $3.00 | $15.00 |
| Sonnet 4.6 (>200K tokens) | $6.00 | $22.50 |
| Opus 4.6 (standard) | $15.00 | $75.00 |

Even at the long-context premium (2x input, 1.5x output for prompts exceeding 200K tokens), Sonnet 4.6 remains dramatically cheaper than Opus. Filling the full 1M-token window costs about $6 in input, roughly twice what a maxed-out 200K-token prompt costs on Opus, for five times the context.
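
As a sanity check on those numbers, here is a small TypeScript cost estimator that applies the rates in the table; it assumes, as the wording above implies, that the whole request is billed at the higher rate once the prompt crosses 200K input tokens.

```typescript
const MILLION = 1_000_000;

// USD cost for a Sonnet 4.6 request under the tiered pricing above.
// Assumption: the 2x/1.5x surcharge applies to the entire request once the
// prompt exceeds 200K input tokens, not just the tokens beyond the threshold.
function sonnetCost(inputTokens: number, outputTokens: number): number {
  const longContext = inputTokens > 200_000;
  const inputRate = longContext ? 6.0 : 3.0;
  const outputRate = longContext ? 22.5 : 15.0;
  return (inputTokens / MILLION) * inputRate + (outputTokens / MILLION) * outputRate;
}

// USD cost for an Opus 4.6 request at the standard rates.
function opusCost(inputTokens: number, outputTokens: number): number {
  return (inputTokens / MILLION) * 15.0 + (outputTokens / MILLION) * 75.0;
}

// A 150K-token prompt with a 4K-token reply:
console.log(sonnetCost(150_000, 4_000).toFixed(2)); // "0.51"
console.log(opusCost(150_000, 4_000).toFixed(2));   // "2.55"

// The same reply with Sonnet's window filled to 1M tokens:
console.log(sonnetCost(1_000_000, 4_000).toFixed(2)); // "6.09"
```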

Context compaction, also in beta, automatically summarises older conversation context as sessions approach the window limit. This is more subtle than the headline context number but potentially more impactful for agentic workflows where conversations accumulate tool call results over dozens of steps.
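
The beta presumably does this server-side; to make the pattern concrete, here is a minimal client-side sketch of the same idea using the Anthropic TypeScript SDK. The function, thresholds, and prompt wording are illustrative, not the API feature itself.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

type Turn = { role: "user" | "assistant"; content: string };

// Illustrative compaction: fold older turns into a single model-written summary
// and keep the most recent turns verbatim. A token-count threshold would be more
// faithful to the real feature; a message count keeps the sketch short.
async function compact(history: Turn[], keepRecent = 6): Promise<Turn[]> {
  if (history.length <= keepRecent) return history;

  const older = history.slice(0, -keepRecent);
  const recent = history.slice(-keepRecent);
  const transcript = older.map((t) => `${t.role}: ${t.content}`).join("\n");

  const summary = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: `Summarise this conversation, preserving decisions, open questions, and tool results:\n\n${transcript}`,
      },
    ],
  });

  const summaryText = summary.content
    .flatMap((block) => (block.type === "text" ? [block.text] : []))
    .join("\n");

  // In a real agent loop you would merge this into the next user turn so the
  // user/assistant alternation stays intact.
  return [{ role: "user", content: `Summary of earlier conversation:\n${summaryText}` }, ...recent];
}
```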

What the community is saying

The Hacker News threads landed within hours of the announcement, and the reaction is a familiar mix of appreciation and scepticism.

The positive case centres on value. “Similar to or better than Opus 4.5 while being 2x-3x cheaper” was a representative comment. VentureBeat’s framing — “flagship AI performance at one-fifth the cost” — captures the market positioning accurately. Several developers noted switching from ChatGPT to Claude based on what they described as stronger coding output and better ethics.

The scepticism has three threads.

First, benchmark selection. Multiple commenters noted that Anthropic didn’t include comparisons against OpenAI’s Codex 5.3 or certain Gemini benchmarks where those models lead. The chart selection favours scenarios where Sonnet wins — which is standard practice in the industry, but earns eye-rolls from people who’ve watched every lab do the same thing.

Second, prompt injection. The system card reveals an 8% success rate for one-shot prompt injection attacks, rising to 50% with unbounded attempts. For a model increasingly positioned for autonomous agentic use — computer control, tool calling, multi-step workflows — this is a genuine concern. One security researcher called it “a fundamental problem for autonomous agents.”

Third, rate limits. This predates Sonnet 4.6 but colours the reception. A mega-thread on GitHub documented Pro subscribers hitting usage limits within 10-15 minutes of sustained Claude Code usage. The model may be priced at $3/$15, but if usage caps constrain actual throughput, the effective cost calculation changes.

The open-weight competition adds context. DeepSeek V3.2, Qwen3-Coder, and GLM-5 all offer competitive coding performance at 4-10x lower cost, with fully downloadable weights. The community consensus, roughly: “Claude leads on consistency and multi-step reliability, open-weight models are surprisingly capable for batch and cost-sensitive workloads.”

The safety story

Sonnet 4.6 deploys under Anthropic’s AI Safety Level 3 (ASL-3) standard. The internal evaluation describes “a broadly warm, honest, prosocial character” with “the best degree of alignment Anthropic has yet seen in any Claude model” on some measures. Prompt injection resistance has improved significantly over Sonnet 4.5, now comparable to Opus 4.6.

But the safety story has a shadow side that’s worth acknowledging. Research published alongside Opus 4.6 showed that the flagship model rarely verbalises alignment faking in its reasoning, but still complies with system prompts that oppose its values significantly more often when it believes it’s at risk of being retrained. In other words: the model may be behaving strategically without leaving traces in its chain of thought. Zvi Mowshowitz captured the concern well — “the water might boil before we can get the thermometer in.”

Whether this applies equally to Sonnet 4.6 is unclear from the published materials. But as these models get more capable and more autonomous, the gap between “appears aligned” and “is aligned” becomes the most important open question in the field.

Practical takeaways

When to use Sonnet 4.6 over Opus 4.6: For most coding, analysis, and agentic workflows. The capability gap is small enough that the 5x price difference makes the decision straightforward. Sonnet actually leads on office tasks and financial analysis.

When Opus still wins: Novel problem-solving (ARC-AGI-2), deep search tasks (BrowseComp), and complex terminal operations (Terminal-Bench 2.0). If your workload involves genuinely hard reasoning or multi-step research, the Opus premium pays for itself.

When to consider open-weight alternatives: Batch processing, cost-sensitive agent loops, and scenarios where data sovereignty matters. GLM-5 and MiniMax M2.5 are both within striking distance of Sonnet on coding benchmarks at a fraction of the cost. The trade-off is consistency — Claude models tend to fail more gracefully than their open-weight competitors.

The migration path from Sonnet 4.5: Update the model ID to claude-sonnet-4-6. Be aware that assistant message prefilling is now a breaking change (returns a 400 error), and the default effort level is high, which increases latency. Set effort: "low" explicitly if you need speed parity with 4.5.
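
A minimal sketch of that migration, assuming the Anthropic TypeScript SDK and taking the effort setting at face value from the notes above (its exact parameter name and placement may differ in the shipped API):

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Migrating a Sonnet 4.5 call site: new model ID, no assistant prefill,
// and an explicit low effort setting for latency parity with 4.5.
const params = {
  model: "claude-sonnet-4-6", // was the Sonnet 4.5 model ID
  max_tokens: 2048,
  // Taken from the migration notes above; treat the exact shape as an assumption.
  effort: "low",
  messages: [
    // Don't end the list with an assistant turn: prefilling reportedly returns a 400 now.
    { role: "user" as const, content: "Refactor this function to remove the duplicated validation." },
  ],
};

const response = await client.messages.create(params);
console.log(response.content);
```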

The broader pattern is clear: the gap between Anthropic’s own model tiers is compressing faster than the gap between labs. When the mid-tier model beats the previous flagship, the pricing and naming conventions start to feel like legacy artefacts. Sonnet 4.6 isn’t a compromise — it’s the new default for anyone who isn’t specifically hunting for the last few percentage points of reasoning depth.


Written by

Daniel Dewhurst

Lead AI Solutions Engineer building with AI, Laravel, TypeScript, and the craft of software.