Gemini 3.1 Pro: Google's reasoning leap
Three months ago, Gemini 3 Pro scored 31.1% on ARC-AGI-2. Today, Gemini 3.1 Pro scores 77.1% on the same benchmark. That’s a 2.5x improvement in abstract reasoning capability in a single quarter — the kind of jump that makes you re-examine your assumptions about how fast these models are actually improving.
Google announced Gemini 3.1 Pro on February 19, 2026, positioning it as the upgraded core intelligence behind last week’s Gemini 3 Deep Think release. It’s a notable versioning choice too: Google has historically incremented Gemini models in 0.5 steps. The 0.1 step signals a refinement of the existing architecture rather than a ground-up rebuild — but the benchmark gains suggest the refinement runs deep.
The architecture: adaptive compute
The technical headline is a hybrid transformer-decoder backbone with what Google DeepMind calls “adaptive compute pathways.” In practice, this means the model dynamically allocates reasoning depth based on problem complexity, controlled by a thinking_level parameter with three settings: low, medium, and high.
The high setting triggers deeper internal simulation chains for problems that require multi-hop logic or constraint satisfaction. Think of it as the model deciding how hard to think before it answers — a useful knob for developers who need to trade off latency against reasoning quality.
```json
{
  "model": "gemini-3.1-pro",
  "thinking_level": "high",
  "contents": [
    { "role": "user", "parts": [{ "text": "your prompt here" }] }
  ]
}
```
The medium level is new in 3.1 Pro, providing a middle ground that didn’t exist in Gemini 3. If thinking_level isn’t specified, it defaults to high — Google is optimising for quality over speed by default, which is the right call for a flagship model.
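For developers wiring this up, here is a minimal sketch of per-request control. It mirrors the request shape above but follows the existing Gemini REST convention of putting the model id in the URL rather than the body; the endpoint path, and whether it serves 3.1 Pro unchanged, are assumptions to check against the official docs.

```python
# Sketch: choosing a thinking_level per request. The endpoint path and the exact
# placement of the thinking_level field are assumptions, not confirmed API details.
import os
import requests

API_URL = "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-pro:generateContent"

def ask(prompt: str, thinking_level: str | None = None) -> dict:
    body = {"contents": [{"role": "user", "parts": [{"text": prompt}]}]}
    if thinking_level is not None:
        # Omit the field to take the API default, which Google says is "high".
        body["thinking_level"] = thinking_level
    resp = requests.post(
        API_URL,
        headers={"x-goog-api-key": os.environ["GEMINI_API_KEY"]},
        json=body,
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()

# Latency-sensitive path: accept shallower reasoning for a faster answer.
quick = ask("Summarise this changelog in two sentences.", thinking_level="medium")

# Hard problem: leave the default ("high") in place and let the model think.
thorough = ask("Schedule these 12 tasks across 3 machines without violating the deadlines.")
```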
The model also introduces “thought signatures” — persistent reasoning chains that maintain context integrity across multi-turn interactions. The practical implication: agentic workflows that span many tool calls should see fewer context-degradation issues in long sessions.
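Google hasn’t published the wire format for thought signatures in the launch post, so the shape below is an assumption based on how the feature is described: the model’s earlier turns come back carrying an opaque thought_signature value, and the client replays those turns verbatim on the next request instead of reconstructing the history from plain text.

```python
# Sketch only: the thought_signature field name and its placement on the model's
# parts are assumptions drawn from the announcement's description, not a
# documented schema.
history = [
    {"role": "user", "parts": [{"text": "Find flights LHR to SFO under £600 next Tuesday."}]},
    {
        "role": "model",
        "parts": [
            # Replay this part exactly as it came back, opaque signature included.
            {
                "text": "Checking fares for Tuesday departures...",
                "thought_signature": "<opaque token returned by the previous response>",
            }
        ],
    },
]

# The practical rule: append to the history you received rather than rebuilding it,
# so the model's reasoning chain survives across turns and tool calls.
next_request = {
    "model": "gemini-3.1-pro",
    "thinking_level": "high",
    "contents": history + [{"role": "user", "parts": [{"text": "Widen the search to Wednesday."}]}],
}
```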
On the spec sheet: 1 million token context window, 64K output tokens, natively multimodal across text, images, audio, video, PDFs (up to 1,000 pages), and code. These specs match Gemini 3 Pro — the improvements are in what the model does with those tokens, not how many it accepts.
Benchmarks: leading on 13 of 16
Google reports that 3.1 Pro posts leading scores on 13 of the 16 benchmarks they evaluated. The numbers that stand out:
| Benchmark | Gemini 3 Pro | Gemini 3.1 Pro | Best competitor |
|---|---|---|---|
| ARC-AGI-2 | 31.1% | 77.1% | Claude Opus 4.6: 68.8% |
| GPQA Diamond | — | 94.3% | GPT-5.2: 92.4% |
| SWE-Bench Verified | — | 80.6% | Claude Opus 4.6: 80.8% |
| BrowseComp | 59.2% | 85.9% | — |
| APEX-Agents | 18.4% | 33.5% | — |
| LiveCodeBench Pro | — | 2887 Elo | — |
| HLE (with tools) | — | 51.4% | Claude Opus 4.6: 53.1% |
| Terminal-Bench 2.0 | 56.9% | 68.5% | GPT-5.3-Codex: 77.3% |
The ARC-AGI-2 result is the headline, and deservedly so. This benchmark evaluates novel pattern recognition — problems the model has never seen before, requiring genuine abstraction. Going from 31.1% to 77.1% in three months isn’t incremental improvement. It’s a regime change, and it pushes Gemini firmly past Claude Opus 4.6 (68.8%) on this particular axis.
GPQA Diamond — PhD-level science questions — hits 94.3%, a new all-time high that edges past GPT-5.2’s previous record of 92.4%. The BrowseComp and APEX-Agents gains (45% and 82% relative improvement respectively) suggest the adaptive compute architecture is paying particular dividends in agentic search and multi-step workflows.
Where 3.1 Pro doesn’t lead is instructive. Claude Opus 4.6 edges it by 0.2 points on SWE-Bench Verified (80.8% vs 80.6%) — within noise, but Anthropic can still claim the coding crown by a whisker. On Humanity’s Last Exam with tools, Opus leads 53.1% to 51.4%. And GPT-5.3-Codex dominates Terminal-Bench 2.0 at 77.3% versus 68.5%, suggesting OpenAI’s Codex-merged model has a genuine edge in terminal-based coding tasks.
Pricing: the same price for 2x the reasoning
Perhaps the most developer-friendly detail: Gemini 3.1 Pro ships at identical pricing to Gemini 3 Pro.
| Context | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Up to 200K tokens | $2.00 | $12.00 |
| Over 200K tokens | $4.00 | $18.00 |
At $2/$12, Gemini 3.1 Pro is significantly cheaper than Claude Opus 4.6 ($15/$75) and modestly cheaper than Claude Sonnet 4.6 ($3/$15), while trading benchmark leads with both. Context caching runs $0.20–$0.40 per million tokens, with cache storage billed at $4.50 per million tokens per hour, and search grounding gives you 5,000 free queries per month before kicking in at $14 per 1,000 queries.
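To make the tiers concrete, here is the arithmetic for two representative requests, assuming (as with previous Gemini pricing) that the whole request is billed at the tier its prompt size falls into:

```python
# Worked examples of the tiered pricing above (USD per million tokens).
def gemini_31_pro_cost(input_tokens: int, output_tokens: int) -> float:
    # Prompts up to 200K tokens bill at $2/$12; larger prompts at $4/$18.
    in_rate, out_rate = (2.00, 12.00) if input_tokens <= 200_000 else (4.00, 18.00)
    return (input_tokens / 1_000_000) * in_rate + (output_tokens / 1_000_000) * out_rate

# A typical RAG-style call: 50K tokens in, 2K out.
print(gemini_31_pro_cost(50_000, 2_000))   # 0.10 + 0.024 = $0.124

# A long-context call: 600K tokens in, 8K out, billed at the higher tier.
print(gemini_31_pro_cost(600_000, 8_000))  # 2.40 + 0.144 = $2.544
```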
The pricing story matters because it inverts the usual trade-off. Normally, a lab ships a more capable model at a higher price point — you pay for the improvement. Google is eating the cost of the capability upgrade entirely, which either reflects confidence in their infrastructure efficiency or a strategic decision to compete on price while they have the benchmark lead.
Availability
Gemini 3.1 Pro is launching in preview across:
- API access: AI Studio, Vertex AI, Gemini CLI, Android Studio, and the new Antigravity platform
- Consumer: Gemini app (Pro and Ultra plans), NotebookLM (Pro and Ultra)
- Third-party: GitHub Copilot, Visual Studio, Visual Studio Code
The “preview” label means Google is still validating performance in production, particularly around agentic workflows. General availability is expected to follow.
The competitive picture
Gemini 3.1 Pro is the third major frontier model release in February 2026, following Claude Opus 4.6 and GPT-5.3-Codex on February 5. The release cadence is remarkable — three labs shipping flagship-tier models in the same two-week window.
What’s emerging is a pattern where no single model dominates across all dimensions:
- Gemini 3.1 Pro leads on abstract reasoning (ARC-AGI-2), scientific knowledge (GPQA Diamond), and agentic benchmarks (BrowseComp, APEX-Agents)
- Claude Opus 4.6 leads on coding tasks (SWE-Bench), complex tool use (HLE with tools), and office productivity (GDPval-AA)
- GPT-5.3-Codex leads on terminal coding (Terminal-Bench 2.0) and code-specific tasks (SWE-Bench Pro)
The era of one model winning everything appears to be over. The practical implication for teams building on these APIs: model routing — selecting different models for different task types — is becoming a first-class architectural concern rather than an optimisation.
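A minimal sketch of what that looks like in practice is below. The task categories are arbitrary, the model ids are placeholders based on the names used in this piece, and the right routing table should come out of your own evals rather than the public benchmarks alone.

```python
# Illustrative model router: map task categories to whichever model currently
# benchmarks best for them. Ids and categories are placeholders; revisit the
# table whenever a lab ships a new flagship.
from typing import Literal

TaskType = Literal["abstract_reasoning", "agentic_search", "coding", "terminal", "general"]

MODEL_ROUTES: dict[TaskType, str] = {
    "abstract_reasoning": "gemini-3.1-pro-preview",  # ARC-AGI-2, GPQA Diamond
    "agentic_search":     "gemini-3.1-pro-preview",  # BrowseComp, APEX-Agents
    "coding":             "claude-opus-4.6",         # SWE-Bench Verified, by 0.2 pts
    "terminal":           "gpt-5.3-codex",           # Terminal-Bench 2.0
    "general":            "gemini-3.1-pro-preview",  # cheapest of the three flagships
}

def route(task_type: TaskType) -> str:
    """Pick a model id for a task; fall back to the general-purpose default."""
    return MODEL_ROUTES.get(task_type, MODEL_ROUTES["general"])
```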
What to make of it
The ARC-AGI-2 trajectory is the number worth watching. A 2.5x improvement in three months on a benchmark specifically designed to resist training-data contamination suggests the adaptive compute approach is unlocking something real, not just better memorisation. Whether that trajectory holds through the next iteration will tell us a lot about how close we are to genuine generalisation versus increasingly sophisticated pattern matching.
The pricing decision is equally telling. Doubling performance at flat pricing only makes sense if you expect the next model to double again — you’re buying market share and developer lock-in now, banking on the capability curve continuing upward. Given what we’ve seen from all three labs this month, that seems like a reasonable bet.
For developers already on Gemini 3 Pro: switching to gemini-3.1-pro-preview in your API calls is a free upgrade with meaningful gains, particularly for reasoning-heavy and agentic workloads. For teams evaluating across providers, the honest answer is that the right model depends on the task — and the right architecture increasingly involves more than one.
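For reference, the switch itself is a one-field change against the request shape shown earlier; gemini-3.1-pro-preview is the preview id used here, so confirm it matches what your project actually has access to.

```python
# The upgrade is a one-line change to the model id; the rest of the request
# stays the same. Replace the previous Gemini 3 Pro id with the preview id.
request_body = {
    "model": "gemini-3.1-pro-preview",  # previously your Gemini 3 Pro model id
    "thinking_level": "high",
    "contents": [{"role": "user", "parts": [{"text": "your prompt here"}]}],
}
```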