GPT-5.5 cyber capabilities: AISI evaluation

The first end-to-end AI solve of a realistic cyber range looked like an outlier. A few weeks later, it looks more like the beginning of a pattern.

On April 30, the UK AI Security Institute published its evaluation of OpenAI’s GPT-5.5 cyber capabilities. The headline is not that GPT-5.5 narrowly beats Claude Mythos Preview. It probably does not, at least not in any clean leaderboard sense. The confidence intervals overlap, the scaffolds matter, and AISI’s own results have already shifted as new checkpoints and grading fixes arrived.

The important point is simpler: a second frontier model, from a different lab, reached the same class of autonomous cyber performance. That makes “AI can complete a multi-step attack simulation end-to-end” harder to treat as a one-model anomaly.

What AISI measured

AISI evaluated GPT-5.5 in two ways.

First, it ran a suite of 95 narrow cyber tasks across four difficulty tiers. These are capture-the-flag style tasks that isolate specific skills: reverse engineering, web exploitation, cryptography, vulnerability research, and related work. AISI says the easier tasks have been saturated since at least February 2026, so the useful signal is now in the practitioner and expert tiers.

Second, it tested GPT-5.5 on cyber ranges: simulated network environments where an agent has to chain many steps together. The main one is “The Last Ones”, a 32-step corporate network attack simulation spanning reconnaissance, credential theft, lateral movement, a CI/CD supply-chain pivot, and exfiltration. AISI estimates the full chain would take a human expert around 20 hours.

That distinction matters. A model solving isolated CTFs is useful evidence of technical skill. A model sustaining a long attack chain is evidence of planning, persistence, tool use, and recovery from mistakes.

The numbers worth keeping

The AISI post has a lot of detail. These are the results that change my view:

On expert-level advanced cyber tasks, GPT-5.5 scored 71.4% at a 50M token budget. That is ahead of Mythos Preview at 68.6%, GPT-5.4 at 52.4%, and Claude Opus 4.7 at 48.6%, though the error bars overlap.
GPT-5.5 solved the rust_vm reverse-engineering challenge in 10 minutes and 22 seconds, for $1.73 of API usage. AISI says a Crystal Peak expert playtester took roughly 12 hours using Binary Ninja, gdb, Python, and Z3.
GPT-5.5 completed The Last Ones end-to-end in 2 of 10 attempts in the April 30 report. AISI noted this differed from OpenAI’s system-card number because AISI found and adjudicated a grading issue.
A later AISI update on May 13 reported GPT-5.5 solving The Last Ones in 3 of 10 attempts. The exact count is still moving, but the direction is not: GPT-5.5 is in the same tier as Mythos on this range.
GPT-5.5 did not solve Cooling Tower in the April 30 report. The industrial-control range remains harder, and AISI cautions that GPT-5.5 got stuck before reaching the operational-technology-specific parts.
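
To see why 2 of 10 versus 3 of 10 is not a meaningful gap, here is a minimal sketch that puts an approximate 95% interval around each count. The Wilson score interval is my choice for illustration; I am not claiming it is the method AISI uses for its own error bars.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half

# The Last Ones end-to-end solves quoted above:
# 2 of 10 (April 30 report) and 3 of 10 (May 13 update).
april = wilson_interval(2, 10)
may = wilson_interval(3, 10)

# Two intervals overlap when each one starts before the other ends.
overlap = april[0] <= may[1] and may[0] <= april[1]

print(f"2/10: {april[0]:.2f} to {april[1]:.2f}")
print(f"3/10: {may[0]:.2f} to {may[1]:.2f}")
print("intervals overlap:", overlap)
```

With only ten attempts, the intervals run from roughly 6% to 51% and from roughly 11% to 60%. One extra solve moves the point estimate, not the conclusion.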

The rust_vm result is the one that sticks. Not because it proves the model is a magic hacker. It does not. The model was running in a controlled evaluation scaffold with Bash and Python tools in a Kali Linux container. But the time compression is hard to ignore: a task calibrated in expert-hours became a cheap agent run.

That is the operational story. If the cost of serious reverse engineering drops from “block off a day with a specialist” to “run many attempts and keep the ones that work”, then vulnerability discovery starts to look less like a scarce craft bottleneck and more like a throughput problem.
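
The arithmetic behind that shift can be sketched in a few lines. The time and cost figures are the rust_vm numbers quoted above. The per-run success rate is hypothetical, chosen only for illustration, and the sketch assumes that runs are independent and that every run costs about the same as the one reported. Neither assumption comes from AISI.

```python
# Figures quoted above for the rust_vm task.
AGENT_SECONDS = 10 * 60 + 22       # 10 minutes 22 seconds
AGENT_COST_USD = 1.73              # API usage for the reported run
EXPERT_SECONDS = 12 * 60 * 60      # "roughly 12 hours" for the expert playtester

# How many times faster the agent run was than the expert playtest.
speedup = EXPERT_SECONDS / AGENT_SECONDS

def p_any_success(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent runs succeeds."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    return 1 - (1 - p_single) ** attempts

# HYPOTHETICAL per-run success rate, not a measured value.
ASSUMED_P = 0.2

print(f"speed-up over the expert: about {speedup:.0f}x")
for k in (1, 5, 10, 20):
    chance = p_any_success(ASSUMED_P, k)
    # Assumes every run costs the same as the single reported run.
    cost = k * AGENT_COST_USD
    print(f"{k:>2} attempts: P(at least one success) = {chance:.2f}, cost ~ ${cost:.2f}")
```

Under those assumptions, even a run that fails four times out of five becomes close to reliable after a few tens of dollars of retries. That is what "throughput problem" means in practice.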

Do not over-read the cyber range

The AISI result is serious, but it is narrower than the most breathless reaction suggests.

AISI is explicit about the limits. The ranges are controlled, intentionally vulnerable, and do not include active defenders, defensive tooling, or penalties for noisy behaviour. The agent starts with network access and a specific objective. That is not the same thing as attacking a mature, monitored enterprise from the public internet.

The Cooling Tower result is a useful brake on the story. In the April 30 evaluation, GPT-5.5 could not complete the 7-step industrial-control simulation, and AISI said no model had yet solved it. A later AISI trend update reported that a newer Mythos Preview checkpoint had completed both ranges, including Cooling Tower in 3 of 10 attempts, but that reinforces a different point: the frontier is moving too quickly for single evaluation posts to stay definitive for long.

AISI’s March paper, Measuring AI Agents’ Progress on Multi-Step Cyber Attack Scenarios, showed the same pattern before GPT-5.5 and Mythos landed. Performance scaled with inference-time compute, and AISI had not observed a plateau. At 10M tokens, average progress on the corporate range rose from 1.7 steps for GPT-4o in August 2024 to 9.8 steps for Claude Opus 4.6 in February 2026. The best single run reached 22 of 32 steps.

GPT-5.5 and Mythos do not invalidate that trend. They extend it.
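
For a sense of scale, the March figures can be restated as fractions of the 32-step range. This is plain arithmetic on the numbers quoted above, not a fitted trend; two data points say nothing reliable about the shape of the curve.

```python
RANGE_STEPS = 32  # length of the corporate range, as quoted above

# Average steps completed at a 10M token budget, plus the best single run.
results = {
    "GPT-4o (Aug 2024)": 1.7,
    "Claude Opus 4.6 (Feb 2026)": 9.8,
    "best single run": 22,
}

# Share of the full attack chain each result represents.
fractions = {name: steps / RANGE_STEPS for name, steps in results.items()}

# Months between August 2024 and February 2026.
months = (2026 - 2024) * 12 + (2 - 8)

# Ratio of the two average-progress figures.
growth = results["Claude Opus 4.6 (Feb 2026)"] / results["GPT-4o (Aug 2024)"]

for name, share in fractions.items():
    print(f"{name}: {share:.0%} of the range")
print(f"average progress grew about {growth:.1f}x in {months} months")
```

Average progress went from about 5% of the chain to about 31% in 18 months, with the best single run at about 69%. The end-to-end solves by GPT-5.5 and Mythos are the next points on that line.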

Safeguards are now part of the capability story

The most uncomfortable paragraph in AISI’s post is not about a model solving a reverse-engineering task. It is the safeguard result.

AISI says it found a universal jailbreak that produced violative responses across all malicious cyber queries OpenAI provided, including multi-turn agentic settings. The attack took six hours of expert red-teaming to develop. OpenAI then updated its safeguard stack, but AISI says a configuration issue meant it could not verify the effectiveness of the final configuration it received.

That does not mean public GPT-5.5 users can reproduce AISI’s results. OpenAI’s GPT-5.5 launch post says the model is treated as High capability for cybersecurity under its Preparedness Framework, not Critical, and that API deployments come with additional safety and security requirements. OpenAI also launched Trusted Access for Cyber, where verified defenders can get fewer refusals for legitimate security work, with more specialised access for authorised red teaming and penetration testing.

But the control plane is now the product. Once the underlying model can do this work, the difference between helpful defence and harmful misuse increasingly depends on identity, monitoring, policy enforcement, rate limits, tool access, and incident response. The model is only one layer.
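
To make the layering concrete, here is a toy sketch of a gate that sits in front of a model's tool calls and combines three of those layers: identity tier, tool access, and a rate limit, with every decision written to an audit log. All names, tiers, and limits are hypothetical. None of this reflects how OpenAI's safeguard stack or Trusted Access for Cyber is actually built.

```python
from dataclasses import dataclass, field

# HYPOTHETICAL tiers and tool names, for illustration only.
ALLOWED_TOOLS = {
    "default": {"code_review"},
    "verified_defender": {"code_review", "vuln_triage", "exploit_validation"},
}
MAX_CALLS_PER_WINDOW = 3  # hypothetical limit; window reset is omitted here


@dataclass
class Gate:
    audit_log: list = field(default_factory=list)
    calls: dict = field(default_factory=dict)

    def authorize(self, user: str, tier: str, tool: str) -> bool:
        """Check tool access, then the rate limit, and log the decision."""
        used = self.calls.get(user, 0)
        if tool not in ALLOWED_TOOLS.get(tier, set()):
            decision = "deny:tool_not_allowed"
        elif used >= MAX_CALLS_PER_WINDOW:
            decision = "deny:rate_limited"
        else:
            decision = "allow"
            self.calls[user] = used + 1
        # Denials are logged too: they are the signal monitoring needs.
        self.audit_log.append((user, tier, tool, decision))
        return decision == "allow"


gate = Gate()
print(gate.authorize("alice", "default", "exploit_validation"))
print(gate.authorize("bob", "verified_defender", "exploit_validation"))
print(gate.audit_log)
```

The point of the sketch is that the interesting decisions happen outside the model. The same request is allowed or denied depending on who is asking and what they have already done, and the log is what makes incident response possible afterwards.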

The practical takeaway for defenders

The defensive answer is not to panic. It is to assume the cost curve has changed.

The NCSC’s March blog on frontier AI and cyber defence puts it plainly: defenders should assume at least some attackers already have access to capable AI tools. It also points out the defender’s advantage. Defenders know their own systems, can instrument them, can share information, and can shape the environment so attacks become harder to complete quietly.

That advantage disappears when the basics are weak.

For most teams, GPT-5.5’s AISI result should move five things up the queue:

Keep an accurate asset inventory, especially internet-facing services and forgotten internal systems.
Map owners for critical systems before an emergency, not during one.
Shorten the path from vulnerability disclosure to patched production.
Improve logging and alert triage so noisy automated activity is visible.
Use AI defensively for code review, vulnerability triage, detection engineering, and patch validation, but wrap it in access control and audit trails.
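
The first three items reduce to questions a team can ask of its own inventory today. Below is a minimal sketch, assuming the inventory can be exported as simple records. The field names and the 14-day patch threshold are hypothetical placeholders, not recommendations from AISI or the NCSC.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Asset:
    # HYPOTHETICAL inventory fields, for illustration only.
    name: str
    internet_facing: bool
    owner: Optional[str]
    oldest_unpatched_disclosure: Optional[date]  # None if fully patched


def triage(assets: list, today: date, max_patch_age_days: int = 14) -> list:
    """Flag ownerless assets and overdue patches, internet-facing first."""
    findings = []
    for a in assets:
        if a.owner is None:
            findings.append((a.name, "no owner"))
        if a.oldest_unpatched_disclosure is not None:
            age = (today - a.oldest_unpatched_disclosure).days
            if age > max_patch_age_days:
                overdue = age - max_patch_age_days
                findings.append((a.name, f"patch overdue by {overdue} days"))
    exposed = {a.name for a in assets if a.internet_facing}
    # Stable sort: findings on internet-facing assets move to the front.
    findings.sort(key=lambda f: f[0] not in exposed)
    return findings


inventory = [
    Asset("intranet-wiki", False, None, None),
    Asset("vpn-gateway", True, "netops", date(2026, 1, 1)),
]
for name, issue in triage(inventory, today=date(2026, 1, 31)):
    print(f"{name}: {issue}")
```

A report like this is only as good as the inventory behind it, which is why the inventory comes first on the list.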

The NCSC’s follow-up on a possible vulnerability patch wave is the right frame. If AI raises the rate at which vulnerabilities are found, the bottleneck moves to update management. “Update by default” will not fit every safety-critical or operational-technology environment, but slow, ownerless patching becomes more expensive when attackers can run more attempts at lower skill cost.

The UK government’s Cyber Security Breaches Survey 2025/2026 reported that 43% of businesses identified a breach or attack in the previous 12 months. Among affected businesses, 29% saw breaches or attacks at least weekly. This is not a future-risk conversation for most organisations. It is a capacity problem meeting a new automation layer.

What to watch next

The next useful evaluation will not be another clean CTF score. With the best models already solving around 70% of the expert tier, the narrow suite is running out of headroom. The useful questions are messier:

How do agents perform when the environment has realistic monitoring and active defenders?
How much does performance improve with better scaffolds, specialist tools, and repeated attempts?
How much of the capability diffuses into cheaper models or open-weight systems?
Can defenders use the same models to close vulnerabilities faster than attackers can exploit them?
Do access controls and safeguards hold up under sustained, expert-level attempts to route around them?

AISI’s May 13 update says frontier models’ 80%-reliability cyber time horizon had been doubling every 4.7 months since late 2024 in its narrow suite, and that GPT-5.5 and Mythos Preview exceeded that trend. AISI is careful not to call that a forecast. The data is noisy, the longest tasks are few, and the benchmark itself is running out of headroom.

That caveat is sensible. It is also cold comfort.
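
To make the 4.7-month figure tangible, this sketch converts a doubling time into a growth multiplier over a given period. It only restates the doubling rate AISI reported for the past. It is not a projection, and AISI is explicit that the trend should not be read as one.

```python
DOUBLING_MONTHS = 4.7  # doubling time quoted above from AISI's May 13 update


def horizon_multiplier(months: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Growth factor implied by a constant doubling time over `months`."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2 ** (months / doubling_months)


per_year = horizon_multiplier(12)
print(f"implied growth over 12 months: about {per_year:.1f}x")
```

A 4.7-month doubling time works out to roughly a 5.9x increase per year in the length of task a model can complete at 80% reliability. Even heavily discounted for noise, that is a steep curve.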

The April 30 GPT-5.5 evaluation is not interesting because it crowns OpenAI the cyber winner for the week. It is interesting because it makes the Mythos result look less isolated. Two labs now have models that can, under controlled conditions, complete long cyber attack chains that recently looked out of reach.

The people who should react first are not attackers. They already have incentives to test the frontier. The people who need to move are defenders with brittle patching, poor asset visibility, and logs nobody reads. GPT-5.5 did not invent those problems. It makes them harder to ignore.