
OpenAI's WebSocket mode: persistent connections for agentic workloads


Every time an agentic coding assistant calls a tool, the typical flow looks something like: send an HTTP request, wait for the response, parse the tool call, execute it locally, then send another HTTP request with the result. Repeat twenty times and you’ve spent a non-trivial chunk of wall-clock time just establishing connections.

OpenAI’s Responses API now supports a WebSocket mode that keeps a persistent connection open across turns. For workflows with 20+ sequential tool calls, OpenAI reports roughly 40% faster end-to-end execution. That’s not a model speedup — it’s pure infrastructure overhead disappearing.

What WebSocket mode actually is

You open a single WebSocket connection to wss://api.openai.com/v1/responses and send response.create events over it. The message payload mirrors the normal Responses API body — same model, input, tools fields — minus transport-specific options like stream and background (streaming is implicit over WebSockets).

The key mechanism is previous_response_id chaining. After the first response completes, you continue the conversation by sending only new input items (tool outputs, the next user message) plus the ID of the prior response. The server keeps the most recent response state in a connection-local in-memory cache, so continuing from it skips the work of rehydrating context from storage.

```python
from websocket import create_connection  # pip install websocket-client
import json
import os

ws = create_connection(
    "wss://api.openai.com/v1/responses",
    header=[f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}"],
)

# First turn
ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-4o",
    "input": [{
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": "Find the bug in fizz_buzz()"}],
    }],
    "tools": [{"type": "function", "name": "read_file"}],  # parameters schema elided
}))

# Read streaming events until response.done
response_id = None
while True:
    event = json.loads(ws.recv())
    if event["type"] == "response.done":
        response_id = event["response"]["id"]
        break
```

Continuing is where the latency savings kick in:

```python
# Second turn — only send the new tool output
ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-4o",
    "previous_response_id": response_id,
    "input": [{
        "type": "function_call_output",
        "call_id": "call_abc",
        "output": "def fizz_buzz(n):\n    ...",
    }],
    "tools": [{"type": "function", "name": "read_file"}],  # parameters schema elided
}))
```

No full conversation replay. No new TCP handshake. Just the delta.
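Putting the two turns together, the whole tool loop can be sketched over one connection. This is a sketch under assumptions: the helper names (`run_turn`, `run_agent_loop`, `execute_tool`) are hypothetical, and the event shapes follow the examples above rather than a documented schema.

```python
import json

def run_turn(ws, payload):
    """Send one response.create and read events until response.done,
    returning the final response object."""
    ws.send(json.dumps(payload))
    while True:
        event = json.loads(ws.recv())
        if event["type"] == "response.done":
            return event["response"]

def run_agent_loop(ws, first_payload, execute_tool, max_turns=20):
    """Ping-pong between model and tools on one persistent connection,
    sending only the delta (tool outputs + previous_response_id) after
    the first turn. `execute_tool` maps a function_call item to its
    string output."""
    payload = first_payload
    for _ in range(max_turns):
        response = run_turn(ws, payload)
        calls = [item for item in response.get("output", [])
                 if item.get("type") == "function_call"]
        if not calls:
            return response  # no more tool calls: final answer
        payload = {
            "type": "response.create",
            "model": first_payload["model"],
            "previous_response_id": response["id"],
            "tools": first_payload.get("tools", []),
            "input": [{
                "type": "function_call_output",
                "call_id": call["call_id"],
                "output": execute_tool(call),
            } for call in calls],
        }
    raise RuntimeError("max_turns exceeded")
```

Each iteration reuses the same socket and sends only new input items, which is exactly where the per-turn savings come from.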

Where it makes a difference

The performance story has a clear inflection point. For a single-turn request, WebSocket mode saves maybe 3ms over HTTP+SSE — irrelevant. But the gains compound with each additional turn in a chain.

| Scenario | Estimated improvement |
| --- | --- |
| Single turn | Negligible (~3ms) |
| 5 tool calls | ~10-15% |
| 20+ tool calls | ~40% |

The sweet spot is agentic workflows: coding assistants that read files, propose edits, run tests, and iterate. Orchestration loops where a model calls external APIs repeatedly. Any pattern where the model and tools are ping-ponging back and forth many times before producing a final result.

For a straightforward chatbot with one user message and one response? Stick with HTTP streaming. The added complexity of managing a WebSocket connection isn’t worth 3ms.

The warmup trick

There’s a subtle feature worth knowing about. You can send a response.create with generate: false to pre-load tools, instructions, and context before the real request:

```python
# Warm up — no generation, just prepare state
ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-4o",
    "generate": False,
    "input": initial_context,
    "tools": tool_definitions,
    "instructions": system_prompt,
}))

# The warmup returns a response ID you can chain from
warmup_event = json.loads(ws.recv())
warmup_id = warmup_event["response"]["id"]

# Now the real turn starts faster
ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-4o",
    "previous_response_id": warmup_id,
    "input": [{"type": "message", "role": "user",
               "content": [{"type": "input_text", "text": "Start coding"}]}],
}))
```

This front-loads the setup work. If you know what tools and instructions you’ll use before the user’s message arrives, warmup can shave time off the first real turn’s time-to-first-token.

Context management and compaction

Long-running agent sessions eventually hit context limits. WebSocket mode supports two compaction patterns.

Server-side compaction is the simpler path. Enable context_management with a compact_threshold, and the server handles compaction automatically during normal generation. You don’t change anything about your WebSocket flow — just keep chaining with previous_response_id as usual.
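As a sketch, enabling server-side compaction is just one extra field on the payload you already send. The field names follow the prose above (`context_management` with a `compact_threshold`), but the exact schema, and the threshold value and its unit, are assumptions for illustration.

```python
import json

response_id = "resp_123"  # hypothetical ID of the prior turn in the chain

payload = {
    "type": "response.create",
    "model": "gpt-4o",
    "previous_response_id": response_id,
    # Assumed shape: the server compacts automatically once context
    # crosses this threshold; no change to the chaining flow.
    "context_management": {"compact_threshold": 100_000},
    "input": [{"type": "message", "role": "user",
               "content": [{"type": "input_text", "text": "Keep going."}]}],
}
# ws.send(json.dumps(payload))  # then keep chaining as usual
```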

Standalone compaction via the /responses/compact endpoint gives you more control. You make an HTTP call to compact your accumulated context, then start a fresh chain on the WebSocket using the compacted output:

```python
from openai import OpenAI

client = OpenAI()

# Compact via HTTP
compacted = client.responses.compact(
    model="gpt-4o",
    input=long_input_items,  # the accumulated conversation items
)

# Start a new chain with compacted context
ws.send(json.dumps({
    "type": "response.create",
    "model": "gpt-4o",
    "input": [
        *compacted.output,
        {"type": "message", "role": "user",
         "content": [{"type": "input_text", "text": "Continue from here."}]},
    ],
}))
```

Note the absence of previous_response_id — compaction starts a new chain. Don’t prune the compacted window; pass it as-is.

SDK support

You don’t have to manage raw WebSocket connections yourself. The OpenAI Python SDK (v2.22.0+) has built-in WebSocket support, and the Agents SDK makes it even simpler:

```python
import asyncio

from agents import Agent, responses_websocket_session

agent = Agent(name="Coder", instructions="You are a coding assistant.")

async def main():
    async with responses_websocket_session() as ws:
        first = ws.run_streamed(agent, "Read main.py")
        async for event in first.stream_events():
            pass

        second = ws.run_streamed(
            agent, "Now add error handling.",
            previous_response_id=first.last_response_id,
        )
        async for event in second.stream_events():
            pass

asyncio.run(main())
```

Or enable it globally with one line:

```python
from agents import set_default_openai_responses_transport

set_default_openai_responses_transport("websocket")
```

On the JavaScript side, Vercel Labs has published ai-sdk-openai-websocket-fetch — a drop-in fetch replacement that transparently routes Responses API calls through a persistent WebSocket. Their demo app compares HTTP vs WebSocket time-to-first-byte side by side.

Sharp edges

A few things to watch for:

One response at a time. A single WebSocket connection processes requests sequentially. No multiplexing. If you need parallel runs, open multiple connections.
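A minimal sketch of that fan-out, assuming caller-supplied `open_connection` and `run_task` hooks (hypothetical names): one connection per concurrent session, since a single socket serializes requests.

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(tasks, open_connection, run_task, max_workers=4):
    """Run independent agent sessions in parallel by giving each one
    its own WebSocket connection. A sketch, not a pooling library."""
    def worker(task):
        ws = open_connection()  # fresh connection for this session
        try:
            return run_task(ws, task)
        finally:
            ws.close()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, tasks))
```

Within each connection, turns still run one at a time; the parallelism comes entirely from having several sockets open.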

60-minute connection limit. The server closes the socket after an hour. Your client needs reconnection logic. If you’re using store=true, you can resume the chain on a new connection with the last previous_response_id. With store=false or Zero Data Retention, the in-memory cache dies with the connection — you’ll need to replay full context or use a compacted window.
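The store=true resume path can be sketched like this. It assumes a duck-typed client whose `send` raises `ConnectionError` when the socket is gone (substitute your library's closed-socket exception), and a hypothetical `make_ws` factory; it is a sketch, not production retry logic.

```python
import json

def send_with_resume(ws, make_ws, payload, last_response_id):
    """Send a turn; if the socket has been closed (e.g. past the
    60-minute cap), open a fresh connection and resume the chain
    from the last response ID. Requires store=true so the chain
    survives the disconnect."""
    try:
        ws.send(json.dumps(payload))
        return ws
    except ConnectionError:
        ws = make_ws()  # reconnect after the server closed the old socket
        resumed = dict(payload, previous_response_id=last_response_id)
        ws.send(json.dumps(resumed))
        return ws
```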

Cache eviction on errors. If a turn fails (4xx or 5xx), the server evicts that previous_response_id from the connection cache. You can’t retry by re-sending the same continuation — you’ll get previous_response_not_found. Plan for fallback to full-context replay.

store=false means no safety net. With Zero Data Retention, response state only exists in the connection’s memory. Disconnect unexpectedly and it’s gone. This is by design — ZDR compatibility is a selling point — but it means your reconnection strategy matters.

WebSocket mode vs. the Realtime API

These are different things solving different problems. The Realtime API (wss://api.openai.com/v1/realtime) is designed for audio and multimodal streaming — voice assistants, real-time transcription, that sort of thing. WebSocket mode for the Responses API is about text-based tool-calling workflows where you want lower latency across many turns.

If you’re building a voice interface, you want the Realtime API. If you’re building a coding agent that makes dozens of tool calls per task, WebSocket mode is the one to reach for.

When to use it

The decision is straightforward. Count your expected tool calls per session:

  • 1-3 tool calls: HTTP streaming. Simpler client code, nearly identical performance.
  • 5-10 tool calls: Either works. WebSocket mode starts showing measurable gains but the complexity trade-off is marginal.
  • 10+ tool calls: WebSocket mode. The latency savings compound and the persistent connection starts paying for itself.
  • 20+ tool calls: WebSocket mode, without question. You’re leaving 40% performance on the table otherwise.
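The rule of thumb above fits in a one-line chooser; the thresholds come from the list, and the function name is invented for illustration.

```python
def pick_transport(expected_tool_calls: int) -> str:
    """Map expected tool calls per session to a transport, following
    the thresholds above. A sketch of the decision, not an API."""
    return "websocket" if expected_tool_calls >= 10 else "http"
```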

The OpenAI Agents SDK makes the switch almost trivial — one line to change the transport globally. If you’re already using the Agents SDK for agentic workflows, there’s very little reason not to try it.


Written by

Daniel Dewhurst

Lead AI Solutions Engineer building with AI, Laravel, TypeScript, and the craft of software.
