
GLM-5: the open-weight model that's rewriting the frontier economics

8 min read

On February 6, a model called “Pony Alpha” appeared on OpenRouter with no attribution, no branding, and free API access. Within hours, the r/LocalLLaMA crowd had fingerprinted it — the reasoning quirks, the tool-calling style, the subtle tells of the GLM family. Five days later, Zhipu AI (now branded Z.ai) confirmed what everyone already knew: Pony Alpha was GLM-5, and it was coming for the frontier.

The stealth launch was a nice piece of theatre, but the model itself is the real story. GLM-5 is a 744-billion parameter mixture-of-experts model, released under the MIT license, that trades blows with Claude Opus 4.5 and GPT-5.2 across major benchmarks — at a fraction of the cost. It’s the strongest open-weight model available today, and it’s shifting the conversation about what “frontier” actually means when anyone can download the weights.

What GLM-5 actually is

GLM-5 uses a mixture-of-experts (MoE) architecture with 256 total experts, activating 8 per token. That gives it roughly 40 billion active parameters per forward pass despite the 744B total — a favourable ratio for inference cost.

| Spec | GLM-5 | GLM-4.5 (predecessor) |
| --- | --- | --- |
| Total parameters | 744B | 355B |
| Active parameters | ~40B | 32B |
| Training data | 28.5T tokens | 23T tokens |
| Context window | 200K | 128K |
| Max output tokens | 128K | |
| License | MIT | MIT |

The jump from GLM-4.5 is roughly 2x in total scale and 1.24x in training data, with the context window stretching from 128K to 200K. Zhipu has maintained the MIT license, which means fully permissive commercial use — no registration, no usage restrictions, no community license asterisks.
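
A quick back-of-envelope shows how the active-parameter figure relates to the expert counts. Zhipu hasn't published the split between routed experts and always-on components, so treat the second number below as an inference rather than a spec:

# Back-of-envelope: how the ~40B active figure relates to the expert counts.
TOTAL_PARAMS = 744e9     # total parameters
NUM_EXPERTS = 256        # routed experts
ACTIVE_EXPERTS = 8       # experts activated per token

# If all 744B sat in routed experts, activating 8 of 256 would touch:
routed_active = TOTAL_PARAMS * ACTIVE_EXPERTS / NUM_EXPERTS
print(f"Routed experts alone: ~{routed_active / 1e9:.0f}B active")    # ~23B

# The quoted ~40B active therefore implies roughly this much always-on
# capacity (attention, embeddings, shared experts); an inference, not a spec:
always_on = 40e9 - routed_active
print(f"Implied shared parameters: ~{always_on / 1e9:.0f}B")          # ~17B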

The benchmark picture

GLM-5 posts genuinely strong numbers across reasoning, coding, and agentic benchmarks. Here’s where it lands relative to the current frontier:

| Benchmark | GLM-5 | Claude Opus 4.5 | GPT-5.2 |
| --- | --- | --- | --- |
| SWE-bench Verified | 77.8% | 80.9% | 75.4% |
| HLE (text only) | 30.5 | 28.4 | 35.4 |
| HLE (with tools) | 50.4 | 43.4 | 45.5 |
| AIME 2026 I | 92.7 | 93.3 | |
| HMMT Nov. 2025 | 96.9 | 91.7 | |
| BrowseComp | 75.9 | 65.8 | |

The BrowseComp result (agent-based web browsing) is particularly notable — GLM-5 leads all models. On Humanity’s Last Exam with tool use, it tops both Claude and GPT. The SWE-bench gap to Opus 4.5 is real but narrow at 3.1 percentage points.

Artificial Analysis also reported a record-low hallucination score on their Omniscience Index — a 35-point improvement over GLM-4.5. The model appears to be genuinely better at knowing when to say “I don’t know” rather than fabricating answers.

The honest caveat: benchmark saturation is making these comparisons increasingly meaningless. When multiple models score above 90% on AIME, the remaining differences tell you more about evaluation methodology than model capability. Real-world usage is where the gaps become tangible.

Two things that stand out technically

Slime: asynchronous reinforcement learning

Traditional RL training for language models has an ugly bottleneck. The model generates a response, the response gets evaluated, the reward signal feeds back — and during generation, the training infrastructure sits idle. Zhipu claims this wastes over 90% of training time.

Their solution is an infrastructure called Slime that decouples data generation from model training entirely. Trajectories are generated independently and asynchronously, with a technique called Active Partial Rollouts (APRIL) that enables fine-grained iteration on complex multi-step behaviours. The result is faster RL cycles and — Zhipu claims — better agentic performance because the model gets more training signal from extended task completion rather than single-turn evaluation.
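
Zhipu hasn't published Slime's internals beyond that description, but the core decoupling can be sketched as a producer/consumer pattern: rollout workers push finished trajectories into a buffer while the trainer consumes whatever is ready, so neither side idles waiting for the other. The sketch below is illustrative only, not Zhipu's code:

import queue
import random
import threading
import time

trajectory_queue = queue.Queue(maxsize=1024)   # buffer between generation and training

def rollout_worker(worker_id: int) -> None:
    """Generate trajectories asynchronously; the trainer never waits on this."""
    while True:
        steps = random.randint(1, 20)          # stand-in for a multi-step agentic task
        time.sleep(0.01 * steps)               # stand-in for slow autoregressive generation
        trajectory_queue.put({"worker": worker_id, "steps": steps, "reward": random.random()})

def trainer() -> None:
    """Consume whatever trajectories are ready and run an update step."""
    batch = []
    while True:
        batch.append(trajectory_queue.get())
        if len(batch) >= 32:                   # update as soon as a batch is available
            rewards = [t["reward"] for t in batch]
            print(f"update on {len(batch)} trajectories, mean reward {sum(rewards)/len(rewards):.2f}")
            batch.clear()

for i in range(8):                             # generation and training overlap in time
    threading.Thread(target=rollout_worker, args=(i,), daemon=True).start()
threading.Thread(target=trainer, daemon=True).start()
time.sleep(2)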

This is the kind of infrastructure innovation that doesn’t show up in architecture diagrams but materially affects what the model can do. VentureBeat described it as learning through “long-term internship-style project completion” rather than memorisation.

DeepSeek Sparse Attention

GLM-5 borrows DeepSeek’s Sparse Attention mechanism to handle its 200K+ context window without the quadratic attention cost that normally makes long-context inference prohibitively expensive. This isn’t novel to GLM-5 — DeepSeek pioneered it — but the integration is clean and it’s a significant part of why the model can offer 200K context at competitive pricing.
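
This isn't DeepSeek's exact mechanism (their published design uses a lightweight learned indexer to choose which tokens to attend to), but the shape of the idea is easy to sketch: each query attends to a small top-k subset of the context rather than all of it, so the expensive part of attention stops scaling with the full sequence length:

import numpy as np

def topk_sparse_attention(q, K, V, k=64):
    """Attend one query to only its top-k highest-scoring keys.

    This sketch still scans every key to pick the top-k; DeepSeek's design
    does that selection with a much cheaper indexer, so the full attention
    computation only ever touches k tokens per query.
    q: (d,), K: (n, d), V: (n, d)
    """
    scores = K @ q / np.sqrt(q.shape[0])       # similarity of the query to every key
    top = np.argpartition(scores, -k)[-k:]     # indices of the k best-matching keys
    s = scores[top]
    weights = np.exp(s - s.max())
    weights /= weights.sum()                   # softmax over the selected keys only
    return weights @ V[top]                    # output mixes k values instead of all n

n, d = 200_000, 128                            # a 200K-token context
K, V = np.random.randn(n, d), np.random.randn(n, d)
q = np.random.randn(d)
print(topk_sparse_attention(q, K, V, k=64).shape)   # (128,)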

The economics argument

This is where GLM-5 gets genuinely disruptive.

| Provider | Input (per M tokens) | Output (per M tokens) |
| --- | --- | --- |
| GLM-5 (Z.ai) | $1.00 | $3.20 |
| GLM-5 (OpenRouter) | $0.80 | $2.56 |
| GPT-5 | $1.25 | $10.00 |
| Claude Opus 4.6 | Higher | Higher |

On output tokens — where most of the cost accumulates in coding and agentic workflows — GLM-5 through OpenRouter is roughly 4x cheaper than GPT-5. One Hacker News user reported refactoring code in a custom MOO language, with GLM-5 analysing APIs and example code with “basically no mistakes,” for a total cost of $1.50. They subsequently cancelled their Anthropic subscription.
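
To put the table in workload terms, here's a rough comparison for an output-heavy day of agentic coding. The token volume is an assumption picked purely for illustration:

# Illustrative output-token cost for one output-heavy agentic workload.
OUTPUT_TOKENS = 5_000_000   # assumed volume: long tool-call transcripts add up fast

output_price_per_m = {      # USD per million output tokens, from the table above
    "GLM-5 (OpenRouter)": 2.56,
    "GLM-5 (Z.ai)": 3.20,
    "GPT-5": 10.00,
}

for provider, price in output_price_per_m.items():
    print(f"{provider}: ${OUTPUT_TOKENS / 1e6 * price:.2f}")
# GLM-5 (OpenRouter): $12.80, GLM-5 (Z.ai): $16.00, GPT-5: $50.00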

An HN commenter captured the shift well: users now define “best model” as “the smallest, fastest, cheapest one that solves the job” rather than whatever tops the leaderboard. When the performance gap between frontier models is measured in single-digit percentage points but the price gap is measured in multiples, the economics win.

The Huawei question

Multiple third-party reports claim GLM-5 was trained entirely on 100,000 Huawei Ascend 910C processors using the MindSpore framework — zero NVIDIA hardware. If true, this is a significant milestone for China’s semiconductor self-sufficiency narrative and a data point in the ongoing US chip export restrictions debate.

The caveat: Hacker News commenters pointed out that this claim appears primarily in third-party coverage, not in Zhipu’s own official communications. Zhipu’s official materials confirm Ascend support for inference, which is a meaningfully different claim from training. The distinction matters — inference on domestic chips is an engineering achievement but not unprecedented. Training a frontier model end-to-end on them would be.

Until Zhipu publishes a technical report (none exists yet as of writing), the training infrastructure story remains partially unverified.

Running it yourself

GLM-5 is available through multiple channels, from managed APIs to local deployment:

Cloud APIs — the path of least resistance:

from openai import OpenAI

# Works with Z.ai, OpenRouter, or local vLLM
client = OpenAI(
    api_key="YOUR_KEY",
    base_url="https://openrouter.ai/api/v1"
)

response = client.chat.completions.create(
    model="z-ai/glm-5",
    messages=[
        {"role": "user", "content": "Explain sparse attention in transformers."}
    ]
)

print(response.choices[0].message.content)

The model is available on Z.ai’s own platform, OpenRouter, Together AI, and Modal. It uses an OpenAI-compatible API, so swapping it into existing code is a one-line base URL change.

Local inference — for those with the hardware:

The full BF16 model requires ~1,490GB of memory. The FP8 variant fits on 8x H200 GPUs via vLLM or SGLang. Community quantisations bring the disk footprint down dramatically — Unsloth’s GGUF quants range from 1.65TB (BF16) down to 176GB (1-bit) — but you’re trading quality for accessibility at the lower end.
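
Those figures follow directly from the parameter count; a quick sanity check, taking 141GB of HBM per H200 as the assumption:

# Sanity-checking the memory figures quoted above.
PARAMS = 744e9
H200_HBM_GB = 141                 # per-GPU memory; 8 GPUs in the node

print(f"BF16 weights: ~{PARAMS * 2 / 1e9:,.0f} GB")   # 2 bytes/param -> ~1,488 GB
print(f"FP8 weights:  ~{PARAMS * 1 / 1e9:,.0f} GB")   # 1 byte/param  -> ~744 GB
print(f"8x H200:       {8 * H200_HBM_GB:,} GB")       # 1,128 GB: FP8 fits, with room left for KV cache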

# vLLM deployment on 8x H200
vllm serve zai-org/GLM-5-FP8 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.85 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice

The self-hosting maths is brutal for most. One commenter calculated that an M3 Ultra Mac Studio at $10K would need roughly 30 years of continuous inference to break even against API pricing. Unless you have specific data sovereignty requirements or H200-class hardware already racked, the API is the right call.
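
That estimate is easy to reproduce. The local throughput below is my assumption (a heavily quantised 744B model is not quick on a Mac Studio), but the order of magnitude is the point:

# Reproducing the rough self-hosting break-even estimate.
HARDWARE_COST = 10_000        # M3 Ultra Mac Studio, USD
API_PRICE_PER_M = 2.56        # GLM-5 output tokens via OpenRouter, USD per million
LOCAL_TOKENS_PER_SEC = 4      # assumed throughput for a heavily quantised 744B model

tokens_to_break_even = HARDWARE_COST / API_PRICE_PER_M * 1e6   # ~3.9B tokens
years = tokens_to_break_even / LOCAL_TOKENS_PER_SEC / (60 * 60 * 24 * 365)
print(f"~{tokens_to_break_even / 1e9:.1f}B tokens, ~{years:.0f} years of continuous inference")
# -> ~3.9B tokens, ~31 years at these assumptions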

Ollama offers a cloud-routed glm-5:cloud option that gives you the Ollama interface without local compute, which is a reasonable middle ground for experimentation.

What the community actually thinks

The reception has been cautiously positive, with the emphasis on “cautiously.”

What people like: The MIT license gets universal praise. The pricing undercuts everything at the frontier. The hallucination improvements are real and measurable. Several developers have publicly switched workflows — one cancelled their Anthropic subscription after 18 months, another now uses a multi-model approach with GLM-5 as their primary workhorse for speed and tool-calling.

What people worry about: A safety researcher at Andon Labs flagged that GLM-5 is “incredibly effective” but “far less situationally aware” than Western frontier models — it pursues goals aggressively without reasoning about its own context. The distillation question looms: how much of the performance comes from training on outputs of Claude and GPT? The SWE-bench gap to Opus, while narrow, is real. And generated output can be “a bit rough” visually, according to hands-on testing from 36Kr.

The bigger picture: A Stanford study found Chinese models typically lag seven months behind Western counterparts. GLM-5 arrived roughly three months after the latest Anthropic and OpenAI releases, meaningfully closing that gap. Zhipu’s stock jumped 32% on the announcement. DeepSeek, which held the Chinese open-source crown, appears to be losing ground — The Decoder notes it’s falling behind both GLM-5 and Moonshot’s Kimi K2.5.

Practical takeaways

  • If you’re evaluating frontier models for production: GLM-5 belongs in the comparison set. The SWE-bench gap to Opus is small enough that the 4-10x price difference could easily tip the decision for many workloads.
  • If you care about open weights: This is the new high-water mark. MIT-licensed, 744B parameters, competitive with proprietary frontier models. The ecosystem is moving fast — 8 community fine-tunes and 13 quantisation variants within three days of release.
  • If you want to try it right now: OpenRouter is the quickest path. Point any OpenAI-compatible client at z-ai/glm-5 and go. The Z.ai platform at chat.z.ai offers a ChatGPT-style interface if you just want to kick the tyres.
  • If you’re waiting for the technical report: You’ll be waiting. Zhipu hasn’t published one yet, and the Huawei training claims remain unverified. Treat the benchmarks as credible but the infrastructure narrative with appropriate scepticism.

The frontier is getting crowded, and it’s getting cheap. That’s good for everyone building on top of these models.


Written by

Daniel Dewhurst

Lead AI Solutions Engineer building with AI, Laravel, TypeScript, and the craft of software.