
MiniMax M2.5: the agent-native model that costs a dollar an hour

10 min read

“Intelligence too cheap to meter.” That’s the tagline MiniMax chose for M2.5, a mixture-of-experts model that scores 80.2% on SWE-Bench Verified — within 0.6 points of Claude Opus 4.6 — while charging $1.20 per million output tokens. For context, Opus charges $25.

The model dropped on February 12, and the reaction has been a predictable mix of excitement and suspicion. MiniMax’s previous releases had a reputation for benchmark gaming that bordered on performance art. But independent evaluations are painting a different picture this time, and the pricing is forcing a conversation the industry has been avoiding: what happens when frontier-class coding ability becomes a commodity?

The company

MiniMax (稀宇科技) is a Shanghai-based AI company founded in 2022 by Yan Junjie, former VP at SenseTime. Most people outside China know them through Hailuo AI, their video generation platform that produces clips at roughly 1/10th the cost of OpenAI’s Sora and supports 70+ languages.

The company listed on the Hong Kong Stock Exchange in January 2026, raising approximately HK$4.2 billion. Shares surged 109% on the first day of trading, backed by cornerstone investments from the Abu Dhabi Investment Authority, Alibaba, and Tencent. When M2.5 launched two weeks later, the stock jumped another 15.7%.

MiniMax isn’t a pure-play LLM lab. Their product surface includes video generation (Hailuo), text-to-speech (Speech 2.6), AI music (Music 2.0), and character chat platforms (Talkie internationally, Xing Ye in China). M2.5 is the flagship text model, but it sits inside a broader multimodal ecosystem — one that’s now publicly traded and generating roughly 70% of its revenue from overseas markets.

Architecture: extreme sparsity as a business model

M2.5 is a mixture-of-experts transformer with 230 billion total parameters, of which only 10 billion activate per token — a 4.3% activation ratio. To put that in perspective:

| Model | Total params | Active params | Activation ratio |
|---|---|---|---|
| MiniMax M2.5 | 230B | 10B | 4.3% |
| DeepSeek V3 | 671B | 37B | 5.5% |
| GLM-5 | 744B | 40B | 5.4% |
| Claude Opus 4.6 | Undisclosed (dense) | All | 100% |

The MoE architecture is the entire cost story. Dense models fire every parameter on every token. MoE models route each token to a small subset of specialised expert networks, giving you the knowledge capacity of a much larger model at a fraction of the inference cost. M2.5 pushes this further than its peers — 10B active parameters is smaller than DeepSeek’s 37B or GLM-5’s 40B, which is how MiniMax can offer pricing that looks like a typo.
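To make the sparsity concrete, here is a toy sketch of top-k expert routing, the mechanism MoE layers use to decide which experts fire for a given token. The expert count, hidden size, and top-k value are invented for illustration and are nowhere near M2.5's real configuration.

import torch
import torch.nn.functional as F

def moe_route(x, gate, experts, k=2):
    """Route a single token through only the top-k scoring experts."""
    scores = F.softmax(x @ gate, dim=-1)            # gating scores, one per expert
    top_scores, top_idx = scores.topk(k)            # keep only the k best experts
    top_scores = top_scores / top_scores.sum()      # renormalise their weights
    # only the selected experts run a forward pass; the rest stay idle
    return sum(w * experts[i](x) for w, i in zip(top_scores, top_idx.tolist()))

# toy setup: 8 experts, 2 active per token (illustrative numbers only)
d = 16
experts = [torch.nn.Linear(d, d) for _ in range(8)]
gate = torch.randn(d, 8)
out = moe_route(torch.randn(d), gate, experts, k=2)

Compute per token scales with the handful of experts that actually run, while memory and knowledge capacity scale with all of them. That asymmetry is exactly what the pricing below reflects.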

Two variants ship:

  • M2.5 Standard — 50 tokens/sec, $0.15 input / $1.20 output per million tokens
  • M2.5 Lightning — 100 tokens/sec, $0.30 input / $2.40 output per million tokens

MiniMax frames the Lightning variant as “$1 per hour of continuous operation.” At that rate, a single instance running 24/7 for a full year comes in under $9,000; the Standard variant, at half the per-token price and half the speed, works out to roughly a quarter of that per hour.
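A back-of-envelope check of that framing, assuming each instance streams output at its rated speed for the whole hour and ignoring input-token costs:

# rough hourly output cost, assuming continuous generation at rated speed
def hourly_output_cost(tokens_per_sec, usd_per_million_output):
    tokens_per_hour = tokens_per_sec * 3600
    return tokens_per_hour / 1_000_000 * usd_per_million_output

lightning = hourly_output_cost(100, 2.40)   # ~ $0.86/hour
standard = hourly_output_cost(50, 1.20)     # ~ $0.22/hour
print(f"Lightning: ${lightning:.2f}/h, ${lightning * 24 * 365:,.0f}/year")
print(f"Standard:  ${standard:.2f}/h, ${standard * 24 * 365:,.0f}/year")

Real agent workloads spend much of their time on input-heavy context, so treat these as ceilings on output spend rather than a bill.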

Forge: the RL framework that matters more than the model

The architecture isn’t new — M2, M2.1, and M2.5 all share the same 230B/10B MoE skeleton. What changed is how M2.5 was trained, and this is where MiniMax’s actual technical contribution lives.

Forge is MiniMax’s agent-native reinforcement learning framework, purpose-built for training models on multi-step tool-use tasks. The key design decisions:

Decoupled architecture. Forge separates the training/inference engine from the agent scaffolding through a middleware abstraction layer. The model doesn’t learn to use one specific tool interface — it learns to generalise across scaffolds. MiniMax claims training covered 100,000+ distinct agent configurations.
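None of the names below come from Forge, whose interfaces aren't public; this is just an illustrative sketch of what "train against an abstraction rather than a scaffold" can look like in code.

from typing import Any, Protocol

class ToolScaffold(Protocol):
    """Hypothetical middleware surface; any concrete scaffold can implement it."""
    def list_tools(self) -> list[dict[str, Any]]: ...
    def call(self, name: str, arguments: dict[str, Any]) -> str: ...

def run_step(action: dict[str, Any], scaffold: ToolScaffold) -> str:
    # the policy only ever sees the abstract interface, so the same model
    # can be dropped into any scaffold that implements it
    return scaffold.call(action["tool"], action["arguments"])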

CISPO (Clipped Importance Sampling Policy Optimization). A custom RL algorithm designed for stable MoE training on long-horizon decision-making tasks. MiniMax reports roughly 2x speedup over ByteDance’s DAPO algorithm.
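MiniMax hasn't published M2.5's exact objective, but CISPO as described in their earlier work clips the importance-sampling weight itself and stops gradients through it, so every token keeps a policy-gradient term rather than being dropped the way PPO-style ratio clipping drops it. A minimal sketch, with the clipping thresholds and reduction chosen as assumptions:

import torch

def cispo_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    # per-token importance-sampling ratio between current and behaviour policy
    ratio = torch.exp(logp_new - logp_old)
    # clip the IS weight and detach it: the weight is bounded, but the
    # logp_new factor below still carries a gradient for every token
    weight = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()
    # REINFORCE-style surrogate weighted by the clipped, detached IS weight
    return -(weight * advantages * logp_new).mean()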

Process rewards for agent trajectories. Standard RL rewards the final outcome. Forge adds intermediate process rewards that monitor generation quality across long agent rollouts, balancing task completion accuracy against response speed. This addresses the credit assignment problem that makes agent RL notoriously difficult — when a 50-step trajectory fails, which step was the mistake?
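As a purely illustrative sketch of the blend (the weighting and the shape of the per-step scores are invented, not MiniMax's):

def trajectory_reward(step_scores, outcome, process_weight=0.3):
    """Blend per-step process scores with the final outcome signal.

    step_scores: quality score in [0, 1] for each intermediate step
    outcome: 1.0 if the task verifiably succeeded, else 0.0
    """
    process = sum(step_scores) / max(len(step_scores), 1)
    return (1 - process_weight) * outcome + process_weight * process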

Prefix tree merging. Trajectories that share common prefixes get merged into tree structures during training, achieving approximately 40x speedup in RL sample efficiency. Combined with asynchronous scheduling, this brought total M2.5 training time to roughly two months.
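The win comes from the fact that rollouts sampled from the same prompt share long common prefixes. A toy sketch of the merging idea (Forge's real data structures aren't public):

class PrefixNode:
    def __init__(self):
        self.children = {}   # next step -> PrefixNode
        self.count = 0       # number of trajectories sharing this prefix

def merge_trajectories(trajectories):
    """Fold step sequences into a prefix tree so a shared prefix is stored,
    and can be processed, once instead of once per trajectory."""
    root = PrefixNode()
    for trajectory in trajectories:
        node = root
        for step in trajectory:
            node = node.children.setdefault(step, PrefixNode())
            node.count += 1
    return root

# two rollouts that diverge only after a shared two-step prefix
tree = merge_trajectories([("read_file", "run_tests", "patch_a"),
                           ("read_file", "run_tests", "patch_b")])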

Scale of training environments. Forge trained M2.5 across 200,000+ real-world environments covering codebases in 10+ languages (Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, Ruby), web browsers, office applications (Word, Excel, PowerPoint), and multi-platform targets (Web, Android, iOS, Windows).

One emergent behaviour worth noting: M2.5 develops what MiniMax calls a “spec-writing tendency” — it actively decomposes and plans before writing code, producing specification documents before implementation. This wasn’t explicitly trained; it fell out of the RL process naturally.

MiniMax claims that 80% of newly committed code at their own headquarters is now M2.5-generated, with 30% of company tasks running autonomously on the model. That’s a bold claim, but it’s also a useful credibility signal — when the people who built the model trust it enough to run their own engineering on it, the benchmarks are probably directionally real.

The benchmarks

Where M2.5 leads or matches the frontier

| Benchmark | M2.5 | Opus 4.6 | GPT-5.2 | Notes |
|---|---|---|---|---|
| SWE-Bench Verified | 80.2% | 80.8% | 80.0% | Within rounding error of frontier |
| Multi-SWE-Bench | 51.3% | 50.3% | | Multi-repo engineering |
| SWE-Bench Pro | 55.4% | 53.4% | | Harder subset |
| Droid Harness | 79.7% | 78.9% | | Slightly ahead |
| OpenCode Harness | 76.1% | 75.9% | | Slightly ahead |
| BFCL Multi-Turn | 76.8% | ~63% | | 13+ points ahead on function calling |

The BFCL (Berkeley Function Calling Leaderboard) result is particularly notable. Multi-turn function calling — the ability to maintain context and call tools correctly across extended interactions — is arguably the most practically important capability for agent workflows. A 13-point lead over Opus here is not a rounding error.

SWE-Bench Verified runtime is also competitive: 22.8 minutes average versus Opus’s 22.9 minutes, using roughly 20% fewer agentic rounds than M2.1.

Where M2.5 falls short

| Benchmark | M2.5 | Opus 4.6 | Gap |
|---|---|---|---|
| AIME 2025 | 86.3% | 95.6% | -9.3 |
| GPQA-Diamond | 85.2% | 90.0% | -4.8 |
| SciCode | 44.4% | 52.0% | -7.6 |
| BrowseComp | 76.3% | 84.0% | -7.7 |
| Terminal-Bench 2 | 52.0% | 65.4% | -13.4 |

The pattern is clear: M2.5 is a coding and agent specialist, not a generalist. Pure mathematical reasoning (AIME), graduate-level science (GPQA), and terminal operations all show meaningful gaps. If your workload is “solve competition maths problems,” this isn’t your model.

Artificial Analysis flagged another concern: M2.5’s hallucination rate regressed compared to M2.1, with the Omniscience Index dropping from -30 to -41. The model is also notably verbose — it generated 56 million output tokens during evaluation, roughly 4x the median across models tested. You’re getting more output, but not necessarily more signal.

Independent validation

The OpenHands team — who maintain one of the most widely used open-source agent frameworks — ranked M2.5 4th overall on their index, behind only Claude Opus and GPT-5.2 Codex. They called it “the first open model to exceed Claude Sonnet’s performance” and found it capable across complex multi-step tasks including GitHub API interactions, PR analysis with git blame, and frontend debugging. Their caveat: occasional sloppiness with branch management and formatting instructions.

Artificial Analysis placed M2.5 at #4 out of 64 models on their Intelligence Index, with an output speed of 73.7 tokens/sec.

The benchmark gaming question

This is the elephant in the room. M2 and M2.1 built a reputation that M2.5 now has to overcome.

The charges against the predecessors were specific and damning. Community members documented M2.1 writing “nonsensical test reports while tests actually failed,” hardcoding test case outputs into algorithms instead of solving problems, and producing fake test suites on fabricated data before declaring applications perfectly functional. One Hacker News commenter called M2 “one of the most benchmaxxed models we’ve seen,” noting a “huge gap between SWE-B results” and real-world performance. Artificial Analysis placed M2.1’s actual coding index at 33 — far behind frontier models despite headline benchmark scores that suggested otherwise.

MiniMax has been surprisingly candid about this. Their M2.5 technical documentation acknowledges that with M1, “due to unreasonable reward design, the model consistently wrote overly simple test code, causing a large number of incorrect fix solutions to be selected.” The Forge framework’s process reward mechanism was explicitly designed to address this — rewarding genuine problem-solving trajectories rather than just final test-pass signals.

The evidence that M2.5 is different from its predecessors is circumstantial but accumulating. The OpenHands independent evaluation, Artificial Analysis scoring, and MiniMax’s self-reported internal benchmark (where M2.5 scored 59% versus Opus’s 73.5% — notably honest) all suggest that the headline numbers are at least directionally real this time. But the community’s trust deficit is earned, and only months of real-world usage will fully resolve it.

Running it yourself

M2.5 ships under a modified MIT license (the modification: commercial products must display “MiniMax M2.5” attribution in the UI). Weights are available on Hugging Face at MiniMaxAI/MiniMax-M2.5.

API access

The fastest path is through the MiniMax API or OpenRouter:

from openai import OpenAI

# MiniMax exposes an OpenAI-compatible endpoint, so the standard client works
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.minimax.io/v1"
)

response = client.chat.completions.create(
    model="minimax-m2.5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Implement a B-tree with insert and search in Rust."}
    ],
    # MiniMax's recommended sampling settings (see the note below)
    temperature=1.0,
    top_p=0.95
)

print(response.choices[0].message.content)

MiniMax recommends temperature=1.0, top_p=0.95, top_k=40 — higher temperature than most models, reflecting the MoE routing dynamics.
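The stock OpenAI client has no top_k argument, but openai-python's extra_body passes provider-specific fields through untouched. Whether the MiniMax endpoint accepts top_k sent this way is an assumption to verify against their API docs:

response = client.chat.completions.create(
    model="minimax-m2.5",
    messages=[{"role": "user", "content": "Summarise this diff."}],
    temperature=1.0,
    top_p=0.95,
    # extra_body is the escape hatch for parameters outside the OpenAI spec;
    # top_k=40 is the vendor-recommended value, assumed to be accepted here
    extra_body={"top_k": 40},
)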

Self-hosting

Despite 230B total parameters, the MoE architecture makes self-hosting more feasible than the number suggests. GGUF quantizations are available via Unsloth:

| Quantization | Size | Notes |
|---|---|---|
| Q3_K_M | ~109 GB | Fits high-RAM consumer setups |
| Q4_K_M | ~138 GB | Common sweet spot |
| Q6_K | ~188 GB | High quality |
| Q8_0 | ~243 GB | Near full precision |
| BF16 | ~457 GB | Full precision, multi-GPU required |
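Those sizes track a simple bits-per-weight estimate. The effective bits-per-weight figures below are rough conventions for these GGUF formats, not exact numbers:

TOTAL_PARAMS = 230e9

# approximate effective bits per weight for common GGUF quant formats
BITS_PER_WEIGHT = {"Q3_K_M": 3.8, "Q4_K_M": 4.8, "Q6_K": 6.56, "Q8_0": 8.5, "BF16": 16.0}

for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{TOTAL_PARAMS * bpw / 8 / 1e9:.0f} GB")

Multiply by whatever quant you can fit, add headroom for the KV cache, and you have a first-pass hardware budget.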

For GPU inference, vLLM and SGLang are the recommended frameworks:

vllm serve MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --enable-auto-tool-choice

KTransformers supports CPU-GPU heterogeneous inference and has been demonstrated on consumer hardware (2x RTX 5090 + AMD EPYC). If you have 160GB+ of system RAM and a decent GPU, the Q4 quantization is within reach via llama.cpp.

The model is also available on OpenRouter (minimax/minimax-m2.5), Ollama (minimax-m2.5:cloud), and has a Cline integration for IDE-based coding workflows.

What M2.5 actually means

For the past year, the story of US-China AI competition was about closing the capability gap. On coding benchmarks, that gap is now a rounding error. M2.5, GLM-5, DeepSeek V3 — three Chinese open-weight models all trading blows with Opus and GPT-5.2 on the tasks that generate the most economic value.

The new contest is about cost. And here, the MoE architecture gives Chinese labs a structural advantage. When your model only activates 10 billion of its 230 billion parameters per token, you can offer pricing that dense-architecture competitors physically cannot match without destroying the margins that fund their research.

M2.5 is not the best model at everything. It’s meaningfully worse than Opus at mathematical reasoning, science, and terminal operations. It hallucinates more than its predecessor. It’s verbose. And the benchmark gaming legacy of M2/M2.1 means every headline number comes with an asterisk that only time and usage will erase.

But the right question isn’t “is M2.5 better than Opus?” It’s “is 95% of Opus on coding tasks, at 5% of the price, with open weights under a modified MIT license, good enough for your workflow?” For a growing number of developers running agent loops, batch processing, and automated code review, the answer is increasingly yes.


Written by

Daniel Dewhurst

Lead AI Solutions Engineer building with AI, Laravel, TypeScript, and the craft of software.