Nemotron 3 Super: NVIDIA's 120B open-weight model built for agentic workloads
NVIDIA doesn’t do subtle launches. At GTC 2026, they dropped Nemotron 3 Super — an open-weight model with 120 billion total parameters (only 12 billion active per forward pass), a 1M-token context window, and native 4-bit pre-training. It’s designed from the ground up for agentic AI: multi-step reasoning, tool use, long-context coding, the works.
What makes this interesting isn’t just the specs. It’s that NVIDIA is publishing the complete training methodology, over 10 trillion tokens of pre- and post-training data, 15 reinforcement learning environments, and full evaluation recipes. For a company that historically kept its AI models behind API walls, this is a sharp turn.
The three-way hybrid architecture
Most models pick an architecture and stick with it. Nemotron 3 Super combines three.
The backbone interleaves Mamba-2 layers (state-space models optimised for long sequences), Transformer attention layers (for precise reasoning), and a novel Latent Mixture-of-Experts routing system. Each component pulls its weight differently:
- Mamba-2 handles the bulk of sequence processing. State-space models scale linearly with context length rather than quadratically, which is how you get a workable 1M-token window without melting your GPU budget. NVIDIA claims these layers are 4x more compute-efficient than standard attention.
- Transformer layers are placed strategically where the model needs genuine cross-token reasoning. You don’t need full attention everywhere — just where it counts.
- LatentMoE is the genuinely new piece. Instead of routing full-dimensional token representations to experts, each token gets projected into a lower-dimensional latent space first. This shrinks the all-to-all communication traffic by a factor of d/l (hidden dimension over latent dimension). The savings get reinvested into activating 4x more experts per token at roughly the same inference cost.
The result: 120.6B total parameters, but only 12.7B active on any given forward pass. You get the knowledge capacity of a much larger model with the inference cost of a much smaller one.
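The communication arithmetic behind LatentMoE can be sketched numerically. The dimensions below are illustrative assumptions — NVIDIA hasn’t published the exact hidden or latent sizes — but they show how the d/l ratio translates into all-to-all savings:

```python
# Back-of-envelope for LatentMoE's all-to-all savings. All dimensions here
# are assumed for illustration, not NVIDIA's published values.
def all_to_all_bytes(tokens: int, dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes exchanged to route `tokens` vectors of width `dim` (BF16 elements)."""
    return tokens * dim * bytes_per_elem

tokens = 4096       # tokens in flight per routing step (assumed)
d_hidden = 8192     # model hidden dimension (assumed)
d_latent = 2048     # latent routing dimension (assumed)

full_path = all_to_all_bytes(tokens, d_hidden)
latent_path = all_to_all_bytes(tokens, d_latent)
print(f"traffic shrinks by d/l = {full_path / latent_path:.0f}x")
```

With these toy numbers the traffic drops 4x — headroom that, per NVIDIA, gets spent on activating more experts per token.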
On top of this, Nemotron 3 Super includes Multi-Token Prediction (MTP) heads — auxiliary prediction layers that function as a built-in draft model for speculative decoding. No separate draft model needed. On SPEED-Bench, it achieves an average acceptance length of 3.45 tokens per verification step (compared to 2.70 for DeepSeek-R1), which translates to real wall-clock speedups of up to 3x.
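A rough model shows how acceptance length maps to decoding speedup. The draft-overhead figure below is an assumption (MTP heads are cheap relative to a full forward pass), so treat this as a sketch rather than NVIDIA’s accounting:

```python
# Simplified speculative-decoding speedup model. draft_overhead is an
# assumed cost of the draft pass relative to one verifier pass; real
# wall-clock gains also depend on batching and kernel efficiency.
def est_speedup(accept_len: float, draft_overhead: float = 0.1) -> float:
    """Tokens emitted per unit of verifier compute vs. one-token-at-a-time."""
    return accept_len / (1.0 + draft_overhead)

print(f"Nemotron 3 Super (3.45 accepted): ~{est_speedup(3.45):.1f}x")
print(f"DeepSeek-R1      (2.70 accepted): ~{est_speedup(2.70):.1f}x")
```

Under these assumptions, an acceptance length of 3.45 lands in the same ballpark as the claimed up-to-3x wall-clock speedup.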
Training in 4-bit from the start
Here’s where it gets genuinely unusual. Most models train in BF16 or FP32, then get quantised down to 8-bit or 4-bit for deployment. Each step down costs accuracy. You’re fitting a high-precision model into a low-precision box.
Nemotron 3 Super flips this. It’s pre-trained natively in NVFP4 — NVIDIA’s 4-bit floating-point format designed for Blackwell GPUs. The format uses E2M1 elements with 16-element micro-blocks. The model learns to be accurate within 4-bit arithmetic constraints from the beginning, rather than having precision stripped away after the fact.
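To make E2M1 concrete, here’s a toy quantiser over a 16-element micro-block. The grid of representable magnitudes follows from the E2M1 bit layout (2 exponent bits, 1 mantissa bit); the per-block scale handling is simplified relative to the real format, which encodes scales in its own low-precision representation:

```python
# Toy NVFP4-style quantiser: E2M1 elements with a per-16-element block scale.
# Illustrative only — the real format's scale encoding differs.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes

def quantise_block(block):
    assert len(block) == 16, "NVFP4 micro-blocks hold 16 elements"
    scale = max(abs(x) for x in block) / 6.0 or 1.0  # map the block max onto +/-6
    def snap(x):
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        return (mag if x >= 0 else -mag) * scale
    return [snap(x) for x in block]

weights = [0.9, -0.1, 0.45, 0.0, 1.2, -0.6, 0.3, 0.05,
           -0.9, 0.7, 0.2, -0.4, 0.15, 0.8, -1.1, 0.6]
print(quantise_block(weights))
```

Eight magnitudes per sign is a brutally coarse grid — which is exactly why learning within those constraints from the start beats squeezing a full-precision model into them afterwards.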
Not everything runs in FP4 — attention projections, latent projections, MTP layers, and the final 15% of the network use BF16 or MXFP8 for stability. But the bulk of computation happens in 4-bit.
The practical payoff: on Blackwell hardware, NVFP4 inference runs 4x faster than FP8 models on Hopper with no accuracy loss. That’s a generational leap in inference economics, not just a marginal improvement.
The throughput numbers
Raw benchmark accuracy only tells half the story. For agentic workloads — where a model might get called hundreds of times inside a single pipeline — throughput matters as much as quality.
NVIDIA’s headline claims:
- 5x throughput over the previous Nemotron Super model
- 2.2x throughput over GPT-OSS-120B (OpenAI’s open-weight 120B)
- 7.5x throughput over Qwen3.5-122B
These come from the combination of Mamba-2 efficiency, LatentMoE routing, MTP speculative decoding, and native FP4 training. Each piece contributes independently; stacked together, the throughput advantage is substantial.
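The compounding is easy to see with hypothetical per-component factors — these numbers are invented for illustration, since NVIDIA publishes only the end-to-end figures:

```python
# Hypothetical decomposition of the headline 5x gain over the previous
# Nemotron Super. Every factor below is assumed, not measured.
factors = {
    "mamba2_layers": 1.6,
    "latent_moe_routing": 1.4,
    "mtp_spec_decode": 1.5,
    "nvfp4_kernels": 1.5,
}
total = 1.0
for factor in factors.values():
    total *= factor
print(f"combined: ~{total:.1f}x")  # modest per-component wins compound quickly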
On accuracy benchmarks, the picture is more nuanced:
| Benchmark | Nemotron 3 Super | GPT-OSS-120B | Qwen3.5-122B |
|---|---|---|---|
| SWE-Bench Verified | 60.47% | 41.90% | 66.40% |
| LiveCodeBench | 81.19% | — | 78.93% |
| HMMT (math) | 93.67% | — | 91.40% |
| RULER (1M context) | 91.75% | 22.30% | — |
| MMLU-Pro | 83.73% | — | 86.70% |
| GPQA | 79.23% | — | 86.60% |
It leads on math, code execution, and long-context reliability. Qwen3.5-122B still edges it on knowledge-heavy benchmarks like MMLU-Pro and GPQA. The real differentiator is the throughput-per-quality ratio — you’re getting competitive accuracy at a fraction of the compute cost.
The model also holds the top position on the DeepResearch Bench and DeepResearch Bench II leaderboards, which test multi-step research and synthesis tasks.
Running it yourself
Hardware requirements
Despite 120B total parameters, the MoE architecture keeps memory requirements manageable. NVIDIA lists these as supported configurations:
- 1x B200 or GB200 (Blackwell — ideal)
- 2x H100 or 1x H200 (Hopper)
- 4x A100 or 4x L40S (Ampere/Ada)
- 1x DGX Spark or 1x RTX 6000
For the GGUF quantised variants, expect to need 64–72GB of RAM at 4-bit precision.
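That figure passes a back-of-envelope check. Four-bit weights are half a byte each; the overhead estimate below is an assumption, not a measurement:

```python
# Sanity-check on the quoted 64-72 GB for 4-bit local inference.
total_params = 120.6e9
raw_4bit_gb = total_params * 4 / 8 / 1e9  # half a byte per weight
overhead_gb = 8.0  # assumed: quant scales, higher-precision layers, KV/state cache
print(f"raw weights: ~{raw_4bit_gb:.1f} GB")
print(f"working set: ~{raw_4bit_gb + overhead_gb:.0f} GB")
```

Roughly 60GB of raw weights plus runtime overhead lands comfortably inside the quoted range.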
API access
The fastest path is through NVIDIA’s NIM endpoint:
```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],  # export NVIDIA_API_KEY first
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-super-120b-a12b",
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=4096,
)
print(response.choices[0].message.content)
```

It’s also available on OpenRouter, DeepInfra ($0.10/M input, $0.50/M output tokens), Together AI, Perplexity, Fireworks AI, and Baseten. Cloud marketplace availability includes Google Cloud Vertex AI and Oracle Cloud, with AWS Bedrock and Azure coming.
Local deployment with vLLM
```shell
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1
```

This runs the FP8 variant across 4 GPUs. The `--enable-reasoning` flag activates the model’s chain-of-thought mode.
Running with llama.cpp
Unsloth has published GGUF quantisations:
```shell
llama-cli \
  -m NVIDIA-Nemotron-3-Super-120B-A12B-Q4_K_M.gguf \
  -c 8192 \
  -p "Your prompt here"
```

TensorRT-LLM deployment is also supported via NVIDIA’s official cookbook in the Nemotron repository, using `trtllm-serve` for optimised Blackwell/Hopper inference.
What the community actually thinks
Reddit’s r/LocalLLaMA and Hacker News threads paint a more mixed picture than the benchmarks suggest.
The good: Speed is real. Users report 478 tokens/sec in optimised configurations. The Mamba-2 hybrid architecture delivers on the throughput promises. Greptile, which uses it for automated code review, says it “punches above its weight” — producing useful reviews in 12.5 seconds. Tool calling and task compliance get high marks across multiple reports.
The less good: Several users on r/LocalLLaMA report basic coding quality issues — “simple syntax errors” in generated code that you wouldn’t expect from a model at this benchmark level. The hardware barrier is real; 64GB+ RAM for quantised local inference puts it out of reach for casual local deployment. And a recurring theme: for general-purpose local use, it doesn’t clearly beat Qwen3.5 despite the throughput advantage.
The nuanced take: Nemotron 3 Super seems optimised for a specific workload profile — agentic pipelines where you need fast, structured responses across many sequential calls. If you’re using it as a general chat model or expecting Sonnet-quality creative writing, you’ll probably be disappointed. If you’re building multi-agent systems where throughput and tool use reliability matter more than peak accuracy on any single query, the value proposition is much clearer.
Daily.co benchmarked it for voice AI applications and found it performs at GPT-5.4 level on long-conversation voice agent benchmarks — another sign that it’s built for sustained, multi-turn interaction rather than one-shot brilliance.
The openness angle
This is arguably the most significant part of the release, and it’s easy to overlook next to the architecture innovation.
NVIDIA is publishing:
- 10+ trillion tokens of pre- and post-training datasets
- 15 training environments for reinforcement learning
- Complete evaluation recipes and training methodology
- Pre-trained, post-trained, and quantised checkpoints on Hugging Face
The license is NVIDIA’s Nemotron Open Model License — commercially usable with attribution, free to create and distribute derivatives. It’s not Apache 2.0 (there’s a patent termination clause if you litigate against NVIDIA), but it’s meaningfully open for enterprise use.
For context, NVIDIA’s previous open model releases were far more restricted. Publishing the full training data and RL environments is a signal that they’re competing on the training methodology level, not just the weights. If you have the compute, you could in theory reproduce or adapt the entire training pipeline.
This matters because the agentic AI space is moving toward compound systems — orchestration layers that route between multiple specialised models. Open weights and documented training pipelines mean you can fine-tune Nemotron 3 Super for your specific agent tasks rather than treating it as a fixed black box.
Where it fits
NVIDIA positions Nemotron 3 Super within a tiered family: Nano (30B total / 3B active) handles targeted individual steps, Super manages complex multi-step planning, and an Ultra model is still to come.
The practical sweet spot seems clear: if you’re building multi-agent systems that need to process large codebases, research corpora, or document sets in a single context window, the combination of 1M-token context, 12B active parameters, and 7.5x throughput advantage over Qwen3.5-122B is compelling. Early adopters like Perplexity (search), CodeRabbit (code review), Factory and Greptile (software dev agents), and enterprise players like Siemens, Palantir, and Dassault Systèmes point to where the value lands.
For local enthusiasts running models on consumer hardware, the story is less exciting. The 64GB+ RAM floor and reported coding inconsistencies mean Qwen3.5 and other models remain competitive choices for general-purpose local use.
The real bet NVIDIA is making here isn’t on any single model’s benchmark scores. It’s on the idea that agentic workloads need purpose-built models — optimised for throughput, tool use, and long context rather than raw chat quality. Whether that thesis holds depends on how quickly multi-agent architectures move from research prototypes to production systems. Given the trajectory of tools like Claude Code, Devin, and Codex, I’d say the timing is right.