nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
NVIDIA Nemotron-3-Super Mamba-hybrid latent-MoE (~120B total / ~12B active) with BF16, FP8, and NVFP4 variants
Overview
NVIDIA Nemotron-3-Super-120B-A12B is a hybrid-Mamba latent-MoE model (~120B total, ~12B active per token) trained for general reasoning, tool use, and agentic workflows. It supports a 1M-token context window and Multi-Token Prediction (MTP). Variants ship in BF16, FP8, and NVFP4 (Blackwell). A pre-RL Base BF16 checkpoint is also available for downstream fine-tuning.
Prerequisites
- Hardware: 4-8x H100/H200/B200/RTX Pro 6000 GPUs, or a DGX Spark
- vLLM >= 0.17.1
- Docker with NVIDIA Container Toolkit (recommended)
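The GPU counts above follow from weight memory alone. A back-of-envelope sketch (illustrative only; it ignores KV cache, activations, and runtime overhead, so real requirements are higher):

```python
# Rough weight-memory estimate for the three variants of a ~120B-parameter
# model. Real serving needs more memory than this (KV cache, activations,
# CUDA graphs), so treat these as lower bounds.

PARAMS = 120e9  # ~120B total parameters

BYTES_PER_PARAM = {
    "BF16": 2.0,    # 16-bit floats
    "FP8": 1.0,     # 8-bit floats
    "NVFP4": 0.5,   # 4-bit floats (small scale-factor overhead ignored)
}

for dtype, nbytes in BYTES_PER_PARAM.items():
    gb = PARAMS * nbytes / 1e9
    print(f"{dtype}: ~{gb:.0f} GB of weights")
# BF16: ~240 GB of weights -> needs 4x 80 GB GPUs just for weights
# FP8:  ~120 GB of weights
# NVFP4: ~60 GB of weights -> fits in 2 GPUs, matching --tensor-parallel-size 2
```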
Launch commands
Reference command from the vLLM blog (BF16, 4x H100, FP8 KV cache):
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
--kv-cache-dtype fp8 \
--tensor-parallel-size 4 \
--trust-remote-code \
--served-model-name nemotron \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser nemotron_v3
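Once the server is up, it exposes vLLM's OpenAI-compatible API (default port 8000). A minimal client sketch using only the standard library, assuming the launch command above with `--served-model-name nemotron` and a locally reachable server; the prompt text is a placeholder:

```python
import json
from urllib import request

# Minimal chat-completions payload for the server launched above.
# "nemotron" matches --served-model-name; the endpoint path is vLLM's
# OpenAI-compatible default.
payload = {
    "model": "nemotron",
    "messages": [
        {"role": "user", "content": "Summarize Multi-Token Prediction in one sentence."}
    ],
    "max_tokens": 256,
}

req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same request works against the FP8 and NVFP4 launches below, except that those commands omit `--served-model-name`, so `"model"` must then be the full checkpoint name.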
FP8 weights:
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
--kv-cache-dtype fp8 \
--tensor-parallel-size 4 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser nemotron_v3
NVFP4 (Blackwell only):
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--tensor-parallel-size 2 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser nemotron_v3
Benchmarking
vllm bench serve \
--model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
--trust-remote-code \
--dataset-name random \
--random-input-len 1024 --random-output-len 1024 \
--num-warmups 20 \
--ignore-eos \
--max-concurrency 1024 \
--num-prompts 2048
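This run sends 2048 prompts of 1024 random input tokens each and, because of `--ignore-eos`, forces exactly 1024 output tokens per prompt. A small sketch of the token accounting and of turning a measured wall-clock duration into throughput (the 300 s figure is a made-up placeholder, not a measured result):

```python
# Token accounting for the benchmark command above.
NUM_PROMPTS = 2048
INPUT_LEN = 1024   # --random-input-len
OUTPUT_LEN = 1024  # --random-output-len, forced by --ignore-eos

total_input = NUM_PROMPTS * INPUT_LEN    # prompt tokens processed
total_output = NUM_PROMPTS * OUTPUT_LEN  # tokens generated

def output_throughput(elapsed_s: float) -> float:
    """Generated tokens per second over the whole run."""
    return total_output / elapsed_s

print(f"total tokens: {total_input + total_output:,}")  # total tokens: 4,194,304
print(f"at a hypothetical 300 s: {output_throughput(300):.0f} output tok/s")
```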