nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16

NVIDIA Nemotron-3-Super, a Mamba-hybrid latent-MoE model (~120B total / ~12B active per token), available in BF16, FP8, and NVFP4 variants

MoE · 120B total / 12B active · 262,144 ctx · vLLM 0.17.1+ · text

Overview

NVIDIA Nemotron-3-Super-120B-A12B is a Mamba-hybrid latent-MoE model (~120B parameters total, ~12B active per token) trained for general reasoning, tool use, and agentic workflows. It supports a 1M-token context window and Multi-Token Prediction (MTP). Variants ship in BF16, FP8, and NVFP4 (the last for Blackwell GPUs). A pre-RL Base BF16 checkpoint is also available for downstream fine-tuning.

Prerequisites

  • Hardware: 4-8x H100/H200/B200/RTX Pro 6000, or DGX Spark
  • vLLM >= 0.17.1
  • Docker with the NVIDIA Container Toolkit (recommended); a containerized launch sketch follows this list
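
For the containerized route, a minimal sketch using the official vllm/vllm-openai image, assuming the :latest tag carries vLLM 0.17.1+ (adjust the tag, GPU count, and flags to mirror whichever launch command below you use):

# Mount the HF cache so weights are not re-downloaded on each run
docker run --rm --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --tensor-parallel-size 4 \
  --trust-remote-code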

Launch commands

Reference command from the vLLM blog (BF16, 4x H100, FP8 KV cache):

vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --served-model-name nemotron \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nemotron_v3
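
Once the server is up, a quick smoke test against vLLM's OpenAI-compatible endpoint (port 8000 is the vLLM default; the model name matches --served-model-name above):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron",
    "messages": [{"role": "user", "content": "Give a one-sentence summary of Mamba."}],
    "max_tokens": 64
  }'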

FP8 weights:

vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nemotron_v3
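
Because --enable-auto-tool-choice and --tool-call-parser are set, the server can return structured tool calls in the OpenAI tools format. A sketch of such a request (the get_weather tool is hypothetical; this launch sets no --served-model-name, so the full model ID is used):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'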

NVFP4 (Blackwell only):

vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nemotron_v3
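
With a reasoning parser enabled, vLLM separates the model's thinking from its final answer: chat responses carry a reasoning_content field alongside content. One way to inspect both, assuming jq is installed:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    "messages": [{"role": "user", "content": "Is 91 prime?"}]
  }' | jq '.choices[0].message | {reasoning_content, content}'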

Benchmarking

vllm bench serve \
  --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 \
  --trust-remote-code \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 \
  --num-warmups 20 \
  --ignore-eos \
  --max-concurrency 1024 \
  --num-prompts 2048
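
Note that vllm bench serve drives an already-running server rather than launching one: start one of the serve commands above first, and keep --model in sync with the variant being served. By default the client targets localhost:8000; pass --host/--port (or --base-url) to point it at a server elsewhere.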

References