nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
NVIDIA Nemotron-3-Nano Mamba-hybrid MoE (30B total / ~3B active) with BF16 and FP8 variants
Overview
NVIDIA Nemotron-3-Nano-30B-A3B is a hybrid-Mamba MoE model (30B total, ~3B active) with FP8 and BF16 variants. It supports DGX Spark and Jetson Thor in addition to standard Hopper/Blackwell servers.
Prerequisites
- Hardware: 1x H100/H200 or comparable; DGX Spark and Jetson Thor supported
- Software: vLLM >= 0.11.2 (0.12.0 recommended for full support)
- Docker with NVIDIA Container Toolkit (recommended)
Pull Docker Image
docker pull --platform linux/amd64 vllm/vllm-openai:v0.12.0
docker tag vllm/vllm-openai:v0.12.0 vllm/vllm-openai:deploy
DGX Spark users can build from source (see README) or use the NGC image:
docker pull nvcr.io/nvidia/vllm:25.12.post1-py3
Jetson Thor:
docker pull ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor
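Once an image is pulled, the server can be started directly from the container. A minimal sketch (the port mapping, cache mount, and model choice here are illustrative; adjust them for your host):

```shell
# Launch the vLLM OpenAI-compatible server from the tagged image.
# --gpus all exposes the GPU(s); the HF cache mount avoids re-downloading weights.
docker run --rm --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:deploy \
  --model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --trust-remote-code
```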
Launch commands
FP8 with FlashInfer MoE backend (Blackwell/Hopper):
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_FLASHINFER_MOE_BACKEND=throughput
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
--trust-remote-code \
--async-scheduling \
--kv-cache-dtype fp8 \
--tensor-parallel-size 1
BF16 with the reasoning and tool-call parsers (typical for DGX Spark / Jetson Thor):
wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--max-num-seqs 8 \
--tensor-parallel-size 1 \
--max-model-len 262144 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3
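With the parsers enabled, the server accepts OpenAI-compatible chat requests that include tool definitions. A sketch of assembling such a request, assuming the default endpoint at `http://localhost:8000`; the `get_weather` tool schema is a hypothetical example, not part of the model:

```python
import json

def build_chat_request(prompt: str, model: str) -> dict:
    """Assemble an OpenAI-compatible chat payload with one sample tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, for illustration only
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",  # let the model decide whether to call the tool
        "max_tokens": 512,
    }

payload = build_chat_request(
    "What's the weather in Santa Clara?",
    "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
)
print(json.dumps(payload, indent=2))
# POST this to http://localhost:8000/v1/chat/completions, e.g. with
# requests.post(url, json=payload, timeout=120)
```

With `--enable-auto-tool-choice` set, the response will carry structured `tool_calls` when the model decides to invoke the tool, and the `nano_v3` reasoning parser separates reasoning content from the final answer.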
Key flags:
- --kv-cache-dtype: fp8 for the FP8 variant, auto for BF16
- --async-scheduling: reduces host overhead between decode steps
- --mamba-ssm-cache-dtype: float32 for best accuracy, float16 for speed
- --max-num-seqs: cap to match client concurrency for lower per-user latency
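The kv-cache-dtype choice directly halves KV-cache memory per token (fp8 is 1 byte vs. 2 for bf16). A back-of-envelope sketch; the layer and head dimensions below are hypothetical placeholders, not the real Nemotron-3-Nano config, and in a Mamba-hybrid model only the attention layers hold a KV cache:

```python
def kv_bytes_per_token(attn_layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    """Bytes of KV cache consumed per generated token (2x for K and V)."""
    return 2 * attn_layers * kv_heads * head_dim * dtype_bytes

# Hypothetical dimensions, chosen only to show the fp8-vs-bf16 ratio.
bf16 = kv_bytes_per_token(attn_layers=8, kv_heads=8, head_dim=128, dtype_bytes=2)
fp8 = kv_bytes_per_token(attn_layers=8, kv_heads=8, head_dim=128, dtype_bytes=1)
print(bf16, fp8)  # fp8 uses exactly half the bytes per token
```

Freed KV-cache memory translates into more concurrent sequences or longer contexts at the same batch size.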
Benchmarking
vllm bench serve \
--model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
--trust-remote-code \
--dataset-name random \
--random-input-len 1024 --random-output-len 1024 \
--num-warmups 20 \
--ignore-eos \
--max-concurrency 1024 \
--num-prompts 2048
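To interpret the benchmark output, the headline numbers follow from simple arithmetic: 2048 prompts each generating 1024 tokens. A sketch, assuming a measured wall-clock duration (the 180 s below is an illustrative placeholder, not a measured result); average latency follows from Little's law (W = L / lambda):

```python
def bench_summary(num_prompts: int, output_len: int, duration_s: float,
                  max_concurrency: int) -> dict:
    """Derive aggregate throughput and per-request latency from a benchmark run."""
    total_tokens = num_prompts * output_len
    in_flight = min(num_prompts, max_concurrency)  # steady-state concurrency L
    req_per_s = num_prompts / duration_s           # arrival/completion rate lambda
    return {
        "output_tok_per_s": total_tokens / duration_s,
        "avg_requests_in_flight": in_flight,
        "avg_request_latency_s": in_flight / req_per_s,  # W = L / lambda
    }

stats = bench_summary(num_prompts=2048, output_len=1024,
                      duration_s=180.0, max_concurrency=1024)
print(round(stats["output_tok_per_s"]), stats["avg_request_latency_s"])
```

`vllm bench serve` reports these metrics itself; the point of the sketch is that at high concurrency, raising --max-concurrency improves aggregate tokens/s but lengthens each individual request.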
Troubleshooting
- Use --kv-cache-dtype fp8 only with the FP8 checkpoint.
- Balance TP and --max-num-seqs for throughput vs. per-user latency.