nvidia/NVIDIA-Nemotron-Nano-9B-v2
NVIDIA Nemotron-Nano 9B (Mamba-hybrid dense) reasoning + tool-use model with FP8 / NVFP4 / Japanese variants
Overview
Nemotron-Nano-9B-v2 is a 9B Mamba-hybrid dense reasoning model that runs on a single H100/H200/B200. Variants ship in BF16, FP8, and NVFP4; a Japanese-specialized fine-tune and a pre-RL Base checkpoint are also available.
Prerequisites
- Hardware: 1x H100/H200/B200 (or comparable)
- vLLM >= 0.10.1 (pip install -U vllm)
- Docker with NVIDIA Container Toolkit (recommended)
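A quick way to confirm the installed vLLM meets the version floor is a tuple comparison on the dotted version string. This is a minimal sketch; the meets_min_version helper is hypothetical, and in practice the installed version comes from vllm.__version__.

```python
# Hypothetical helper: compare a dotted version string against the
# minimum vLLM version required above (0.10.1).
def meets_min_version(installed: str, minimum: str = "0.10.1") -> bool:
    parse = lambda v: tuple(int(p) for p in v.split("."))
    return parse(installed) >= parse(minimum)

# Real usage would pass: import vllm; vllm.__version__
print(meets_min_version("0.10.1"))  # True
print(meets_min_version("0.9.2"))   # False
```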
Launch command
wget https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2/resolve/main/nemotron_toolcall_parser_no_streaming.py
vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
--trust-remote-code \
--max-model-len 131072 \
--tensor-parallel-size 1 \
--enable-auto-tool-choice \
--tool-parser-plugin nemotron_toolcall_parser_no_streaming.py \
--tool-call-parser nemotron_json
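Once the server above is running, tool calls go through the standard OpenAI-compatible /v1/chat/completions endpoint (default http://localhost:8000/v1). The sketch below only constructs the request payload; the get_weather tool is a hypothetical example, and the /think system message follows the model card's reasoning-toggle convention (use /no_think to disable traces).

```python
import json

# Sketch of an OpenAI-style chat request with one example tool.
# Send this body with any OpenAI client or plain HTTP POST once
# the vllm serve command above is up.
payload = {
    "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    "messages": [
        {"role": "system", "content": "/think"},  # reasoning on (assumption: model-card toggle)
        {"role": "user", "content": "What is the weather in Tokyo?"},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for illustration
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}
print(json.dumps(payload)[:50])
```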
FP8:
vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8 \
--trust-remote-code \
--tensor-parallel-size 1
NVFP4 (Blackwell only):
vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2-NVFP4 \
--trust-remote-code \
--tensor-parallel-size 1
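To see why the quantized variants matter on a single GPU, here is back-of-envelope weight memory for 9B parameters at each precision. This is rough arithmetic only (weights alone; KV cache, activations, and quantization scale overheads are extra).

```python
# Approximate weight footprint: parameter count x bytes per parameter.
params = 9e9
bytes_per_param = {"BF16": 2, "FP8": 1, "NVFP4": 0.5}
for name, b in bytes_per_param.items():
    print(f"{name}: {params * b / 1e9:.1f} GB")
# BF16: 18.0 GB, FP8: 9.0 GB, NVFP4: 4.5 GB
```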
Benchmarking
vllm bench serve \
--model nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
--trust-remote-code \
--dataset-name random \
--random-input-len 1024 --random-output-len 1024 \
--num-warmups 20 \
--ignore-eos \
--max-concurrency 256 \
--num-prompts 1024
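The flags above define the workload exactly: 1024 prompts with 1024 random input tokens each, and 1024 forced output tokens per prompt (--ignore-eos makes generation run to the full length), with up to 256 requests in flight. A quick sketch of the implied token totals, under those assumptions:

```python
# Totals implied by the benchmark flags above.
num_prompts, in_len, out_len = 1024, 1024, 1024
total_in = num_prompts * in_len    # prompt tokens processed
total_out = num_prompts * out_len  # tokens generated
print(total_in, total_out)  # 1048576 1048576
# Output throughput from the run's wall time: total_out / seconds.
```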