vLLM/Recipes
NVIDIA

nvidia/NVIDIA-Nemotron-Nano-9B-v2

NVIDIA Nemotron-Nano 9B (Mamba-hybrid dense) reasoning + tool-use model with FP8 / NVFP4 / Japanese variants


Overview

Nemotron-Nano-9B-v2 is a 9B-parameter Mamba-hybrid dense reasoning model that fits on a single H100/H200/B200 GPU. Variants ship in BF16, FP8, and NVFP4; a Japanese-specialized fine-tune and a pre-RL Base checkpoint are also available.

Prerequisites

  • Hardware: 1x H100/H200/B200 (or comparable)
  • vLLM >= 0.10.1 (pip install -U vllm)
  • Docker with NVIDIA Container Toolkit (recommended)

Launch commands

Download the tool-call parser plugin, then launch the server:

wget https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2/resolve/main/nemotron_toolcall_parser_no_streaming.py

vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
  --trust-remote-code \
  --max-model-len 131072 \
  --tensor-parallel-size 1 \
  --enable-auto-tool-choice \
  --tool-parser-plugin nemotron_toolcall_parser_no_streaming.py \
  --tool-call-parser nemotron_json
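Once the server is up, a quick smoke test can be sent to the OpenAI-compatible endpoint. The sketch below uses only the Python standard library; the endpoint URL (localhost:8000) is vLLM's default, and the `get_weather` tool schema is a hypothetical example for illustration:

```python
# Minimal smoke test against the OpenAI-compatible chat endpoint started above.
# The get_weather tool is a hypothetical example; swap in your own schema.
import json
import urllib.request

payload = {
    "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}

try:
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",  # vLLM's default address
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        print(json.loads(resp.read())["choices"][0]["message"])
except OSError:
    # No server reachable; just show the request body that would be sent.
    print(json.dumps(payload, indent=2))
```

With auto tool choice and the nemotron_json parser enabled, the response message should carry a structured `tool_calls` entry when the model decides to invoke the tool.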

FP8:

vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 1

NVFP4 (Blackwell only):

vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 1

Benchmarking

vllm bench serve \
  --model nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
  --trust-remote-code \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 \
  --num-warmups 20 \
  --ignore-eos \
  --max-concurrency 256 \
  --num-prompts 1024
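The run above sends 1024 random prompts of ~1024 input tokens each and, with --ignore-eos, forces exactly 1024 output tokens per request. A back-of-the-envelope check of the total token volume:

```python
# Token volume implied by the benchmark flags above.
num_prompts = 1024   # --num-prompts
input_len = 1024     # --random-input-len
output_len = 1024    # --random-output-len (exact, because of --ignore-eos)

total_input = num_prompts * input_len    # 1,048,576 prompt tokens
total_output = num_prompts * output_len  # 1,048,576 generated tokens
total = total_input + total_output

print(f"total tokens processed: {total:,}")  # → total tokens processed: 2,097,152
```

Dividing total_output by the wall-clock time of the run gives the aggregate output throughput in tokens/s, which is what the benchmark's summary reports at a concurrency of 256.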
