
nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16

NVIDIA Nemotron-3-Nano 4B (Mamba-hybrid dense) — compact reasoning + tool-use model with BF16 and FP8 variants

Dense · 4B parameters · 262,144-token context · vLLM 0.11.2+ · text

Overview

Nemotron-3-Nano-4B is the smallest member of the Nemotron-3 hybrid-Mamba family. It is tuned for low-latency reasoning, tool calling, and edge deployment; DGX Spark and Jetson Thor are supported alongside standard Hopper and Blackwell servers.

Prerequisites

  • Hardware: 1x H100/H200/B200, DGX Spark, or Jetson Thor
  • vLLM >= 0.11.2 (0.12.0 recommended)
  • Docker with NVIDIA Container Toolkit (recommended)

Pull Docker Image

docker pull --platform linux/amd64 vllm/vllm-openai:v0.12.0

Jetson Thor:

docker pull ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor
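
The image can also be run directly as an OpenAI-compatible server instead of invoking vllm serve on the host. The command below is a minimal sketch following standard vllm/vllm-openai usage; the port mapping, Hugging Face cache mount, and HF_TOKEN pass-through are assumptions to adapt to your setup. If you use the BF16 reasoning-parser flags from the next section, mount the downloaded nano_v3_reasoning_parser.py into the container as well.

# Serve the BF16 model from the container (adjust mounts and ports as needed)
docker run --rm --gpus all --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN \
  vllm/vllm-openai:v0.12.0 \
  --model nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
  --trust-remote-code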

Launch Commands

BF16 (download the reasoning-parser plugin first; the serve command loads it via --reasoning-parser-plugin):

wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16/resolve/main/nano_v3_reasoning_parser.py

vllm serve nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
  --trust-remote-code \
  --async-scheduling \
  --max-model-len 262144 \
  --tensor-parallel-size 1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3
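
With the BF16 server up, tool calling and the reasoning parser can be exercised through the OpenAI-compatible API. The request below is an illustrative sketch: get_weather is a hypothetical tool definition and the default port 8000 is assumed. With the nano_v3 parser active, the model's thinking is returned in the message's reasoning_content field, separate from content and any tool_calls.

# get_weather is a made-up example tool; expect a tool_calls entry in the response
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
    "messages": [{"role": "user", "content": "What is the weather in Paris right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'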

FP8:

vllm serve nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8 \
  --trust-remote-code \
  --async-scheduling \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 1
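
A quick smoke test works for either variant once the server reports ready (assuming the default port 8000):

# List the served model, then request a short completion
curl -s http://localhost:8000/v1/models

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8",
    "messages": [{"role": "user", "content": "In one sentence, what is a hybrid Mamba-Transformer model?"}],
    "max_tokens": 128
  }'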

Benchmarking

vllm bench serve \
  --model nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8 \
  --trust-remote-code \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 \
  --num-warmups 20 \
  --ignore-eos \
  --max-concurrency 256 \
  --num-prompts 1024
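
To map the latency/throughput tradeoff rather than a single operating point, the same benchmark can be swept over concurrency levels. This is a sketch: the concurrency values and result filenames are arbitrary choices, and --save-result / --result-filename are standard vllm bench serve options in recent releases (drop them if your build lacks them).

# Sweep concurrency; each run writes its metrics to a JSON file
for conc in 1 8 32 128 256; do
  vllm bench serve \
    --model nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8 \
    --trust-remote-code \
    --dataset-name random \
    --random-input-len 1024 --random-output-len 1024 \
    --ignore-eos \
    --max-concurrency "$conc" \
    --num-prompts $((conc * 4)) \
    --save-result \
    --result-filename "nemotron3_nano4b_fp8_c${conc}.json"
done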
