nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16
NVIDIA Nemotron-3-Nano 4B (Mamba-hybrid dense) — compact reasoning + tool-use model with BF16 and FP8 variants
Overview
Nemotron-3-Nano-4B is the smallest member of the Nemotron-3 hybrid-Mamba family. It is tuned for low-latency reasoning, tool calling, and edge deployment: DGX Spark and Jetson Thor are supported alongside standard Hopper/Blackwell servers.
Prerequisites
- Hardware: 1x H100/H200/B200, DGX Spark, or Jetson Thor
- vLLM >= 0.11.2 (0.12.0 recommended)
- Docker with NVIDIA Container Toolkit (recommended)
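If you are installing vLLM directly instead of using Docker, a minimal sketch for getting a compatible build (adjust for your Python environment; the version check is only a sanity test):
pip install -U "vllm>=0.11.2"
python -c "import vllm; print(vllm.__version__)"  # expect 0.11.2 or newer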
Pull Docker Image
docker pull --platform linux/amd64 vllm/vllm-openai:v0.12.0
Jetson Thor:
docker pull ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor
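The launch commands below assume vllm is available on the PATH. With the image above, one common pattern is to start the container with GPU access and a mounted Hugging Face cache, then run vllm serve inside it. This is a sketch; the port and cache path are example values:
docker run --rm -it --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --entrypoint bash vllm/vllm-openai:v0.12.0
# then run one of the vllm serve commands below inside the container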
Launch Commands
BF16 (first download the custom reasoning-parser plugin that the serve command loads):
wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16/resolve/main/nano_v3_reasoning_parser.py
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
--trust-remote-code \
--async-scheduling \
--max-model-len 262144 \
--tensor-parallel-size 1 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3
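Once the server is up, it exposes the OpenAI-compatible API (port 8000 by default). A minimal smoke test that exercises the tool-calling path enabled above; the get_weather tool here is a hypothetical example, not something shipped with the model:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'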
FP8:
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8 \
--trust-remote-code \
--async-scheduling \
--kv-cache-dtype fp8 \
--tensor-parallel-size 1
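The FP8 variant serves the same OpenAI-compatible endpoints. A quick check that the expected model is loaded (port 8000 assumed):
curl http://localhost:8000/v1/models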
Benchmarking
vllm bench serve \
--model nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8 \
--trust-remote-code \
--dataset-name random \
--random-input-len 1024 --random-output-len 1024 \
--num-warmups 20 \
--ignore-eos \
--max-concurrency 256 \
--num-prompts 1024
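To see how throughput and latency trade off at this input/output shape, one option is to sweep the concurrency cap while keeping the other flags fixed. A sketch reusing the command above:
for c in 8 32 128 256; do
  vllm bench serve \
    --model nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8 \
    --trust-remote-code \
    --dataset-name random \
    --random-input-len 1024 --random-output-len 1024 \
    --num-warmups 20 \
    --ignore-eos \
    --max-concurrency "$c" \
    --num-prompts 1024
done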