nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
NVIDIA Nemotron 3 Ultra hybrid Transformer-Mamba MoE model for long-context agentic reasoning, coding, and tool use.
550B total / 55B active parameters with BF16 and NVFP4 serving paths
Guide
Overview
NVIDIA Nemotron 3 Ultra is a 550B total / 55B active hybrid Transformer-Mamba MoE model built for long-running agentic reasoning, coding, research, and tool use.
Launch command
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
vllm serve nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
--served-model-name nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B \
--trust-remote-code \
--tensor-parallel-size 8 \
--kv-cache-dtype fp8 \
--max-num-seqs 16 \
--max-model-len 262144 \
--gpu-memory-utilization 0.90 \
--max-num-batched-tokens 32768 \
--enable-flashinfer-autotune \
--async-scheduling \
--speculative_config.method mtp \
--speculative_config.num_speculative_tokens 5 \
--mamba-backend triton \
--mamba-ssm-cache-dtype float32 \
--reasoning-parser nemotron_v3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder