vLLM/Recipes
NVIDIA

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16

NVIDIA Nemotron 3 Ultra hybrid Transformer-Mamba MoE model for long-context agentic reasoning, coding, and tool use.

550B total / 55B active parameters with BF16 and NVFP4 serving paths

moe550B / 55B262,144 ctxvLLM 0.22.0+text
Guide

Overview

NVIDIA Nemotron 3 Ultra is a 550B total / 55B active hybrid Transformer-Mamba MoE model built for long-running agentic reasoning, coding, research, and tool use.

Launch command

export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1

vllm serve nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
  --served-model-name nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8 \
  --max-num-seqs 16 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --max-num-batched-tokens 32768 \
  --enable-flashinfer-autotune \
  --async-scheduling \
  --speculative_config.method mtp \
  --speculative_config.num_speculative_tokens 5 \
  --mamba-backend triton \
  --mamba-ssm-cache-dtype float32 \
  --reasoning-parser nemotron_v3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

References