
nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Mamba2-Transformer hybrid MoE omnimodal model (31B total / 3B active) with unified video, audio, image, and text understanding; reasoning + tool calling; BF16, FP8, and NVFP4 variants

MoE · 31B total / 3B active · 262,144 ctx · vLLM 0.20.0+ · multimodal · text

Overview

NVIDIA Nemotron-3-Nano-Omni-30B-A3B-Reasoning is a Mamba2-Transformer hybrid MoE omnimodal model (31B total / 3B active) that unifies video, audio, image, and text understanding. It is built on the Nemotron-3-Nano-30B-A3B LLM backbone with a CRADIO v4-H vision encoder and a Parakeet speech encoder, and ships in BF16, ModelOpt FP8, and ModelOpt NVFP4 variants.

Capabilities:

  • Video (mp4, up to 2 minutes, sampled at 1–2 FPS / 128–256 frames)
  • Audio (wav, mp3, up to 1 hour, ≥8 kHz)
  • Image (jpeg, png)
  • Text (English, up to 256K context)
  • Reasoning with chain-of-thought (<think> tags)
  • Tool calling
  • Word-level timestamps for transcription

Prerequisites

  • vLLM 0.20.0 (pinned: pip install vllm[audio]==0.20.0, or pull vllm/vllm-openai:v0.20.0)
  • Hardware: 1× B200 / H200 / H100 (single-GPU TP1 is the documented profile)
  • The audio extra is required for any audio input, including use_audio_in_video=true
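
A quick sanity check before serving, assuming a pip-based install and that the [audio] extra pulls in librosa/soundfile for audio decoding:

from importlib.metadata import version
# Confirm the pinned vLLM build is installed.
print("vllm", version("vllm"))
# Importing the audio decoders verifies the [audio] extra is present
# before sending audio or use_audio_in_video requests.
import librosa, soundfile  # noqa: F401
print("audio dependencies available")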

Pull the Docker image

# CUDA 13:
docker pull vllm/vllm-openai:v0.20.0
# CUDA 12.9:
docker pull vllm/vllm-openai:v0.20.0-cu129

Launch command

General single-GPU invocation (B200 / H200 / H100):

vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 \
  --served-model-name nemotron \
  --host 0.0.0.0 \
  --port 5000 \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --trust-remote-code \
  --video-pruning-rate 0.5 \
  --media-io-kwargs '{"video": {"num_frames": 512, "fps": 1}}' \
  --reasoning-parser nemotron_v3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Swap the model id for the FP8 or NVFP4 checkpoint and add --kv-cache-dtype fp8:

vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --kv-cache-dtype fp8 \
  ...

Platform-specific notes

  • RTX Pro: append --moe-backend triton (works around a FlashInfer issue on RTX Pro).
  • NVFP4 + TP>1: append --moe-backend flashinfer_cutlass (works around a TRTLLM_GEN MoE kernel issue at TP>1 with NVFP4).
  • DGX Spark (aarch64): unified LPDDR5X memory; if you hit OOM, lower --gpu-memory-utilization to 0.70 and reduce --max-model-len (e.g. 32768). Also use --max-num-seqs 8.

Sampling parameters

Mode      temperature  top_p  top_k  max_tokens  reasoning_budget
Thinking  0.6          0.95   –      20480       16384
Instruct  0.2          1      –      1024        –

Toggle thinking mode via chat_template_kwargs={"enable_thinking": true} (default) or false to disable.
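
A minimal client-side sketch of the Instruct-mode settings above with thinking disabled, assuming the server launched earlier (port 5000, served model name nemotron):

from openai import OpenAI
client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="nemotron",
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
    # Instruct-mode sampling from the table above.
    temperature=0.2,
    top_p=1,
    max_tokens=1024,
    # Disable chain-of-thought output.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)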

Client usage

from openai import OpenAI
client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")  # match --port 5000 from the launch command
resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
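
The same endpoint accepts multimodal inputs. A sketch with placeholder media URLs, assuming the server from the launch command above and that video and audio are passed as vLLM's video_url / audio_url content parts:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")
# Video understanding (frame sampling follows --media-io-kwargs from the launch command).
video = client.chat.completions.create(
    model="nemotron",
    messages=[{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
            {"type": "text", "text": "Summarize what happens in this clip."},
        ],
    }],
)
print(video.choices[0].message.content)
# Audio transcription, asking for the word-level timestamps the model supports.
audio = client.chat.completions.create(
    model="nemotron",
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": "https://example.com/speech.wav"}},
            {"type": "text", "text": "Transcribe this audio with word-level timestamps."},
        ],
    }],
)
print(audio.choices[0].message.content)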

Benchmarking

vllm bench serve \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --trust-remote-code \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 \
  --num-warmups 20 --ignore-eos \
  --max-concurrency 1024 --num-prompts 2048
