nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
Mamba2-Transformer hybrid MoE omnimodal model (31B total / 3B active) with unified video, audio, image, and text understanding; reasoning + tool calling; BF16, FP8, and NVFP4 variants
Overview
NVIDIA Nemotron-3-Nano-Omni-30B-A3B-Reasoning is a Mamba2-Transformer hybrid MoE omnimodal model (31B total / 3B active) that unifies video, audio, image, and text understanding. It is built on the Nemotron-3-Nano-30B-A3B LLM backbone with a CRADIO v4-H vision encoder and a Parakeet speech encoder, and ships in BF16, ModelOpt FP8, and ModelOpt NVFP4 variants.
Capabilities:
- Video (mp4, up to 2 minutes, sampled at 1–2 FPS / 128–256 frames)
- Audio (wav, mp3, up to 1 hour, ≥8 kHz)
- Image (jpeg, png)
- Text (English, up to 256K context)
- Reasoning with chain-of-thought (`<think>` tags)
- Tool calling
- Word-level timestamps for transcription
Prerequisites
- vLLM 0.20.0 (pinned: `pip install vllm[audio]==0.20.0`, or pull `vllm/vllm-openai:v0.20.0`)
- Hardware: 1× B200 / H200 / H100 (single-GPU TP1 is the documented profile)
- The `audio` extra is required for any audio input, including `use_audio_in_video=true`
Pull the Docker image
```bash
# CUDA 13:
docker pull vllm/vllm-openai:v0.20.0
# CUDA 12.9:
docker pull vllm/vllm-openai:v0.20.0-cu129
```
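To serve from the container instead of a local install, pass the serve flags after the image name. This is a sketch, assuming the standard `vllm-openai` image entrypoint (which launches the OpenAI-compatible server); adjust the cache mount and ports to your setup:

```bash
docker run --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 5000:5000 \
  vllm/vllm-openai:v0.20.0 \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 \
  --served-model-name nemotron \
  --port 5000 \
  --trust-remote-code
```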
Launch command
General single-GPU invocation (B200 / H200 / H100):
```bash
vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 \
  --served-model-name nemotron \
  --host 0.0.0.0 \
  --port 5000 \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --trust-remote-code \
  --video-pruning-rate 0.5 \
  --media-io-kwargs '{"video": {"num_frames": 512, "fps": 1}}' \
  --reasoning-parser nemotron_v3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
Swap the model id for the FP8 or NVFP4 checkpoint and add `--kv-cache-dtype fp8`:

```bash
vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 \
  --kv-cache-dtype fp8 \
  ...
```
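Once the server finishes loading, a quick sanity check against the launch command above (port 5000) lists the served model name (`nemotron`):

```bash
curl http://localhost:5000/v1/models
```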
Platform-specific notes
- RTX Pro: append `--moe-backend triton` (works around a FlashInfer bug on RTX Pro).
- NVFP4 + TP>1: append `--moe-backend flashinfer_cutlass` (works around a TRTLLM_GEN MoE kernel bug at TP>1 on NVFP4).
- DGX Spark (aarch64): unified LPDDR5X memory; lower `--gpu-memory-utilization` to 0.70 and reduce `--max-model-len` (e.g. 32768) if you hit OOM. Use `--max-num-seqs 8`; see the sketch below.
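Putting the DGX Spark adjustments together, a minimal launch sketch (the general invocation above with the memory-related overrides applied):

```bash
vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 \
  --served-model-name nemotron \
  --port 5000 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.70 \
  --max-num-seqs 8 \
  --trust-remote-code \
  --reasoning-parser nemotron_v3
```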
Recommended sampling
| Mode | temperature | top_p | top_k | max_tokens | reasoning_budget |
|---|---|---|---|---|---|
| Thinking | 0.6 | 0.95 | — | 20480 | 16384 |
| Instruct | 0.2 | — | 1 | 1024 | — |
Toggle thinking mode via `chat_template_kwargs={"enable_thinking": true}` (the default) or `false` to disable it.
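For example, a thinking-mode request with the recommended sampling, sent over raw HTTP (vLLM accepts `chat_template_kwargs` in the request body; this assumes the launch command above, i.e. served name `nemotron` on port 5000):

```bash
curl http://localhost:5000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "nemotron",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": 20480,
        "chat_template_kwargs": {"enable_thinking": true}
      }'
```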
Client usage
```python
from openai import OpenAI

# Matches the launch command above: port 5000, served model name "nemotron".
client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="nemotron",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(resp.choices[0].message.content)
```
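For media inputs, a sketch of an image request using the standard OpenAI content-parts format (vLLM also accepts `audio_url` and `video_url` parts as extensions); the media URL here is a placeholder:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="nemotron",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            # Placeholder URL; local files can be sent as base64 data: URLs.
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
        ],
    }],
    # Recommended thinking-mode sampling from the table above.
    temperature=0.6,
    top_p=0.95,
    max_tokens=20480,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(resp.choices[0].message.content)
```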
Benchmarking
```bash
vllm bench serve \
  --model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 \
  --trust-remote-code \
  --dataset-name random \
  --random-input-len 1024 --random-output-len 1024 \
  --num-warmups 20 --ignore-eos \
  --max-concurrency 1024 --num-prompts 2048
```