nvidia/Cosmos3-Nano
Compact 16B omnimodal world model (Mixture-of-Transformers) for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI
16B omnimodal world model — single-GPU H200 video/audio generation
Guide
Overview
Cosmos3-Nano is the compact (16B) member of NVIDIA's Cosmos3 family of omnimodal world models. Built on a Mixture-of-Transformers (MoT) architecture — an autoregressive tower for discrete-token generation paired with a diffusion tower for continuous modalities — it generates coherent text, images, video, audio, and action commands from combinations of text, image, video, and action-trajectory inputs. It targets Physical AI: world understanding, world simulation, future prediction, and action reasoning.
This recipe covers the vLLM-Omni serving path (video + muxed-audio generation). The same checkpoint also serves a text/multimodal Reasoner endpoint via standard vLLM — see Reasoner mode below.
Prerequisites
- 1× H200 (or H100 / A100) for the documented single-GPU profile
- The release-tested
vllm/vllm-omni:cosmos3container, or vLLM-Omni installed on top ofvllm==0.21.0 - Prompts should be JSON-upsampled for best quality (see cosmos-framework prompt upsampling)
Launch command (vLLM-Omni)
docker pull vllm/vllm-omni:cosmos3
vllm serve nvidia/Cosmos3-Nano \
--omni \
--host 0.0.0.0 \
--port 8000 \
--init-timeout 1800
To speed up inference with additional GPUs, enable context parallelism with
--ulysses-degree or switch to tensor parallelism with
--tensor-parallel-size. On memory-constrained GPUs, --enable-layerwise-offload
reduces VRAM usage at a performance cost.
Example: text-to-video
curl -X POST http://localhost:8000/v1/videos/sync \
-H "Accept: video/mp4" \
-F "prompt=$(cat assets/example_t2v_prompt.json)" \
-F "negative_prompt=$(cat assets/negative_prompt.json)" \
-F "size=1280x720" \
-F "num_frames=189" \
-F "fps=24" \
-F "num_inference_steps=35" \
-F "guidance_scale=6.0" \
-F "max_sequence_length=4096" \
-F "flow_shift=10.0" \
-F "seed=123" \
--output output.mp4
Image-to-video uses the same endpoint with an input_reference file part.
Audio is generated jointly and muxed into the output MP4 (48 kHz stereo AAC) —
there is no separate audio endpoint.
Reasoner mode (multimodal understanding)
The same checkpoint serves a text/multimodal Reasoner for world
understanding, future prediction, and action reasoning via the standard vllm
package (not vLLM-Omni):
# CUDA 13 drivers:
uv pip install --torch-backend=cu130 "vllm==0.21.0" \
"vllm-cosmos3 @ git+https://github.com/NVIDIA/cosmos-framework.git#subdirectory=packages/vllm-cosmos3" \
openai
# CUDA 12.8 drivers: use --torch-backend=cu128 "vllm==0.19.1"
CUDA_VISIBLE_DEVICES=0 \
vllm serve nvidia/Cosmos3-Nano \
--hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}' \
--tensor-parallel-size 1 \
--mm-encoder-tp-mode data \
--async-scheduling \
--allowed-local-media-path / \
--media-io-kwargs '{"video": {"num_frames": -1}}' \
--port 8000
Then query the OpenAI-compatible /v1/chat/completions endpoint with text +
image/video content.