nvidia/Cosmos3-Super-Text2Image

64B Cosmos3-Super specialization for high-fidelity text-to-image generation

High-fidelity text-to-image on 8×H100 with CFG-parallel, Ulysses, and HSDP


dense · 64B · 262,144 ctx · vLLM 0.21.0+ · vLLM-Omni · nightly · omni

Guide

Overview

Cosmos3-Super-Text2Image is a 64B specialization of NVIDIA's Cosmos3-Super omnimodal world model, tuned for high-fidelity text-to-image generation. It is served via vLLM-Omni.

For best quality, text prompts should be JSON-upsampled — the model repo ships an agentic_upsampling/ package and an AGENTIC_UPSAMPLING.md guide.

Prerequisites

8× H100 for the documented full-node profile (4×H200 / 4×GB200 alternative below)
The release-tested vllm/vllm-omni:cosmos3 container, or vLLM-Omni installed on top of vllm==0.21.0

Launch command (vLLM-Omni)

Recommended configuration on an 8×H100 node:

docker pull vllm/vllm-omni:cosmos3

vllm serve nvidia/Cosmos3-Super-Text2Image \
  --omni \
  --host 0.0.0.0 \
  --port 8000 \
  --cfg-parallel-size 2 \
  --ulysses-degree 4 \
  --tensor-parallel-size 1 \
  --use-hsdp \
  --hsdp-shard-size 8 \
  --init-timeout 1800

For 4×H200 or 4×GB200, use --cfg-parallel-size 2 --ulysses-degree 2 --tensor-parallel-size 1. --enable-layerwise-offload can reduce VRAM usage on smaller GPUs, but for text-to-image it incurs a significant performance penalty.
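
Both documented profiles are consistent with a simple rule of thumb: the product of the CFG-parallel, Ulysses, and tensor-parallel degrees equals the number of GPUs (2 × 4 × 1 = 8 and 2 × 2 × 1 = 4). The sketch below encodes that rule as a pre-launch sanity check. Treat the rule as an assumption inferred from these two profiles, not as a description of how vLLM-Omni validates its arguments; the function and dictionary names are illustrative, and --hsdp-shard-size is left out because the guide only gives a value for the 8-GPU profile.

```python
def required_gpus(cfg_parallel_size: int, ulysses_degree: int,
                  tensor_parallel_size: int) -> int:
    """Return the product of the three parallel degrees.

    Assumption: this product should match the GPU count on the node.
    Both profiles in this guide satisfy it, but it is not a documented rule.
    """
    degrees = {
        "cfg_parallel_size": cfg_parallel_size,
        "ulysses_degree": ulysses_degree,
        "tensor_parallel_size": tensor_parallel_size,
    }
    for name, value in degrees.items():
        if value < 1:
            raise ValueError(f"{name} must be >= 1, got {value}")
    return cfg_parallel_size * ulysses_degree * tensor_parallel_size


# The two profiles given in this guide (flag values copied from the text).
PROFILES = {
    "8xH100": {"gpus": 8, "cfg_parallel_size": 2,
               "ulysses_degree": 4, "tensor_parallel_size": 1},
    "4xH200 or 4xGB200": {"gpus": 4, "cfg_parallel_size": 2,
                          "ulysses_degree": 2, "tensor_parallel_size": 1},
}


def profile_is_consistent(profile: dict) -> bool:
    """True if the degrees in `profile` multiply out to its GPU count."""
    degrees = {k: v for k, v in profile.items() if k != "gpus"}
    return required_gpus(**degrees) == profile["gpus"]
```

Calling profile_is_consistent on a modified profile before launching can catch a mismatched flag combination early, rather than after a long model-initialization wait.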

Example: text-to-image

curl -X POST http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a dragon flying over the Green Mountains at sunset",
    "size": "1024x1024",
    "seed": 42
  }' | jq -r '.data[0].b64_json' | base64 -d > output.png
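
The same request can be sketched in Python using only the standard library. The endpoint, the request fields (prompt, size, seed), and the data[0].b64_json response path are taken from the curl example above; the function names, the default base URL, and the 600-second client timeout are illustrative assumptions rather than values from the model documentation.

```python
import base64
import json
import urllib.request


def build_payload(prompt: str, size: str = "1024x1024", seed: int = 42) -> dict:
    """Build the JSON body, using the same fields as the curl example."""
    return {"prompt": prompt, "size": size, "seed": seed}


def extract_image_bytes(response: dict) -> bytes:
    """Decode the first image, mirroring `jq -r '.data[0].b64_json' | base64 -d`."""
    return base64.b64decode(response["data"][0]["b64_json"])


def generate_image(prompt: str, out_path: str = "output.png",
                   base_url: str = "http://localhost:8000",
                   timeout: float = 600.0) -> str:
    """POST a prompt to the server and write the decoded image to out_path.

    Assumption: `timeout` is an arbitrary client-side limit; tune it to the
    generation latency you actually observe.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/images/generations",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    with open(out_path, "wb") as f:
        f.write(extract_image_bytes(body))
    return out_path
```

With the server running, generate_image("a dragon flying over the Green Mountains at sunset") should write output.png in the current directory. Keeping payload construction and decoding in separate functions makes both easy to test without a live server.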
