nvidia/Cosmos3-Super-Text2Image
64B Cosmos3-Super specialization for high-fidelity text-to-image generation
High-fidelity text-to-image — 8×H100 with CFG-parallel + ulysses + HSDP
Guide
Overview
Cosmos3-Super-Text2Image is a 64B specialization of NVIDIA's Cosmos3-Super omnimodal world model, tuned for high-fidelity text-to-image generation. It is served via vLLM-Omni.
For best quality, JSON-upsample text prompts before sending them. The model repo ships
an agentic_upsampling/ package and an AGENTIC_UPSAMPLING.md guide for this.
Prerequisites
- 8× H100 for the documented full-node profile (4×H200 / 4×GB200 alternative below)
- The release-tested vllm/vllm-omni:cosmos3 container, or vLLM-Omni installed on top of vllm==0.21.0
Launch command (vLLM-Omni)
Recommended configuration on an 8×H100 node:
docker pull vllm/vllm-omni:cosmos3
vllm serve nvidia/Cosmos3-Super-Text2Image \
--omni \
--host 0.0.0.0 \
--port 8000 \
--cfg-parallel-size 2 \
--ulysses-degree 4 \
--tensor-parallel-size 1 \
--use-hsdp \
--hsdp-shard-size 8 \
--init-timeout 1800
For 4×H200 or 4×GB200, use --cfg-parallel-size 2 --ulysses-degree 2 --tensor-parallel-size 1.
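Both documented profiles are consistent with the parallel degrees multiplying out to the GPU count: 2 × 4 × 1 = 8 on the H100 node and 2 × 2 × 1 = 4 on H200 / GB200. The Python sketch below is a quick sanity check to run before launching. It assumes this product rule holds for vLLM-Omni in general, which the guide does not state explicitly. gpus_required is a hypothetical helper, not part of vLLM-Omni.

```python
from math import prod


def gpus_required(cfg_parallel_size: int,
                  ulysses_degree: int,
                  tensor_parallel_size: int = 1) -> int:
    """Return the product of the parallel degrees.

    Assumption: the number of GPUs a launch occupies equals this product.
    That matches both profiles in this guide but is not confirmed here
    as a general rule.
    """
    degrees = (cfg_parallel_size, ulysses_degree, tensor_parallel_size)
    if any(d < 1 for d in degrees):
        raise ValueError("every parallel degree must be >= 1")
    return prod(degrees)


# The two profiles documented in this guide.
print(gpus_required(2, 4, 1))  # 8  (8×H100)
print(gpus_required(2, 2, 1))  # 4  (4×H200 or 4×GB200)
```

If the result does not match the number of GPUs visible on the node, the flags probably need adjusting before the server is started.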
--enable-layerwise-offload can reduce VRAM usage on smaller GPUs, but for
text-to-image it incurs a significant performance penalty.
Example: text-to-image
curl -X POST http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"prompt": "a dragon flying over the Green Mountains at sunset",
"size": "1024x1024",
"seed": 42
}' | jq -r '.data[0].b64_json' | base64 -d > output.png
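The same request can be sketched in Python with only the standard library. It mirrors the curl call above: POST a JSON body to /v1/images/generations, read data[0].b64_json from the response, and base64-decode it to a PNG. The function names and the 600-second timeout are illustrative assumptions, not part of vLLM-Omni. The response shape is inferred from the jq filter in the curl example.

```python
import base64
import json
import urllib.request


def build_payload(prompt: str, size: str = "1024x1024", seed=None) -> bytes:
    """Encode the request body with the same fields as the curl example."""
    body = {"prompt": prompt, "size": size}
    if seed is not None:
        body["seed"] = seed
    return json.dumps(body).encode("utf-8")


def decode_first_image(response_body) -> bytes:
    """Extract data[0].b64_json from the response and decode it to raw bytes."""
    parsed = json.loads(response_body)
    return base64.b64decode(parsed["data"][0]["b64_json"])


def generate_image(prompt: str, out_path: str,
                   base_url: str = "http://localhost:8000",
                   size: str = "1024x1024", seed=None,
                   timeout: float = 600.0) -> None:
    """POST the prompt to the server and write the decoded image to out_path.

    The timeout is an assumed, generous default; tune it for your hardware.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/images/generations",
        data=build_payload(prompt, size, seed),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        image_bytes = decode_first_image(response.read())
    with open(out_path, "wb") as f:
        f.write(image_bytes)
```

With the server running, generate_image("a dragon flying over the Green Mountains at sunset", "output.png", seed=42) would reproduce the curl request. Keeping payload construction and response decoding in separate functions makes both easy to check without a live server.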