zai-org/GLM-4.5V
GLM-4.5 vision-language MoE model (~107B parameters, BF16) with image-text-to-text capability, 64K context, expert parallelism, and native FP8
Overview
GLM-4.5V is the vision-language variant of GLM-4.5. It is an MoE model with ~107B total parameters that accepts image and text inputs. FP8 models have minimal accuracy loss versus BF16 and are recommended for cost-efficient serving. GLM-4.5V supports a 64K context length (use GLM-4.6V for 128K).
Prerequisites
- vLLM version: >= 0.12.0
- Hardware: 4x H100/H200 (BF16 or FP8)
Install vLLM
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
Launching the Server
Tensor Parallel + Expert Parallel (FP8 on 4 GPUs)
vllm serve zai-org/GLM-4.5V-FP8 \
--tensor-parallel-size 4 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--enable-expert-parallel \
--allowed-local-media-path / \
--mm-encoder-tp-mode data \
--mm-processor-cache-type shm
Tuning Tips
- --max-model-len=65536 is near the model's max context (64K).
- --max-num-batched-tokens=32768 helps prompt-heavy workloads.
- --gpu-memory-utilization=0.95 maximizes KV cache.
- --mm-encoder-tp-mode data runs the vision encoder data-parallel; preferable to TP since the encoder is small and TP adds communication overhead.
- --mm-processor-cache-type shm enables shared-memory caching for repeated image inputs.
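The tuning flags above can be combined with the base launch command from the previous section. A minimal sketch that assembles the full argument list (the values are the suggestions from this guide, not vLLM defaults):

```python
# Sketch: the base launch command plus the tuning flags discussed above.
# Flag values are this guide's suggestions, not vLLM defaults.
serve_cmd = [
    "vllm", "serve", "zai-org/GLM-4.5V-FP8",
    "--tensor-parallel-size", "4",
    "--enable-expert-parallel",
    "--tool-call-parser", "glm45",
    "--reasoning-parser", "glm45",
    "--enable-auto-tool-choice",
    "--allowed-local-media-path", "/",
    "--max-model-len", "65536",           # 64K, the model's maximum context
    "--max-num-batched-tokens", "32768",  # larger prefill batches for prompt-heavy loads
    "--gpu-memory-utilization", "0.95",   # leave most free VRAM to the KV cache
    "--mm-encoder-tp-mode", "data",       # data-parallel vision encoder
    "--mm-processor-cache-type", "shm",   # shared-memory cache for repeated images
]
print(" ".join(serve_cmd))
```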
Client Usage
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
model="zai-org/GLM-4.5V-FP8",
messages=[{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
{"type": "text", "text": "Describe the image."}
]
}],
max_tokens=512,
)
print(resp.choices[0].message.content)
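Since the server is launched with --allowed-local-media-path /, local files can also be passed via file:// URLs; alternatively, a base64 data URL works with any deployment. A small sketch of the data-URL approach (the helper name is illustrative):

```python
import base64
from pathlib import Path

def to_data_url(path: str, mime: str = "image/png") -> str:
    """Encode a local image as a base64 data URL for the image_url field."""
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Usage with the client above (illustrative):
# {"type": "image_url", "image_url": {"url": to_data_url("photo.png")}}
```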
Benchmarking
vllm bench serve \
--backend openai-chat \
--endpoint /v1/chat/completions \
--model zai-org/GLM-4.5V-FP8 \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat \
--num-prompts 1000 \
--request-rate 20
Troubleshooting
- Vision encoder overhead: use --mm-encoder-tp-mode data unless a TP-sharded encoder is known-good for your config.
- Context length errors: GLM-4.5V's max is 64K; use GLM-4.6V if you need 128K.
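Context-length errors on the client side can be avoided by capping max_tokens so prompt plus completion fits the 64K window. A sketch assuming you have a rough prompt token count (the helper name is illustrative):

```python
MAX_CONTEXT = 65536  # GLM-4.5V context limit (64K)

def clamp_max_tokens(prompt_tokens: int, requested: int, limit: int = MAX_CONTEXT) -> int:
    """Cap max_tokens so prompt + completion fits the model's context window."""
    budget = limit - prompt_tokens
    if budget <= 0:
        raise ValueError(
            f"prompt ({prompt_tokens} tokens) already exceeds the {limit}-token context"
        )
    return min(requested, budget)
```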