
zai-org/GLM-4.5V

GLM-4.5V, the vision-language MoE variant of GLM-4.5 (~107B parameters, BF16), with image-text-to-text capability, 64K context, expert parallelism, and native FP8 support

View on HuggingFace
MoE · 107B total / 12B active · 65,536 ctx · vLLM 0.12.0+ · multimodal

Overview

GLM-4.5V is the vision-language variant of GLM-4.5. It is an MoE model with ~107B total parameters that accepts image and text inputs. FP8 models have minimal accuracy loss versus BF16 and are recommended for cost-efficient serving. GLM-4.5V supports a 64K context length (use GLM-4.6V for 128K).

Prerequisites

  • vLLM version: >= 0.12.0
  • Hardware: 4x H100/H200 (BF16 or FP8)

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
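
Before launching, it can help to confirm the environment meets the vLLM >= 0.12.0 requirement. A minimal sketch (the `parse_version` helper is illustrative, not part of vLLM):

```python
# Sketch: check the installed vLLM against the >= 0.12.0 requirement.
from importlib.metadata import PackageNotFoundError, version

REQUIRED = (0, 12, 0)

def parse_version(v: str) -> tuple:
    """Turn a dotted version string into a comparable tuple of ints,
    ignoring non-numeric suffixes (e.g. '0.12.0rc1' -> (0, 12, 0))."""
    parts = []
    for piece in v.split(".")[:3]:
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

try:
    installed = version("vllm")
    status = "OK" if parse_version(installed) >= REQUIRED else "too old, need >= 0.12.0"
    print(f"vLLM {installed}: {status}")
except PackageNotFoundError:
    print("vLLM is not installed in this environment")
```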

Launching the Server

Tensor Parallel + Expert Parallel (FP8 on 4 GPUs)

vllm serve zai-org/GLM-4.5V-FP8 \
     --tensor-parallel-size 4 \
     --tool-call-parser glm45 \
     --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --enable-expert-parallel \
     --allowed-local-media-path / \
     --mm-encoder-tp-mode data \
     --mm-processor-cache-type shm

Tuning Tips

  • --max-model-len=65536 uses the model's full 64K context window.
  • --max-num-batched-tokens=32768 raises the per-step prefill budget, improving throughput for prompt-heavy workloads.
  • --gpu-memory-utilization=0.95 maximizes the memory available for KV cache.
  • --mm-encoder-tp-mode data runs the vision encoder data-parallel — preferable to TP since the encoder is small and TP adds communication overhead.
  • --mm-processor-cache-type shm enables shared-memory caching for repeated image inputs.
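
As a rough illustration of how these limits interact, the sketch below checks a request against the context window and estimates how many scheduler steps a long prompt's prefill spans under the batched-token budget. The constants mirror the example flags, and the chunk count is a back-of-envelope approximation, not vLLM's exact scheduling:

```python
# Illustrative arithmetic for the tuning flags above (values are assumptions
# mirroring the example launch command; adjust to your own configuration).
MAX_MODEL_LEN = 65536        # --max-model-len
MAX_BATCHED_TOKENS = 32768   # --max-num-batched-tokens

def fits(prompt_tokens: int, max_new_tokens: int) -> bool:
    """A single request must fit inside the model's context window."""
    return prompt_tokens + max_new_tokens <= MAX_MODEL_LEN

def prefill_chunks(prompt_tokens: int) -> int:
    """Roughly how many scheduler steps a long prompt's prefill spans
    under the per-step batched-token budget (ceil division)."""
    return -(-prompt_tokens // MAX_BATCHED_TOKENS)

print(fits(60_000, 4_000))      # True: 64K total just fits in 65,536
print(prefill_chunks(60_000))   # 2: prefill spans two 32,768-token steps
```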

Client Usage

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="zai-org/GLM-4.5V-FP8",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
            {"type": "text", "text": "Describe the image."}
        ]
    }],
    max_tokens=512,
)
print(resp.choices[0].message.content)
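
For local images, an alternative to serving with --allowed-local-media-path is to inline the file as a base64 data URL in the message payload. A minimal sketch (`image_message` is a hypothetical helper, and the image bytes below are a placeholder for real file contents):

```python
import base64

def image_message(image_bytes: bytes, prompt: str, mime: str = "image/png") -> dict:
    """Build one OpenAI-style user message carrying an inline image
    (as a base64 data URL) plus a text prompt."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
            {"type": "text", "text": prompt},
        ],
    }

# With real bytes (e.g. open("photo.png", "rb").read()), the message can be
# passed straight to client.chat.completions.create(messages=[msg], ...).
msg = image_message(b"placeholder-image-bytes", "Describe the image.")
print(msg["content"][0]["image_url"]["url"][:22])
```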

Benchmarking

vllm bench serve \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --model zai-org/GLM-4.5V-FP8 \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --num-prompts 1000 \
  --request-rate 20

Troubleshooting

  • Vision encoder overhead: use --mm-encoder-tp-mode data unless a TP-sharded encoder is known to work well for your configuration.
  • Context length errors: GLM-4.5V max is 64K; use GLM-4.6V if you need 128K.

References