zai-org/GLM-4.6V
GLM-4.6 vision-language MoE model — image-text-to-text with 128K context, native FP8 checkpoint, and expert parallelism
Overview
GLM-4.6V is Z-AI's updated vision-language MoE model. It extends the context length to 128K (vs. 64K for GLM-4.5V). The native FP8 checkpoint is recommended for cost-efficient serving and matches BF16 accuracy within a small margin.
Prerequisites
- vLLM version: >= 0.12.0
- Hardware: 4x H100/H200 (BF16 or FP8)
Install vLLM
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
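The server requires vLLM >= 0.12.0, so it is worth checking the installed version before launching. A minimal sketch of the comparison; for real code, `packaging.version.parse` handles pre-release tags more robustly:

```python
def meets_minimum(installed: str, required: str = "0.12.0") -> bool:
    """Return True if a dotted version string is at least the required version."""
    to_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return to_tuple(installed) >= to_tuple(required)

# Example: 0.12.1 satisfies the >= 0.12.0 floor; 0.11.9 does not.
print(meets_minimum("0.12.1"))  # True
print(meets_minimum("0.11.9"))  # False
```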
Launching the Server
Tensor Parallel + Expert Parallel (FP8 on 4 GPUs)
vllm serve zai-org/GLM-4.6V-FP8 \
--tensor-parallel-size 4 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--enable-expert-parallel \
--allowed-local-media-path / \
--mm-encoder-tp-mode data \
--mm-processor-cache-type shm
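Loading FP8 weights across 4 GPUs can take several minutes, so it helps to poll the server before sending traffic. A sketch assuming the default port 8000; `wait_for_server` is a hypothetical helper, not part of vLLM:

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str = "http://localhost:8000",
                    timeout_s: float = 600.0) -> bool:
    """Poll the OpenAI-compatible /v1/models endpoint until the server responds."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(2)  # server not up yet; retry
    return False
```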
Tuning Tips
- --max-model-len=65536 is a common default; you can push to 131072.
- --max-num-batched-tokens=32768 for prompt-heavy workloads.
- --mm-encoder-tp-mode data + --mm-processor-cache-type shm for efficient vision processing.
Client Usage
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
model="zai-org/GLM-4.6V-FP8",
messages=[{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
{"type": "text", "text": "Describe the image."}
]
}],
max_tokens=512,
)
print(resp.choices[0].message.content)
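Besides remote URLs (and local paths, which the --allowed-local-media-path flag above permits), an image can be inlined as a base64 data URL so the server never touches the filesystem. A sketch of building such a content part; the helper name and file path are illustrative:

```python
import base64

def image_part_from_bytes(data: bytes, mime: str = "image/png") -> dict:
    """Build an OpenAI-style image_url content part from raw image bytes."""
    b64 = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}

# Usage (path is illustrative):
# with open("photo.png", "rb") as f:
#     part = image_part_from_bytes(f.read())
part = image_part_from_bytes(b"\x89PNG\r\n\x1a\n")  # PNG magic bytes as a stand-in
print(part["image_url"]["url"][:30])
```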
Benchmarking
vllm bench serve \
--backend openai-chat \
--endpoint /v1/chat/completions \
--model zai-org/GLM-4.6V-FP8 \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat \
--num-prompts 1000 \
--request-rate 20
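For sizing this run: at --request-rate 20, submitting 1000 prompts spans about 50 seconds, and steady-state concurrency is roughly arrival rate times mean request latency (Little's law). A quick estimate, with a 4-second mean latency assumed purely for illustration:

```python
num_prompts = 1000
request_rate = 20.0    # requests per second
mean_latency_s = 4.0   # assumed for illustration; measure on your hardware

submission_window_s = num_prompts / request_rate
steady_state_concurrency = request_rate * mean_latency_s  # Little's law: L = λ·W

print(submission_window_s)       # 50.0
print(steady_state_concurrency)  # 80.0
```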
Troubleshooting
- Vision encoder overhead: Prefer --mm-encoder-tp-mode data over TP for the encoder.
- Long-context memory: At 128K context, tune --max-num-batched-tokens and --gpu-memory-utilization to prevent OOM.