
zai-org/GLM-4.6V

GLM-4.6 vision-language MoE model — image-text-to-text with 128K context, native FP8 checkpoint, and expert parallelism

MoE · 107B total / 12B active parameters · 131,072-token context · vLLM 0.12.0+ · multimodal

Overview

GLM-4.6V is an updated vision-language MoE model from Z-AI with a 128K context window (up from 64K in GLM-4.5V). The native FP8 checkpoint is recommended for cost-efficient serving; its accuracy matches BF16 to within a small margin.

Prerequisites

  • vLLM version: >= 0.12.0
  • Hardware: 4x H100/H200 (BF16 or FP8)

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto
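This recipe requires vLLM >= 0.12.0. A minimal sanity check of the active environment; the `parse` helper below is illustrative, not part of vLLM's API:

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_VLLM = (0, 12, 0)

def parse(v: str) -> tuple:
    # Keep only the leading numeric release segment, e.g. "0.12.0rc1" -> (0, 12, 0)
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", v)
    return tuple(int(x) for x in m.groups())

def vllm_meets_minimum() -> bool:
    try:
        return parse(version("vllm")) >= MIN_VLLM
    except PackageNotFoundError:
        # vLLM is not installed in this environment
        return False
```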

Launching the Server

Tensor Parallel + Expert Parallel (FP8 on 4 GPUs)

vllm serve zai-org/GLM-4.6V-FP8 \
  --tensor-parallel-size 4 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --enable-expert-parallel \
  --allowed-local-media-path / \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm

Tuning Tips

  • --max-model-len=65536 is a common default; raise it toward 131072 (the full context) if memory allows.
  • --max-num-batched-tokens=32768 helps prompt-heavy workloads.
  • --mm-encoder-tp-mode data plus --mm-processor-cache-type shm keep vision preprocessing efficient.
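When pushing toward the full 131,072-token context, it helps to estimate the per-sequence KV-cache footprint before tuning --gpu-memory-utilization. A back-of-the-envelope sketch; the layer/head/dim values below are illustrative placeholders, not GLM-4.6V's actual architecture (read the real values from the model's config.json):

```python
def kv_cache_gib(seq_len: int, num_layers: int, num_kv_heads: int,
                 head_dim: int, dtype_bytes: int) -> float:
    """Approximate per-sequence KV-cache size: 2 tensors (K and V) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len / 2**30

# Placeholder dims: 48 layers, 8 KV heads, head_dim 128, BF16 (2 bytes).
print(round(kv_cache_gib(131072, 48, 8, 128, 2), 1))  # -> 24.0 GiB per full-length sequence
```

With per-sequence costs in the tens of GiB at full context, lowering --max-num-batched-tokens or capping --max-model-len is often the difference between stable serving and OOM.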

Client Usage

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="zai-org/GLM-4.6V-FP8",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
            {"type": "text", "text": "Describe the image."}
        ]
    }],
    max_tokens=512,
)
print(resp.choices[0].message.content)
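Because the server is OpenAI-compatible, local images can also be sent inline as base64 data URLs in the image_url field (alternatively, with --allowed-local-media-path set as above, file:// URLs under that path work). A small helper sketch:

```python
import base64
import mimetypes
from pathlib import Path

def to_data_url(path: str) -> str:
    """Inline a local image as a base64 data URL usable in an image_url field."""
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    b64 = base64.b64encode(Path(path).read_bytes()).decode()
    return f"data:{mime};base64,{b64}"
```

Pass the result as `{"type": "image_url", "image_url": {"url": to_data_url("photo.png")}}` in place of the remote URL in the example above.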

Benchmarking

vllm bench serve \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --model zai-org/GLM-4.6V-FP8 \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --num-prompts 1000 \
  --request-rate 20
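Note that --request-rate bounds request arrival, not completion: at 20 requests/s, the 1000 prompts are injected over roughly 50 seconds, so total wall time is that arrival window plus the tail of in-flight requests. The arithmetic:

```python
num_prompts = 1000   # values taken from the bench command above
request_rate = 20.0  # requests per second

arrival_window_s = num_prompts / request_rate
print(arrival_window_s)  # -> 50.0, a lower bound on benchmark wall time
```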

Troubleshooting

  • Vision encoder overhead: Prefer --mm-encoder-tp-mode data over TP for the encoder.
  • Long-context memory: At 128K context, tune --max-num-batched-tokens and --gpu-memory-utilization to prevent OOM.
