zai-org/GLM-5.2

GLM-5.2 — frontier-scale MoE language model (~743B total parameters, 39B active) with up to 5-token MTP speculative decoding and thinking mode

Latest GLM-5 series MoE with extended MTP (5 draft tokens), improved reasoning and agentic performance

MoE · 743B total / 39B active · 1,048,576-token context · vLLM 0.23.0+ · text

Overview

GLM-5.2 is the newest model in the GLM-5 series — a ~743B-parameter MoE (39B active) from Z-AI. The headline change over GLM-5 / 5.1 is that Multi-Token Prediction (MTP) is extended from 3 to 5 draft tokens, lifting end-to-end throughput on reasoning, coding, and agentic workloads. It ships as BF16 and native-FP8 checkpoints and keeps the GLM thinking-mode behavior.

This recipe targets the FP8 checkpoint, the practical default: it fits on a single 8xH200 / 8xH20 node and — with FP8 KV cache — reaches the full 1M-token context on 8xB200.

Prerequisites

  • vLLM 0.23.0 (stable). If you need tool calling and MTP at the same time, use the latest main branch.
  • GPU: 8xH200 or 8xH20 (141 GB each) for single-node FP8; 8xB200 (180 GB each) for the full 1M context.
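The GPU sizing above can be sanity-checked with rough arithmetic. The sketch below is only an estimate under stated assumptions: 1 byte per parameter for FP8, 2 bytes for BF16, 1 GB = 1e9 bytes, and no allowance for quantization scales, activations, or runtime overhead. The helper names are illustrative, not part of vLLM.

```python
# Back-of-the-envelope weight footprint (assumptions: FP8 = 1 byte/param,
# BF16 = 2 bytes/param, 1 GB = 1e9 bytes; ignores scales and runtime overhead).
TOTAL_PARAMS = 743e9  # ~743B total parameters, from the overview

NODES_GB = {"8xH200 / 8xH20": 8 * 141, "8xB200": 8 * 180}
BYTES_PER_PARAM = {"FP8": 1, "BF16": 2}


def weight_gb(dtype: str) -> float:
    """Approximate size of all weights in GB for the given dtype."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[dtype] / 1e9


def headroom_gb(node: str, dtype: str) -> float:
    """HBM left on the node after loading weights (negative means no fit)."""
    return NODES_GB[node] - weight_gb(dtype)


for node in NODES_GB:
    for dtype in BYTES_PER_PARAM:
        left = headroom_gb(node, dtype)
        verdict = "fits" if left > 0 else "does not fit"
        print(f"{node:15s} {dtype:4s} weights={weight_gb(dtype):6.0f} GB "
              f"headroom={left:7.0f} GB -> {verdict}")
```

Under these assumptions the FP8 weights leave a few hundred GB per node for KV cache, while the BF16 weights alone exceed the HBM of either single node.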

Installation

Docker

docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:glm52 zai-org/GLM-5.2-FP8 \
    --tensor-parallel-size 8 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm-5.2-fp8 \
    --kv-cache-dtype fp8

On CUDA 12.x, swap the image for vllm/vllm-openai:glm52-cu129.

From source

uv venv
source .venv/bin/activate
uv pip install "vllm==0.23.0" --torch-backend=auto
uv pip install "transformers>=5.9.0"

Launching the server

FP8 on 8xH200 (standard)

vllm serve zai-org/GLM-5.2-FP8 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 8 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 5 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-5.2-fp8

FP8 on AMD MI300X/MI355X (full 1M context)

GLM-5.2 has a native 1M-token window. Whether the full window fits on ROCm is a KV-cache VRAM question, so the levers are --max-model-len and --max-num-seqs. Start with the values below, then scale them with your node's HBM and workload: raise the context window when startup reports KV-cache headroom, or lower the sequence cap if long prompts OOM under concurrency.

VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1 \
vllm serve zai-org/GLM-5.2-FP8 \
  --kv-cache-dtype fp8_e4m3 \
  --tensor-parallel-size 8 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 5 \
  --tool-call-parser glm47 \
  --enable-auto-tool-choice \
  --reasoning-parser glm45 \
  --gpu-memory-utilization 0.80 \
  --max-model-len 524288 \
  --max-num-seqs 32 \
  --linear-backend aiter \
  --moe-backend aiter

  • --max-model-len: caps the served context window; raise it toward the native 1M window when your HBM budget and workload leave KV-cache headroom.
  • --max-num-seqs 32: the main knob for fitting long context under concurrency; start at 32 and tune it to your HBM (up on headroom, down on OOM).
  • --gpu-memory-utilization 0.80: leaves ROCm runtime headroom for MTP graph capture and inference; raise it only after a representative concurrent smoke test.
  • --speculative-config.num_speculative_tokens 5: enables GLM-5.2's 5-token MTP path; reduce it if your workload shows low acceptance or higher latency.
  • VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1: enables the shared-expert fused MoE path. VLLM_ROCM_USE_AITER_LINEAR and VLLM_ROCM_USE_AITER_MOE default to enabled when VLLM_ROCM_USE_AITER=1.
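To reason about the trade-off between --max-model-len and --max-num-seqs, it can help to compare the KV-cache capacity that vLLM reports at startup (in tokens) against the context window. The sketch below is a rough planning aid, not a vLLM API; the capacity value is a hypothetical placeholder that should be replaced with the number from your own startup log.

```python
def max_full_length_seqs(kv_cache_tokens: int, max_model_len: int) -> int:
    """How many sequences can each hold a full max_model_len context at once."""
    return kv_cache_tokens // max_model_len


# Hypothetical KV-cache capacity in tokens; read the real value from the startup log.
kv_cache_tokens = 4_000_000

for max_model_len in (524_288, 1_048_576):
    n = max_full_length_seqs(kv_cache_tokens, max_model_len)
    print(f"max_model_len={max_model_len}: {n} full-length sequences fit")
```

A sequence cap of 32 is therefore only fully usable when most requests are much shorter than the window; if the workload is dominated by near-full-length prompts, the cap is the value to lower first.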

FP8 on 8xB200 (full 1M context)

GLM-5.2 has a native 1M-token window. Whether the full window fits is a KV-cache VRAM question, so the lever is --max-num-seqs: it bounds how many sequences share the KV budget at once, leaving room for each to hold a long context. Start at 32 and scale with your node's VRAM: raise it on larger trays (e.g. 8xB300) or short-prompt traffic, lower it if you OOM at full context. FP8 KV cache (--kv-cache-dtype fp8_e4m3, set in the command below) roughly halves the KV-cache bytes per token relative to BF16, which is what makes 1M reachable at all.

VLLM_DEEP_GEMM_WARMUP=skip vllm serve zai-org/GLM-5.2-FP8 \
  --kv-cache-dtype fp8_e4m3 \
  --tensor-parallel-size 8 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 5 \
  --max-num-seqs 32 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-5.2-fp8

  • --max-num-seqs 32: the single knob for fitting 1M context; start at 32 and tune it to your VRAM (up on headroom, down on OOM).
  • VLLM_DEEP_GEMM_WARMUP=skip: skips DeepGEMM JIT warmup for a faster startup; the first few requests compile kernels on demand instead.
  • BF16 needs multi-node plus an extra loader flag; see Troubleshooting.
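The effect of an FP8 KV cache on context capacity can be illustrated with simple arithmetic. In the sketch below, both the KV budget and the number of cached elements per token are hypothetical placeholders (the real figures depend on the model's attention layout and on what is left after weights); only the 2-byte versus 1-byte element size reflects the BF16 and FP8 formats.

```python
def tokens_that_fit(budget_bytes: int, elems_per_token: int, bytes_per_elem: int) -> int:
    """Number of tokens whose KV entries fit in the given byte budget."""
    return budget_bytes // (elems_per_token * bytes_per_elem)


budget_bytes = 300 * 10**9   # hypothetical KV-cache budget across the node
elems_per_token = 50_000     # hypothetical cached elements per token, all layers

bf16_tokens = tokens_that_fit(budget_bytes, elems_per_token, bytes_per_elem=2)
fp8_tokens = tokens_that_fit(budget_bytes, elems_per_token, bytes_per_elem=1)

print(f"BF16 KV cache: {bf16_tokens:,} tokens")
print(f"FP8 KV cache:  {fp8_tokens:,} tokens")
```

Whatever the real per-token size is, halving the element width doubles the number of tokens that fit in the same budget.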

Reasoning modes

Thinking is on by default. GLM-5.2 reuses the DeepSeek-V4 reasoning_effort mechanism, with two effort levels driven by the reasoning_effort field:

| Mode | How to request | Behavior |
|---|---|---|
| Think Max (default) | omit reasoning_effort, or set "max" | Deepest reasoning: hard math, multi-step planning, agentic tasks. Highest token cost. |
| Think High | "reasoning_effort": "high" | Balanced depth and latency. |
| Non-think | chat_template_kwargs.enable_thinking: false | Fast, no chain-of-thought. |

The chat template resolves effort to max unless reasoning_effort is explicitly "high", so Max is the default and High is opt-in. Pass it through chat_template_kwargs (the DeepSeek-V4 path) or the top-level OpenAI reasoning_effort field; it has no effect when thinking is disabled.

Client usage

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
msgs = [{"role": "user", "content": "Summarize GLM-5.2 in one sentence."}]

# Think Max (default) — just omit reasoning_effort
client.chat.completions.create(model="glm-5.2-fp8", messages=msgs, max_tokens=4096)

# Think High — explicitly request reasoning_effort: "high"
client.chat.completions.create(
    model="glm-5.2-fp8",
    messages=msgs,
    max_tokens=4096,
    extra_body={"chat_template_kwargs": {"reasoning_effort": "high"}},
)

# Non-think
client.chat.completions.create(
    model="glm-5.2-fp8",
    messages=msgs,
    max_tokens=4096,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
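With a reasoning parser enabled, the reasoning trace is typically returned separately from the final answer. The sketch below shows one way to read both from a response message. The field names are an assumption: depending on the vLLM version the trace may appear as `reasoning` or `reasoning_content`, so the hypothetical helper checks both. A stand-in object replaces `resp.choices[0].message` so the snippet runs without a server.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer) from a chat completion message.

    Assumes the reasoning trace lives in `reasoning` or `reasoning_content`;
    returns None for the trace if neither field is present.
    """
    reasoning = getattr(message, "reasoning", None) or getattr(
        message, "reasoning_content", None
    )
    return reasoning, message.content


# Stand-in for resp.choices[0].message from client.chat.completions.create(...)
fake_message = SimpleNamespace(
    reasoning_content="The user wants a one-sentence summary ...",
    content="GLM-5.2 is a ~743B-parameter MoE model with 5-token MTP.",
)

reasoning, answer = split_reasoning(fake_message)
print("reasoning:", reasoning)
print("answer:", answer)
```

In non-think mode the trace is expected to be empty, so callers should handle a `None` value.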

cURL — Think High

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.2-fp8",
    "messages": [{"role": "user", "content": "Summarize GLM-5.2 in one sentence."}],
    "temperature": 1,
    "max_tokens": 4096,
    "chat_template_kwargs": {"reasoning_effort": "high"}
  }'

cURL — non-think

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.2-fp8",
    "messages": [{"role": "user", "content": "Summarize GLM-5.2 in one sentence."}],
    "temperature": 1,
    "max_tokens": 4096,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

Benchmarking

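Add --no-enable-prefix-caching to the server command for a clean measurement. If the server was launched with --served-model-name glm-5.2-fp8, pass the same name to the benchmark so requests use the served name; --model still selects the tokenizer.

vllm bench serve \
  --model zai-org/GLM-5.2-FP8 \
  --served-model-name glm-5.2-fp8 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1024 \
  --request-rate 10 \
  --num-prompts 32 \
  --ignore-eos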

Note: pure throughput benchmarks tend to under-report real-world speed, because the random dataset produces random-token prompts, on which MTP's acceptance rate is usually lower than on natural text or code.
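To see why acceptance rate matters so much for measured throughput, a common simplification from the speculative decoding literature treats each draft token as accepted independently with probability a. With k draft tokens, the expected number of tokens emitted per target-model step is then (1 - a^(k+1)) / (1 - a). The sketch below evaluates that formula; the acceptance values are illustrative, not measurements of GLM-5.2.

```python
def expected_tokens_per_step(accept_prob: float, num_draft: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each draft token is accepted independently with probability
    accept_prob, and that one token is always produced by the target model.
    """
    if accept_prob >= 1.0:
        return float(num_draft + 1)
    return (1 - accept_prob ** (num_draft + 1)) / (1 - accept_prob)


# Illustrative acceptance rates, not measured values.
for a in (0.3, 0.5, 0.7, 0.9):
    tokens = expected_tokens_per_step(a, num_draft=5)
    print(f"acceptance={a:.1f} -> {tokens:.2f} tokens/step with 5 draft tokens")
```

Under this model, low acceptance yields little more than one token per step, so a synthetic run can make 5-token MTP look far less effective than it is on realistic traffic.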

Troubleshooting

  • FP8 performance: DeepGEMM is required — install via install_deepgemm.sh.
  • MTP performance: some MTP acceptance-rate issues were fixed in this PR. If you still see a low acceptance rate, update your branch or use the GLM-5.2 Docker image.

References