zai-org/GLM-5.2
GLM-5.2 — frontier-scale MoE language model (~743B total parameters, 39B active) with up to 5-token MTP speculative decoding and thinking mode
Latest GLM-5 series MoE with extended MTP (5 draft tokens), improved reasoning and agentic performance
Guide
Overview
GLM-5.2 is the newest model in the GLM-5 series — a ~743B-parameter MoE (39B active) from Z-AI. The headline change over GLM-5 / 5.1 is that Multi-Token Prediction (MTP) is extended from 3 to 5 draft tokens, lifting end-to-end throughput on reasoning, coding, and agentic workloads. It ships as BF16 and native-FP8 checkpoints and keeps the GLM thinking-mode behavior.
This recipe targets the FP8 checkpoint, the practical default: it fits on a single 8xH200 / 8xH20 node and — with FP8 KV cache — reaches the full 1M-token context on 8xB200.
Prerequisites
- vLLM 0.23.0 (stable). If you need tool calling and MTP at the same time, use the latest main branch.
- GPU: 8xH200 or 8xH20 (141 GB each) for single-node FP8; 8xB200 (180 GB each) for the full 1M context.
Installation
Docker
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:glm52 zai-org/GLM-5.2-FP8 \
--tensor-parallel-size 8 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-5.2-fp8 \
--kv-cache-dtype fp8
On CUDA 12.x, swap the image for vllm/vllm-openai:glm52-cu129.
From source
uv venv
source .venv/bin/activate
uv pip install "vllm==0.23.0" --torch-backend=auto
uv pip install "transformers>=5.9.0"
Launching the server
FP8 on 8xH200 (standard)
vllm serve zai-org/GLM-5.2-FP8 \
--kv-cache-dtype fp8 \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 5 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-5.2-fp8
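Before sending traffic, it can help to confirm that the server is up and registered under the expected name. The following is a minimal sketch, assuming the server from the command above listens on localhost:8000 and serves the OpenAI-compatible /v1/models route. `fetch` and `served_model_ids` are hypothetical helper names introduced here, not part of vLLM.

```python
import json
import urllib.request


def fetch(url: str, timeout: float = 5.0) -> str:
    """Return the body of a GET request as text (standard library only)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def served_model_ids(models_json: str) -> list[str]:
    """Extract the model ids from an OpenAI-style /v1/models response body."""
    payload = json.loads(models_json)
    return [entry["id"] for entry in payload.get("data", [])]


# Usage against a running server (assumed address, adjust to your deployment):
#   ids = served_model_ids(fetch("http://localhost:8000/v1/models"))
#   print("glm-5.2-fp8" in ids, ids)
```

If the name passed to --served-model-name is missing from the list, client requests that use that name will be rejected.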
FP8 on AMD MI300X/MI355X (full 1M context)
GLM-5.2 has a native 1M-token window. Whether the full window fits on ROCm is a
KV-cache VRAM question, so the levers are --max-model-len and --max-num-seqs.
Start with the values below, then scale them with your node's HBM and workload:
raise the context window when startup reports KV-cache headroom, or lower the
sequence cap if long prompts OOM under concurrency.
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1 \
vllm serve zai-org/GLM-5.2-FP8 \
--kv-cache-dtype fp8_e4m3 \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 5 \
--tool-call-parser glm47 \
--enable-auto-tool-choice \
--reasoning-parser glm45 \
--gpu-memory-utilization 0.80 \
--max-model-len 524288 \
--max-num-seqs 32 \
--linear-backend aiter \
--moe-backend aiter
- --max-model-len: caps the served context window; raise it toward the native 1M window when your HBM budget and workload leave KV-cache headroom.
- --max-num-seqs 32: the main knob for fitting long context under concurrency; start at 32 and tune it to your HBM (up on headroom, down on OOM).
- --gpu-memory-utilization 0.80: leaves ROCm runtime headroom for MTP graph capture and inference; raise it only after a representative concurrent smoke test.
- --speculative-config.num_speculative_tokens 5: enables GLM-5.2's 5-token MTP path; reduce it if your workload shows low acceptance or higher latency.
- VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1: enables the shared-expert fused MoE path.
- VLLM_ROCM_USE_AITER_LINEAR and VLLM_ROCM_USE_AITER_MOE default to enabled when VLLM_ROCM_USE_AITER=1.
FP8 on 8xB200 (full 1M context)
GLM-5.2 has a native 1M-token window. Whether the full window fits is a KV-cache VRAM
question, so the lever is --max-num-seqs — it bounds how many sequences share the KV budget
at once, leaving room for each to hold a long context. Start at 32 and scale with your
node's VRAM: raise it on larger trays (e.g. 8xB300) or short-prompt traffic, lower it if you
OOM at full context. FP8 KV cache (--kv-cache-dtype fp8_e4m3, included in the command below)
roughly halves the per-token KV-cache footprint relative to BF16, which is what makes 1M reachable at all.
VLLM_DEEP_GEMM_WARMUP=skip vllm serve zai-org/GLM-5.2-FP8 \
--kv-cache-dtype fp8_e4m3 \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 5 \
--max-num-seqs 32 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-5.2-fp8
- --max-num-seqs 32: the single knob for fitting 1M context; start at 32 and tune it to your VRAM (up on headroom, down on OOM).
- VLLM_DEEP_GEMM_WARMUP=skip: skips DeepGEMM JIT warmup for a faster startup; the first few requests compile kernels on demand instead.
- BF16 needs multi-node plus an extra loader flag; see Troubleshooting.
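The trade-off between context length and concurrency can be sketched as simple arithmetic. This is a rough illustration only: `ELEMS_PER_TOKEN` is an invented placeholder that depends on the model's attention layout, not a measured GLM-5.2 value, and both function names are hypothetical. vLLM allocates KV cache from a shared pool of blocks, so the worst case below (every sequence at full length) is an upper bound rather than what a typical workload uses.

```python
def worst_case_kv_gib(max_model_len: int, max_num_seqs: int,
                      kv_bytes_per_token: float) -> float:
    """Upper bound on KV-cache size if every sequence reaches max_model_len."""
    total_bytes = max_model_len * max_num_seqs * kv_bytes_per_token
    return total_bytes / 2**30


def max_seqs_at_full_context(kv_budget_gib: float, max_model_len: int,
                             kv_bytes_per_token: float) -> int:
    """How many full-length sequences fit in a given KV-cache budget."""
    per_seq_bytes = max_model_len * kv_bytes_per_token
    return int(kv_budget_gib * 2**30 // per_seq_bytes)


# Placeholder number of cached elements per token (NOT a GLM-5.2 figure).
ELEMS_PER_TOKEN = 1024

# BF16 stores 2 bytes per element, FP8 stores 1 byte per element.
bf16 = worst_case_kv_gib(1_048_576, 32, ELEMS_PER_TOKEN * 2)
fp8 = worst_case_kv_gib(1_048_576, 32, ELEMS_PER_TOKEN * 1)
print(f"BF16 KV: {bf16:.0f} GiB, FP8 KV: {fp8:.0f} GiB")
```

Whatever the real per-token cost is, the structure is the same: the FP8 cache needs half the bytes of BF16, and halving --max-num-seqs halves the worst-case footprint at a fixed context length.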
Reasoning modes
Thinking is on by default. GLM-5.2 reuses the DeepSeek-V4 reasoning_effort mechanism,
with two effort levels driven by the reasoning_effort field:
| Mode | How to request | Behavior |
|---|---|---|
| Think Max (default) | omit reasoning_effort, or set "max" | Deepest reasoning — hard math, multi-step planning, agentic tasks. Highest token cost. |
| Think High | "reasoning_effort": "high" | Balanced depth and latency. |
| Non-think | chat_template_kwargs.enable_thinking: false | Fast, no chain-of-thought. |
The chat template resolves effort to max unless reasoning_effort is explicitly "high",
so Max is the default and High is opt-in. Pass it through chat_template_kwargs (the
DeepSeek-V4 path) or the top-level OpenAI reasoning_effort field; it has no effect when
thinking is disabled.
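The three modes in the table can be captured in a small helper that returns the extra_body for the OpenAI Python client. This is a sketch, not part of any library: the function name and the mode labels "max", "high", and "off" are hypothetical, and only the chat_template_kwargs keys come from the table above.

```python
def reasoning_extra_body(mode: str) -> dict:
    """Map a mode label to the extra_body dict for chat.completions.create."""
    if mode == "max":
        # Think Max is the default, so nothing needs to be sent.
        return {}
    if mode == "high":
        return {"chat_template_kwargs": {"reasoning_effort": "high"}}
    if mode == "off":
        # Non-think: reasoning_effort has no effect once thinking is disabled.
        return {"chat_template_kwargs": {"enable_thinking": False}}
    raise ValueError(f"unknown reasoning mode: {mode!r}")


for label in ("max", "high", "off"):
    print(label, reasoning_extra_body(label))
```

Centralizing the mapping keeps call sites from mixing enable_thinking and reasoning_effort in the same request.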
Client usage
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
msgs = [{"role": "user", "content": "Summarize GLM-5.2 in one sentence."}]

# Think Max (default): just omit reasoning_effort
client.chat.completions.create(model="glm-5.2-fp8", messages=msgs, max_tokens=4096)

# Think High: explicitly request reasoning_effort: "high"
client.chat.completions.create(
    model="glm-5.2-fp8",
    messages=msgs,
    max_tokens=4096,
    extra_body={"chat_template_kwargs": {"reasoning_effort": "high"}},
)

# Non-think
client.chat.completions.create(
    model="glm-5.2-fp8",
    messages=msgs,
    max_tokens=4096,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
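With --reasoning-parser enabled, the server returns the chain-of-thought separately from the final answer. The sketch below shows one way to read both from a response message. The field names are an assumption: vLLM releases have used `reasoning_content` and, more recently, `reasoning`, so check the response from your own version. `split_reasoning` is a hypothetical helper.

```python
def split_reasoning(message: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat-completion message dict.

    Checks both field names that vLLM releases are assumed to use for
    the parsed reasoning text; either may be absent in non-think mode.
    """
    reasoning = message.get("reasoning") or message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning, answer


# Usage with the OpenAI client (requires a running server):
#   resp = client.chat.completions.create(model="glm-5.2-fp8", messages=msgs)
#   reasoning, answer = split_reasoning(resp.choices[0].message.model_dump())
```

In non-think mode the reasoning string is expected to be empty and the whole reply lands in content.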
cURL — Think High
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm-5.2-fp8",
"messages": [{"role": "user", "content": "Summarize GLM-5.2 in one sentence."}],
"temperature": 1,
"max_tokens": 4096,
"chat_template_kwargs": {"reasoning_effort": "high"}
}'
cURL — non-think
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm-5.2-fp8",
"messages": [{"role": "user", "content": "Summarize GLM-5.2 in one sentence."}],
"temperature": 1,
"max_tokens": 4096,
"chat_template_kwargs": {"enable_thinking": false}
}'
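The launch commands also pass --tool-call-parser glm47 and --enable-auto-tool-choice, so the same endpoint accepts OpenAI-style tool definitions. The sketch below builds such a request body. The `get_weather` tool and the `tool_call_request` helper are hypothetical examples, and the Prerequisites note about using the main branch for tool calling together with MTP still applies.

```python
import json


def tool_call_request(model: str, question: str) -> dict:
    """Build an OpenAI-style chat request that offers one tool to the model."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration only
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [weather_tool],
        "tool_choice": "auto",  # let the model decide whether to call the tool
        "max_tokens": 4096,
    }


body = tool_call_request("glm-5.2-fp8", "What is the weather in Paris?")
print(json.dumps(body, indent=2))
```

The dict can be passed as `client.chat.completions.create(**body)`; any tool calls the model makes are returned in `choices[0].message.tool_calls`.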
Benchmarking
Add --no-enable-prefix-caching to the server command for a clean measurement.
vllm bench serve \
--model zai-org/GLM-5.2-FP8 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1024 \
--request-rate 10 \
--num-prompts 32 \
--ignore-eos
Note: pure throughput benchmarks tend to under-report real speed, because MTP's acceptance rate is usually low in synthetic runs.
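To see why acceptance rate matters so much, consider a simplified model in which each draft token is accepted independently with probability α and verification stops at the first rejection. With k draft tokens, the expected number of tokens emitted per decoding step is then 1 + α + α² + ... + α^k. The independence assumption is a simplification chosen for this sketch, not a description of how vLLM measures acceptance, and the function name is hypothetical.

```python
def expected_tokens_per_step(accept_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens per step under independent per-token acceptance.

    Computes sum(accept_rate**i for i in 0..k): one token is always
    produced by the target model, plus each accepted draft prefix.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    return sum(accept_rate**i for i in range(num_draft_tokens + 1))


# Compare 3 vs 5 draft tokens at a few illustrative acceptance rates.
for alpha in (0.3, 0.6, 0.9):
    three = expected_tokens_per_step(alpha, 3)
    five = expected_tokens_per_step(alpha, 5)
    print(f"alpha={alpha}: k=3 -> {three:.2f}, k=5 -> {five:.2f}")
```

At low acceptance rates the extra draft tokens add almost nothing, which is consistent with random-token benchmarks understating the benefit of 5-token MTP on real prompts.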
Troubleshooting
- FP8 performance: DeepGEMM is required; install it via install_deepgemm.sh.
- MTP performance: We fixed some MTP acceptance-rate issues in this PR. If you run into low MTP acceptance rates, update your branch or use the GLM-5.2 Docker image.