zai-org/GLM-5.2
GLM-5.2 — frontier-scale MoE language model (~743B total parameters, 39B active) with up to 5-token MTP speculative decoding and thinking mode
Latest GLM-5 series MoE with extended MTP (5 draft tokens), improved reasoning and agentic performance
Guide
Overview
GLM-5.2 is the newest model in the GLM-5 series, a ~743B-parameter MoE (39B active) from Z.ai. The headline change over GLM-5 / 5.1 is that Multi-Token Prediction (MTP) is extended from 3 to 5 draft tokens, lifting end-to-end throughput on reasoning, coding, and agentic workloads. It ships as BF16 and native-FP8 checkpoints and keeps the GLM thinking-mode behavior.
This recipe targets the FP8 checkpoint, the practical default: it fits on a single 8xH200 / 8xH20 node and — with FP8 KV cache — reaches the full 1M-token context on 8xB200.
Prerequisites
- vLLM 0.23.0 (stable). If you need tool calling and MTP at the same time, use the latest main branch.
- GPU: 8xH200 or 8xH20 (141 GB each) for single-node FP8; 8xB200 (180 GB each) for the full 1M context.
Installation
Docker
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:v0.23.0 zai-org/GLM-5.2-FP8 \
--tensor-parallel-size 8 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-5.2-fp8 \
--kv-cache-dtype fp8_e4m3
On CUDA 12.x, swap the image for vllm/vllm-openai:v0.23.0-cu129.
From source
uv venv
source .venv/bin/activate
uv pip install "vllm==0.23.0" --torch-backend=auto
uv pip install "transformers>=5.9.0"
Launching the server
FP8 on 8xH200 (standard)
vllm serve zai-org/GLM-5.2-FP8 \
--kv-cache-dtype fp8_e4m3 \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 5 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-5.2-fp8
FP8 on 8xB200 (full 1M context)
GLM-5.2 has a native 1M-token window. Whether the full window fits is a KV-cache VRAM
question, so the lever is --max-num-seqs: it bounds how many sequences share the KV budget
at once, leaving room for each to hold a long context. Start at 32 and scale with your
node's VRAM: raise it on larger trays (e.g. 8xB300) or short-prompt traffic, lower it if you
OOM at full context. FP8 KV cache (--kv-cache-dtype fp8_e4m3, already in the base flags)
stores each cached element in one byte instead of BF16's two, roughly halving the memory
each cached token consumes, which is what makes 1M reachable at all.
VLLM_DEEP_GEMM_WARMUP=skip vllm serve zai-org/GLM-5.2-FP8 \
--kv-cache-dtype fp8_e4m3 \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 5 \
--max-num-seqs 32 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-5.2-fp8
- --max-num-seqs 32: the single knob for fitting 1M context; start at 32 and tune it to your VRAM (up on headroom, down on OOM).
- VLLM_DEEP_GEMM_WARMUP=skip: skips DeepGEMM JIT warmup for a faster startup; the first few requests compile kernels on demand instead.
- BF16 needs multi-node plus an extra loader flag; see Troubleshooting.
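The trade-off between concurrency and context length can be sketched with back-of-the-envelope arithmetic. The snippet below is a rough estimate only: the function names, KV_ELEMS_PER_TOKEN, and the 400 GiB budget are made-up placeholders, not GLM-5.2's real cache layout or a measured figure, and vLLM's block allocator adds overheads this ignores.

```python
def kv_cache_bytes(tokens: int, kv_elems_per_token: int, bytes_per_elem: int) -> int:
    """Approximate KV-cache bytes needed to hold `tokens` cached tokens."""
    return tokens * kv_elems_per_token * bytes_per_elem


def max_full_context_seqs(budget_bytes: int, context_len: int,
                          kv_elems_per_token: int, bytes_per_elem: int) -> int:
    """How many sequences of `context_len` tokens fit in `budget_bytes`."""
    per_seq = kv_cache_bytes(context_len, kv_elems_per_token, bytes_per_elem)
    return budget_bytes // per_seq


# Hypothetical numbers, for illustration only.
KV_ELEMS_PER_TOKEN = 50_000   # placeholder: cached elements per token, all layers
BUDGET = 400 * 1024**3        # placeholder: VRAM left for KV cache across the node
CONTEXT = 1_000_000           # roughly the 1M-token window

# BF16 stores 2 bytes per element, FP8 stores 1.
bf16 = max_full_context_seqs(BUDGET, CONTEXT, KV_ELEMS_PER_TOKEN, 2)
fp8 = max_full_context_seqs(BUDGET, CONTEXT, KV_ELEMS_PER_TOKEN, 1)
print(f"full-context sequences that fit: bf16={bf16}, fp8={fp8}")
```

With any fixed budget, the FP8 count comes out at about twice the BF16 count. Most traffic uses far less than the full window, which is why a starting value such as 32 can still be reasonable.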
Thinking mode & effort
Thinking is on by default. Turn it off per request with
chat_template_kwargs.enable_thinking: false.
GLM-5.2 also exposes a thinking-effort dial that trades reasoning depth for latency:
| Effort | When to use |
|---|---|
| max | Deepest reasoning: hard math, multi-step planning, agentic tasks. Highest token cost. |
| high | Balanced depth and latency. A good default for most workloads. |
Set it with chat_template_kwargs.thinking_effort, or via the OpenAI-compatible
reasoning_effort field. It has no effect when thinking is disabled.
Client usage
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
msgs = [{"role": "user", "content": "Summarize GLM-5.2 in one sentence."}]

# Thinking effort = "max" (or "high")
r = client.chat.completions.create(
    model="glm-5.2-fp8",
    messages=msgs,
    temperature=1,
    max_tokens=4096,
    extra_body={"chat_template_kwargs": {"thinking_effort": "max"}},
)
print(r.choices[0].message.reasoning)

# Equivalent via the OpenAI reasoning_effort field
client.chat.completions.create(
    model="glm-5.2-fp8", messages=msgs, extra_body={"reasoning_effort": "high"},
)

# Thinking OFF (thinking_effort is ignored)
client.chat.completions.create(
    model="glm-5.2-fp8",
    messages=msgs,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
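If several call sites need these options, one way to keep them consistent is a small helper that builds the extra_body dict. This is a sketch, not part of vLLM or the OpenAI client: the name thinking_extra_body is hypothetical, and restricting effort to "max" and "high" is an assumption based on the two levels listed in the table above.

```python
from typing import Optional

VALID_EFFORTS = {"max", "high"}  # assumption: only the two documented levels


def thinking_extra_body(enable_thinking: bool = True,
                        effort: Optional[str] = None) -> dict:
    """Build the `extra_body` dict for a chat completion request."""
    if not enable_thinking:
        # thinking_effort has no effect when thinking is off, so omit it.
        return {"chat_template_kwargs": {"enable_thinking": False}}
    if effort is None:
        # Nothing to override: thinking is already on by default.
        return {}
    if effort not in VALID_EFFORTS:
        raise ValueError(f"unknown thinking effort: {effort!r}")
    return {"chat_template_kwargs": {"thinking_effort": effort}}


print(thinking_extra_body(effort="max"))
print(thinking_extra_body(enable_thinking=False, effort="max"))
```

The result can be passed straight through, for example client.chat.completions.create(model="glm-5.2-fp8", messages=msgs, extra_body=thinking_extra_body(effort="high")).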
cURL — thinking effort max
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm-5.2-fp8",
"messages": [{"role": "user", "content": "Summarize GLM-5.2 in one sentence."}],
"temperature": 1,
"max_tokens": 4096,
"chat_template_kwargs": {"thinking_effort": "max"}
}'
cURL — thinking off
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm-5.2-fp8",
"messages": [{"role": "user", "content": "Summarize GLM-5.2 in one sentence."}],
"temperature": 1,
"max_tokens": 4096,
"chat_template_kwargs": {"enable_thinking": false}
}'
Benchmarking
Add --no-enable-prefix-caching to the server command for a clean measurement.
vllm bench serve \
--model zai-org/GLM-5.2-FP8 \
--dataset-name random \
--random-input 8000 \
--random-output 1024 \
--request-rate 10 \
--num-prompts 32 \
--ignore-eos
Note: throughput benchmarks on synthetic data tend to under-report real-world speed. Random prompts are hard for the MTP head to predict, so fewer draft tokens are accepted than on natural text or code.
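To see how strongly the acceptance rate drives MTP's benefit, a simplified model helps. The sketch below assumes each draft token is accepted independently with the same probability, which real workloads only approximate; the function name and the sample rates are illustrative, not measurements of GLM-5.2.

```python
def expected_tokens_per_step(acceptance: float, num_draft: int = 5) -> float:
    """Expected tokens emitted per verification step.

    Assumes each draft token is accepted independently with probability
    `acceptance`, and that every step emits at least one token.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be in [0, 1]")
    # 1 + a + a^2 + ... + a^k: the token at position i counts only if
    # all i draft tokens before it were accepted.
    return sum(acceptance ** i for i in range(num_draft + 1))


for a in (0.3, 0.6, 0.9):
    print(f"acceptance={a:.1f} -> {expected_tokens_per_step(a):.2f} tokens/step")
```

Under this model a run with low acceptance gains little from the five draft tokens, while a run with high acceptance approaches six tokens per step, so a benchmark on random data can sit well below what production traffic sees.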
Troubleshooting
- BF16 weight-load error (...indexer.k_norm... not initialized): GLM-5.2 uses DeepSeek-style sparse attention, where IndexCache "skip" layers reuse a neighbour layer's top-k indices and carry no indexer weights in the checkpoint. vLLM still builds those indexer modules, so under BF16 the post-load strict weight check raises ValueError: Following weights were not initialized from checkpoint: {...indexer.k_norm...}. The bf16 variant works around this with --model-loader-extra-config.enable_weights_track=false (already injected into its generated command). This is safe because skip layers never run their indexer (those weights are never read). FP8 is unaffected: the strict check is disabled for quantized models.
- Tool calling + MTP: If both are needed simultaneously, use the latest vLLM main branch.
- FP8 performance: DeepGEMM is required; install via install_deepgemm.sh.
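The BF16 weight-load error above comes down to a set difference between the parameters the model builds and the names found in the checkpoint. The toy below illustrates that logic only; it is not vLLM's loader, and the function and parameter names are hypothetical.

```python
def check_all_initialized(model_params: set, loaded: set, strict: bool = True) -> set:
    """Return parameters the checkpoint never filled; raise if `strict`."""
    missing = model_params - loaded
    if strict and missing:
        raise ValueError(
            f"Following weights were not initialized from checkpoint: {sorted(missing)}"
        )
    return missing


# Hypothetical names: layer 0 owns an indexer, layer 1 is a "skip" layer whose
# indexer module is built but has no weights in the checkpoint.
model_params = {"layers.0.indexer.k_norm.weight", "layers.1.indexer.k_norm.weight"}
checkpoint = {"layers.0.indexer.k_norm.weight"}

# With the strict check off, the unused skip-layer weight is reported, not fatal.
print(check_all_initialized(model_params, checkpoint, strict=False))
```

Relaxing the check is harmless in this picture only because the missing entries belong to modules that are never executed; the same relaxation would hide a genuinely incomplete checkpoint.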