zai-org/GLM-5.1
GLM-5.1 is a refreshed version of GLM-5: a frontier-scale MoE language model (~744B total parameters) with MTP speculative decoding and a thinking mode
Overview
GLM-5.1 is a refreshed version of GLM-5, the 744B-parameter frontier MoE model from Z-AI. It keeps the asynchronous RL training recipe and delivers best-in-class open-source performance on reasoning, coding, and agentic benchmarks. Both BF16 and native FP8 checkpoints are published.
Thinking mode is enabled by default; disable it by passing
"chat_template_kwargs": {"enable_thinking": false} in request extras.
Prerequisites
- vLLM version: 0.19.0 (stable — preferred over nightly for model performance). Use the latest main branch if you need tool calling + MTP simultaneously.
- Hardware (FP8): 8xH200 or 8xH20 (141GB × 8)
- DeepGEMM (FP8): install via install_deepgemm.sh from the vLLM repo
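As a rough sanity check on the stated hardware floor: an FP8 checkpoint stores about one byte per weight, so splitting ~744B parameters across 8 tensor-parallel ranks leaves headroom on 141 GB GPUs. Back-of-envelope only; it ignores KV cache, activations, and MTP-head weights:

```python
total_params_b = 744    # approximate total parameters, in billions
bytes_per_param = 1     # FP8 checkpoint: ~1 byte per weight
tensor_parallel = 8

weights_gb = total_params_b * bytes_per_param   # ~744 GB of weights overall
per_gpu_gb = weights_gb / tensor_parallel       # share held by each rank
print(per_gpu_gb)  # 93.0 GB per GPU, before KV cache and activations
```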
Using Docker
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:glm51 zai-org/GLM-5.1-FP8 \
--tensor-parallel-size 8 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--chat-template-content-format=string \
--served-model-name glm-5.1-fp8
Use vllm/vllm-openai:glm51-cu130 for CUDA 13+.
Install vLLM from Source
uv venv
source .venv/bin/activate
uv pip install "vllm==0.19.0" --torch-backend=auto
uv pip install "transformers>=5.4.0"
Launching the Server
FP8 on 8xH200 with MTP
vllm serve zai-org/GLM-5.1-FP8 \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 3 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--chat-template-content-format=string \
--served-model-name glm-5.1-fp8
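For intuition on the num_speculative_tokens setting: under a simplified model where each MTP draft token is accepted independently with probability alpha, a draft of k tokens yields 1 + alpha + ... + alpha^k expected tokens per verification step. This is illustrative math only; real acceptance rates vary by workload:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model step with k draft tokens,
    assuming independent per-token acceptance probability alpha."""
    return sum(alpha ** i for i in range(k + 1))

# k=3 matches num_speculative_tokens above; alpha=0.8 is a made-up rate.
print(round(expected_tokens_per_step(0.8, 3), 3))  # 2.952
```

With no drafting (alpha = 0) this degrades to exactly one token per step, which is plain autoregressive decoding.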
Client Usage
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
# Thinking ON (default)
resp_on = client.chat.completions.create(
model="glm-5.1-fp8",
messages=[{"role": "user", "content": "Summarize GLM-5.1 in one sentence."}],
temperature=1,
max_tokens=4096,
)
print(resp_on.choices[0].message.reasoning_content)
# Thinking OFF
resp_off = client.chat.completions.create(
model="glm-5.1-fp8",
messages=[{"role": "user", "content": "Summarize GLM-5.1 in one sentence."}],
temperature=1,
max_tokens=4096,
extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp_off.choices[0].message.content)
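The attribute carrying the parsed thinking text has varied across vLLM builds (reasoning_content vs. reasoning), so a defensive accessor can help. The snippet below uses a stand-in object rather than a live response:

```python
from types import SimpleNamespace

def get_reasoning(message):
    # The reasoning parser typically fills `reasoning_content`; some
    # builds expose `reasoning` instead, so check both.
    return getattr(message, "reasoning_content", None) or getattr(message, "reasoning", None)

# Stand-in for a parsed chat message (not an actual API response):
msg = SimpleNamespace(reasoning_content="chain of thought...", content="final answer")
print(get_reasoning(msg))  # chain of thought...
```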
cURL (Thinking ON)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm-5.1-fp8",
"messages": [
{"role": "user", "content": "Summarize GLM-5.1 in one sentence."}
],
"temperature": 1,
"max_tokens": 4096
}'
Benchmarking
vllm bench serve \
--model zai-org/GLM-5.1-FP8 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1024 \
--request-rate 10 \
--num-prompts 32 \
--ignore-eos
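For sanity-checking what the benchmark reports, the run above has a fixed token budget. This is simple arithmetic on the flags, not measured output:

```python
num_prompts = 32
input_len, output_len = 8000, 1024
request_rate = 10  # requests per second

total_input = num_prompts * input_len     # prompt tokens sent
total_output = num_prompts * output_len   # tokens generated (--ignore-eos forces full length)
issue_time_s = num_prompts / request_rate # seconds just to submit all requests
print(total_input, total_output, issue_time_s)  # 256000 32768 3.2
```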
Troubleshooting
- Accuracy drift: Prefer the 0.19.0 stable release for best accuracy.
- Tool calling + MTP: If both are needed, use the latest vLLM main branch.
- FP8 installation: DeepGEMM required for FP8 performance.