GLM (Z-AI)

zai-org/GLM-5.1

GLM-5.1 is a refreshed version of GLM-5, a frontier-scale MoE language model (~744B total parameters) with MTP speculative decoding and a thinking mode.

MoE · 744B total / 40B active · 202,752 context length · vLLM 0.19.0+ · text-only

Overview

GLM-5.1 is a refreshed version of GLM-5, the 744B parameter frontier MoE model from Z-AI. It keeps the asynchronous RL training recipe and delivers best-in-class open-source performance on reasoning, coding, and agentic benchmarks. Both BF16 and native FP8 checkpoints are published.

Thinking mode is enabled by default; disable it by passing "chat_template_kwargs": {"enable_thinking": false} in request extras.
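
When building requests programmatically, the toggle can be wrapped in a small helper. A minimal sketch (the helper name is ours; only the chat_template_kwargs payload comes from the recipe):

```python
# Build the request extras that toggle GLM's thinking mode.
# The helper name is illustrative; the payload shape follows the recipe above.
def thinking_extras(enabled: bool = True) -> dict:
    if enabled:
        return {}  # thinking is on by default, so no extras are needed
    return {"chat_template_kwargs": {"enable_thinking": False}}

print(thinking_extras(False))
# → {'chat_template_kwargs': {'enable_thinking': False}}
```

Pass the returned dict as extra_body with the OpenAI Python client, as shown in the Client Usage section.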

Prerequisites

  • vLLM version: 0.19.0 stable (preferred over nightly for accuracy and performance). Use the latest main branch if you need tool calling and MTP simultaneously.
  • Hardware (FP8): 8xH200 or 8xH20 (141GB × 8)
  • DeepGEMM (FP8): install via install_deepgemm.sh from vLLM repo

Using Docker

docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:glm51 zai-org/GLM-5.1-FP8 \
    --tensor-parallel-size 8 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --chat-template-content-format=string \
    --served-model-name glm-5.1-fp8

Use vllm/vllm-openai:glm51-cu130 for CUDA 13+.
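
Once the container is up, vLLM exposes a /health endpoint you can poll before sending traffic. A minimal readiness sketch (the poll loop and helper names are ours; only the endpoint is vLLM's, and the base URL assumes the local Docker deployment above):

```python
import time
import urllib.request
from urllib.parse import urljoin

# Compose the health endpoint for a vLLM server base URL.
def health_url(base: str) -> str:
    return urljoin(base.rstrip("/") + "/", "health")

# Poll until the server answers 200, or give up after `timeout` seconds.
def wait_ready(base: str = "http://localhost:8000", timeout: float = 600.0) -> bool:
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(health_url(base), timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            time.sleep(5)  # model loading for a 744B checkpoint can take a while
    return False

print(health_url("http://localhost:8000"))  # http://localhost:8000/health
```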

Install vLLM from Source

uv venv
source .venv/bin/activate
uv pip install "vllm==0.19.0" --torch-backend=auto
uv pip install "transformers>=5.4.0"

Launching the Server

FP8 on 8xH200 with MTP

vllm serve zai-org/GLM-5.1-FP8 \
     --tensor-parallel-size 8 \
     --speculative-config.method mtp \
     --speculative-config.num_speculative_tokens 3 \
     --tool-call-parser glm47 \
     --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --chat-template-content-format=string \
     --served-model-name glm-5.1-fp8
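
The dotted --speculative-config.* flags are vLLM's per-key shorthand for a single JSON argument, so the same settings can be passed as one blob (values copied from the command above):

```python
import json

# Same settings as:
#   --speculative-config.method mtp --speculative-config.num_speculative_tokens 3
spec = {"method": "mtp", "num_speculative_tokens": 3}
print(json.dumps(spec))
# pass the printed blob as: --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```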

Client Usage

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Thinking ON (default)
resp_on = client.chat.completions.create(
    model="glm-5.1-fp8",
    messages=[{"role": "user", "content": "Summarize GLM-5.1 in one sentence."}],
    temperature=1,
    max_tokens=4096,
)
print(resp_on.choices[0].message.reasoning_content)

# Thinking OFF
resp_off = client.chat.completions.create(
    model="glm-5.1-fp8",
    messages=[{"role": "user", "content": "Summarize GLM-5.1 in one sentence."}],
    temperature=1,
    max_tokens=4096,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp_off.choices[0].message.content)

cURL (Thinking ON)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1-fp8",
    "messages": [
      {"role": "user", "content": "Summarize GLM-5.1 in one sentence."}
    ],
    "temperature": 1,
    "max_tokens": 4096
  }'

Benchmarking

vllm bench serve \
  --model zai-org/GLM-5.1-FP8 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1024 \
  --request-rate 10 \
  --num-prompts 32 \
  --ignore-eos
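
As a back-of-envelope check on the load this run generates (simple arithmetic over the flags above; real throughput varies with scheduling and batching):

```python
# Token volume implied by the benchmark flags above
num_prompts, input_len, output_len = 32, 8000, 1024
total_input = num_prompts * input_len    # 256000 prompt tokens
total_output = num_prompts * output_len  # 32768 generated tokens (--ignore-eos forces full length)
arrival_window = num_prompts / 10        # ~3.2 s of arrivals at --request-rate 10
print(total_input, total_output, arrival_window)
```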

Troubleshooting

  • Accuracy drift: Prefer the 0.19.0 stable release for best accuracy.
  • Tool calling + MTP: If both are needed, use the latest vLLM main branch.
  • FP8 performance: DeepGEMM is required; install it via install_deepgemm.sh from the vLLM repo.

References