
zai-org/GLM-5

GLM-5 is a frontier-scale MoE language model (~744B total parameters, 28.5T training tokens), trained with asynchronous RL infrastructure and targeting reasoning, coding, and agentic tasks.

MoE · 744B total / 40B active parameters · 202,752 context length · vLLM 0.16.0+ · text

Overview

GLM-5 is a significantly scaled-up language model with 744B parameters, trained on 28.5T tokens using novel asynchronous RL infrastructure. It delivers best-in-class open-source performance on reasoning, coding, and agentic tasks, rivaling frontier closed-source models. GLM-5 is available in both BF16 and native FP8 precisions.

Thinking mode is enabled by default; disable it by passing "chat_template_kwargs": {"enable_thinking": false} in request extras.

Prerequisites

  • vLLM version: 0.19.0 (stable release, preferred over nightly for model performance)
  • Hardware (FP8): 8xH200 or 8xH20 (141GB × 8); see the pre-flight check after this list
  • DeepGEMM (FP8): install via install_deepgemm.sh from the vLLM repo
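
The following is a minimal pre-flight sketch, not an official check, to verify the node roughly matches the FP8 hardware requirement above. It assumes PyTorch is available (it is once vLLM is installed) and only checks GPU count and memory, not interconnect topology.

# Hypothetical pre-flight check: 8 GPUs of H200/H20 (141GB) class.
import torch

REQUIRED_GPUS = 8
MIN_MEM_GIB = 120  # comfortably below the ~141 GB of H200/H20 to absorb reporting differences

count = torch.cuda.device_count()
assert count >= REQUIRED_GPUS, f"need {REQUIRED_GPUS} GPUs, found {count}"

for i in range(count):
    props = torch.cuda.get_device_properties(i)
    mem_gib = props.total_memory / 1024**3
    print(f"GPU {i}: {props.name}, {mem_gib:.0f} GiB")
    assert mem_gib >= MIN_MEM_GIB, f"GPU {i} has only {mem_gib:.0f} GiB"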

Using Docker

docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:glm51 zai-org/GLM-5-FP8 \
    --tensor-parallel-size 8 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --chat-template-content-format=string

Use vllm/vllm-openai:glm51-cu130 for CUDA 13+.
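
Weight loading for a ~744B model takes a while, so the container will not accept traffic immediately. Below is a small readiness-polling sketch that uses the server's /health endpoint and the requests library; the 10-minute budget is an arbitrary choice, adjust as needed.

# Poll the vLLM server until it reports healthy.
import time
import requests

URL = "http://localhost:8000/health"

for _ in range(600):  # wait up to ~10 minutes for weight loading
    try:
        if requests.get(URL, timeout=5).status_code == 200:
            print("server is ready")
            break
    except requests.ConnectionError:
        pass
    time.sleep(1)
else:
    raise RuntimeError("server did not become ready in time")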

Install vLLM from Source

uv venv
source .venv/bin/activate
uv pip install "vllm==0.19.0" --torch-backend=auto
uv pip install "transformers>=5.4.0"

Install DeepGEMM using install_deepgemm.sh from the vLLM tools directory.
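
A quick way to confirm the environment matches the versions listed above (a sketch; run it inside the activated virtualenv):

# Verify vLLM and Transformers versions.
import transformers
import vllm

print("vllm:", vllm.__version__)                   # expect 0.19.0
print("transformers:", transformers.__version__)   # expect >= 5.4.0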

Launching the Server

FP8 on 8xH200 with MTP

vllm serve zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 3 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --chat-template-content-format=string \
  --served-model-name glm-5-fp8
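
Once the server is up, you can confirm it exposes the expected model ID (glm-5-fp8, as set by --served-model-name) before wiring up clients. A minimal sketch with the OpenAI client:

# List model IDs exposed by the running server; expect "glm-5-fp8".
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
print([m.id for m in client.models.list()])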

Client Usage

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Thinking ON (default)
resp_on = client.chat.completions.create(
    model="glm-5-fp8",
    messages=[{"role": "user", "content": "Summarize GLM-5 in one sentence."}],
    temperature=1,
    max_tokens=4096,
)
print("thinking=on, think content:\n", resp_on.choices[0].message.reasoning)

# Thinking OFF
resp_off = client.chat.completions.create(
    model="glm-5-fp8",
    messages=[{"role": "user", "content": "Summarize GLM-5 in one sentence."}],
    temperature=1,
    max_tokens=4096,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print("thinking=off:\n", resp_off.choices[0].message.reasoning)

Benchmarking

vllm bench serve \
  --model zai-org/GLM-5-FP8 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1024 \
  --request-rate 10 \
  --num-prompts 32 \
  --ignore-eos

If the server was launched with --served-model-name glm-5-fp8 as above, also pass --served-model-name glm-5-fp8 so requests resolve to the model ID the server actually exposes.

Because the benchmark feeds random token sequences, the MTP draft acceptance rate can be relatively low, so measured throughput may underestimate real-world speed.
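
To see how well MTP drafts are actually being accepted during a run, you can scrape the server's Prometheus /metrics endpoint. This is a rough sketch: the exact speculative-decoding metric names vary between vLLM versions, so it simply prints every metric line containing "spec".

# Dump speculative-decoding related metrics from the running server.
import requests

metrics = requests.get("http://localhost:8000/metrics", timeout=10).text
for line in metrics.splitlines():
    if "spec" in line and not line.startswith("#"):
        print(line)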

Troubleshooting

  • Accuracy drift: Prefer the 0.19.0 stable release over nightly for best accuracy.
  • Tool calling + MTP: If you need both, use the latest vLLM main branch.
  • FP8 installation: DeepGEMM is required for FP8 performance.

References