zai-org/GLM-4.5
GLM-4.5 MoE language model (~358B total parameters, BF16) with built-in MTP layers for speculative decoding and native tool calling
Overview
GLM-4.5 is a Mixture-of-Experts language model from Z-AI with ~358B total parameters. The checkpoint ships in both BF16 and native FP8 formats. FP8 models have minimal accuracy loss, so unless you need strict reproducibility for benchmarking, FP8 is the recommended precision for lower-cost serving. All GLM-4.X models include built-in Multi-Token Prediction (MTP) layers that enable speculative decoding for higher generation throughput.
Prerequisites
- vLLM version: >= 0.11.0 (latest stable recommended)
- Hardware: 8x H200 (BF16) or 4x-8x H200 (FP8), AMD MI300X / MI325X / MI355X for ROCm
- Python: 3.10 - 3.13 (3.12 required for ROCm wheels)
Install vLLM (NVIDIA)
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
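A quick sanity check after installing (assumes the virtual environment above is active):

```shell
# Confirm the wheel imports cleanly and report the installed version
python -c "import vllm; print(vllm.__version__)"
```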
Install vLLM (AMD ROCm)
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm
The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35.
Launching the Server
Tensor Parallel (FP8 on 8 GPUs)
vllm serve zai-org/GLM-4.5-FP8 \
--tensor-parallel-size 8 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice
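Once the server reports it is ready, you can sanity-check the OpenAI-compatible API from another shell before pointing clients at it (assumes the default port 8000):

```shell
# Should list zai-org/GLM-4.5-FP8 among the served models
curl -s http://localhost:8000/v1/models
```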
Enabling MTP Speculative Decoding
vllm serve zai-org/GLM-4.5-FP8 \
--tensor-parallel-size 4 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice
Use --speculative-config.num_speculative_tokens 1 for the best throughput in most workloads. Higher values increase the mean accepted length per step, but the per-token acceptance rate drops significantly, wasting draft compute.
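For intuition on why returns diminish, here is a toy model (not vLLM internals; it assumes each draft token is accepted independently with a fixed probability p, which is a simplification):

```python
def expected_accepted(k: int, p: float) -> float:
    """Expected number of accepted draft tokens per step when k tokens are
    drafted and each is accepted independently with probability p; drafts
    after the first rejection are discarded."""
    # Draft token i (1-indexed) survives only if tokens 1..i all pass: p**i.
    return sum(p**i for i in range(1, k + 1))

for k in (1, 2, 4, 8):
    print(k, round(expected_accepted(k, 0.85), 3))
```

With p fixed, each additional speculative token contributes only p^i to the expected accepted length, so the marginal gain shrinks geometrically while draft cost grows linearly; in practice the effective acceptance rate also falls as k grows, which is why num_speculative_tokens 1 tends to win.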
Tuning Tips
- --max-model-len=65536 works well for most scenarios; the maximum is 128K.
- --max-num-batched-tokens=32768 is a good default for prompt-heavy workloads. Reduce to 16384 or 8192 to cut activation memory and latency.
- Set --gpu-memory-utilization=0.95 to maximize KV cache headroom.
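Combining the tips above, a sketch of a tuned launch command (flag values are the defaults suggested in this section, not the only valid ones):

```shell
vllm serve zai-org/GLM-4.5-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 65536 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.95 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice
```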
Client Usage
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
model="zai-org/GLM-4.5-FP8",
messages=[{"role": "user", "content": "Explain MTP speculative decoding."}],
max_tokens=512,
)
print(resp.choices[0].message.content)
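Because the server is launched with tool calling enabled, the same client can exercise it. A sketch (get_weather is a hypothetical tool defined only for illustration; requires the server above to be running):

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Hypothetical tool schema in the standard OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="zai-org/GLM-4.5-FP8",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)

# With --tool-call-parser glm45, parsed calls land in message.tool_calls.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```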
Benchmarking
Disable prefix caching with --no-enable-prefix-caching on the server command,
then:
vllm bench serve \
--model zai-org/GLM-4.5-FP8 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos
Troubleshooting
- Tool calling not firing: Ensure --tool-call-parser glm45 and --enable-auto-tool-choice are both present.
- MTP memory overhead: MTP adds memory for draft computations. Monitor GPU memory and reduce --max-model-len or --max-num-batched-tokens if you hit OOM.
- Low MTP acceptance: If the acceptance rate is below ~90%, drop num_speculative_tokens to 1.