{
  "hf_id": "zai-org/GLM-5.2",
  "meta": {
    "title": "GLM-5.2",
    "slug": "glm-5.2",
    "provider": "GLM (Z-AI)",
    "description": "GLM-5.2 — frontier-scale MoE language model (~743B total parameters, 39B active) with up to 5-token MTP speculative decoding and thinking mode",
    "date_updated": "2026-06-27",
    "difficulty": "advanced",
    "tasks": [
      "text"
    ],
    "performance_headline": "Latest GLM-5 series MoE with extended MTP (5 draft tokens), improved reasoning and agentic performance",
    "related_recipes": [
      "zai-org/GLM-5",
      "zai-org/GLM-5.1"
    ],
    "hardware": {
      "b300": "verified",
      "b200": "verified",
      "mi300x": "verified",
      "mi355x": "verified"
    }
  },
  "recommended_command": {
    "hardware": "h200",
    "strategy": "single_node_tp",
    "variant": "default",
    "node_count": 1,
    "deploy_type": "single_node",
    "env": {},
    "docker_image": "vllm/vllm-openai:v0.23.0",
    "command": "vllm serve zai-org/GLM-5.2-FP8 \\\n  --tensor-parallel-size 8 \\\n  --kv-cache-dtype fp8 \\\n  --tool-call-parser glm47 \\\n  --enable-auto-tool-choice \\\n  --reasoning-parser glm45",
    "argv": [
      "vllm",
      "serve",
      "zai-org/GLM-5.2-FP8",
      "--tensor-parallel-size",
      "8",
      "--kv-cache-dtype",
      "fp8",
      "--tool-call-parser",
      "glm47",
      "--enable-auto-tool-choice",
      "--reasoning-parser",
      "glm45"
    ],
    "docker_command": "docker run --gpus all \\\n  --privileged --ipc=host -p 8000:8000 \\\n  -v ~/.cache/huggingface:/root/.cache/huggingface \\\n  vllm/vllm-openai:v0.23.0 zai-org/GLM-5.2-FP8 \\\n  --tensor-parallel-size 8 \\\n  --kv-cache-dtype fp8 \\\n  --tool-call-parser glm47 \\\n  --enable-auto-tool-choice \\\n  --reasoning-parser glm45",
    "docker_argv": [
      "docker",
      "run",
      "--gpus",
      "all",
      "--privileged",
      "--ipc=host",
      "-p",
      "8000:8000",
      "-v",
      "~/.cache/huggingface:/root/.cache/huggingface",
      "vllm/vllm-openai:v0.23.0",
      "zai-org/GLM-5.2-FP8",
      "--tensor-parallel-size",
      "8",
      "--kv-cache-dtype",
      "fp8",
      "--tool-call-parser",
      "glm47",
      "--enable-auto-tool-choice",
      "--reasoning-parser",
      "glm45"
    ],
    "strategy_spec": {
      "name": "single_node_tp",
      "deploy_type": "single_node",
      "display_name": "Tensor Parallel",
      "orientation": "latency",
      "description": "Single-node tensor parallel. Splits the model across all local GPUs. TP size is set to the GPU count at deploy time. The simplest multi-GPU strategy — works for all model architectures.\n",
      "hardware_match": {
        "min_gpus": 1,
        "max_gpus": 8,
        "multi_node": false
      },
      "vllm_args": [],
      "parallel_flag": "--tensor-parallel-size"
    },
    "hardware_profile": {
      "brand": "NVIDIA",
      "generation": "hopper",
      "display_name": "H200",
      "description": "NVIDIA H200 SXM 141 GB HBM3e · 8-GPU HGX node",
      "gpu_count": 8,
      "vram_gb": 1128,
      "multi_node": false
    },
    "alternatives": {
      "single_node_tep": "/zai-org/GLM-5.2/strategies/single_node_tep.json",
      "multi_node_tp": "/zai-org/GLM-5.2/strategies/multi_node_tp.json",
      "multi_node_tp_pp": "/zai-org/GLM-5.2/strategies/multi_node_tp_pp.json",
      "multi_node_tep": "/zai-org/GLM-5.2/strategies/multi_node_tep.json",
      "multi_node_dep": "/zai-org/GLM-5.2/strategies/multi_node_dep.json",
      "pd_cluster": "/zai-org/GLM-5.2/strategies/pd_cluster.json"
    },
    "by_hardware": {
      "h200": "/zai-org/GLM-5.2/hw/h200.json",
      "h100": "/zai-org/GLM-5.2/hw/h100.json",
      "b200": "/zai-org/GLM-5.2/hw/b200.json",
      "gb200": "/zai-org/GLM-5.2/hw/gb200.json",
      "b300": "/zai-org/GLM-5.2/hw/b300.json",
      "gb300": "/zai-org/GLM-5.2/hw/gb300.json",
      "mi300x": "/zai-org/GLM-5.2/hw/mi300x.json",
      "mi325x": "/zai-org/GLM-5.2/hw/mi325x.json",
      "mi355x": "/zai-org/GLM-5.2/hw/mi355x.json"
    }
  },
  "model": {
    "model_id": "zai-org/GLM-5.2",
    "min_vllm_version": "0.23.0",
    "docker_image": {
      "nvidia": "vllm/vllm-openai:v0.23.0",
      "amd": "vllm/vllm-openai-rocm:nightly-4c626633159887b0f2c962058c17c78f1434556d"
    },
    "architecture": "moe",
    "parameter_count": "743B",
    "active_parameters": "39B",
    "context_length": 1048576,
    "base_args": [
      "--kv-cache-dtype",
      "fp8_e4m3"
    ],
    "base_env": {},
    "install": {
      "pip": {
        "command": "uv venv\nsource .venv/bin/activate\nuv pip install -U vllm --torch-backend auto"
      },
      "docker": {
        "command": "docker pull vllm/vllm-openai:v0.23.0"
      }
    }
  },
  "features": {
    "tool_calling": {
      "description": "GLM-4.7 tool call parser with automatic tool choice",
      "args": [
        "--tool-call-parser",
        "glm47",
        "--enable-auto-tool-choice"
      ]
    },
    "reasoning": {
      "description": "GLM-4.5 reasoning parser — thinking mode enabled by default on requests",
      "args": [
        "--reasoning-parser",
        "glm45"
      ]
    },
    "spec_decoding": {
      "description": "Multi-Token Prediction speculative decoding (up to 5 draft tokens)",
      "args": [
        "--speculative-config.method",
        "mtp",
        "--speculative-config.num_speculative_tokens",
        "5"
      ]
    }
  },
  "opt_in_features": [
    "spec_decoding"
  ],
  "variants": {
    "default": {
      "model_id": "zai-org/GLM-5.2-FP8",
      "precision": "fp8",
      "vram_minimum_gb": 893,
      "description": "Native FP8 checkpoint — 8xH200/H20 (141GB × 8) single-node serving"
    },
    "nvfp4": {
      "model_id": "nvidia/GLM-5.2-NVFP4",
      "precision": "nvfp4",
      "vram_minimum_gb": 558,
      "description": "NVIDIA modelopt NVFP4 checkpoint — only MoE expert linears are quantized to NVFP4; shared experts, attention, embeddings, and the early dense layers stay FP8/BF16, with FP8 KV cache. Blackwell GPUs (B200/B300) only.",
      "json": "/nvidia/GLM-5.2-NVFP4.json"
    },
    "bf16": {
      "precision": "bf16",
      "vram_minimum_gb": 1786,
      "description": "Full precision BF16 — requires multi-node deployment"
    }
  },
  "hardware_overrides": {
    "hopper": {
      "extra_args": [
        "--kv-cache-dtype",
        "fp8"
      ]
    },
    "amd": {
      "extra_args": [
        "--linear-backend",
        "aiter",
        "--moe-backend",
        "aiter"
      ],
      "extra_env": {
        "VLLM_ROCM_USE_AITER": "1",
        "VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS": "1"
      }
    }
  },
  "guide": "## Overview\n\nGLM-5.2 is the newest model in the GLM-5 series — a ~743B-parameter MoE (39B active) from\nZ-AI. The\nheadline change over GLM-5 / 5.1 is that **Multi-Token Prediction (MTP) is extended from 3 to\n5 draft tokens**, lifting end-to-end throughput on reasoning, coding, and agentic workloads.\nIt ships as BF16 and native-FP8 checkpoints and keeps the GLM thinking-mode behavior.\n\nThis recipe targets the **FP8** checkpoint, the practical default: it fits on a single\n8xH200 / 8xH20 node and — with FP8 KV cache — reaches the full 1M-token context on 8xB200.\n\n## Prerequisites\n\n- **vLLM 0.23.0** (stable). If you need tool calling and MTP at the same time, use the latest\n  `main` branch.\n- **GPU:** 8xH200 or 8xH20 (141 GB each) for single-node FP8; 8xB200 (180 GB each) for the\n  full 1M context.\n\n## Installation\n\n### Docker\n\n```bash\ndocker run --gpus all \\\n  -p 8000:8000 \\\n  --ipc=host \\\n  -v ~/.cache/huggingface:/root/.cache/huggingface \\\n  vllm/vllm-openai:glm52 zai-org/GLM-5.2-FP8 \\\n    --tensor-parallel-size 8 \\\n    --tool-call-parser glm47 \\\n    --reasoning-parser glm45 \\\n    --enable-auto-tool-choice \\\n    --served-model-name glm-5.2-fp8 \\\n    --kv-cache-dtype fp8\n```\n\nOn CUDA 12.x, swap the image for `vllm/vllm-openai:glm52-cu129`.\n\n### From source\n\n```bash\nuv venv\nsource .venv/bin/activate\nuv pip install \"vllm==0.23.0\" --torch-backend=auto\nuv pip install \"transformers>=5.9.0\"\n```\n\n## Launching the server\n\n### FP8 on 8xH200 (standard)\n\n```bash\nvllm serve zai-org/GLM-5.2-FP8 \\\n  --kv-cache-dtype fp8 \\\n  --tensor-parallel-size 8 \\\n  --speculative-config.method mtp \\\n  --speculative-config.num_speculative_tokens 5 \\\n  --tool-call-parser glm47 \\\n  --reasoning-parser glm45 \\\n  --enable-auto-tool-choice \\\n  --served-model-name glm-5.2-fp8\n```\n\n### FP8 on AMD MI300X/MI355X (full 1M context)\n\nGLM-5.2 has a native 1M-token window. 
Whether the full window fits on ROCm is a\nKV-cache VRAM question, so the levers are `--max-model-len` and `--max-num-seqs`.\nStart with the values below, then scale them with your node's HBM and workload:\nraise the context window when startup reports KV-cache headroom, or lower the\nsequence cap if long prompts OOM under concurrency.\n\n```bash\nVLLM_ROCM_USE_AITER=1 \\\nVLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1 \\\nvllm serve zai-org/GLM-5.2-FP8 \\\n  --kv-cache-dtype fp8_e4m3 \\\n  --tensor-parallel-size 8 \\\n  --speculative-config.method mtp \\\n  --speculative-config.num_speculative_tokens 5 \\\n  --tool-call-parser glm47 \\\n  --enable-auto-tool-choice \\\n  --reasoning-parser glm45 \\\n  --gpu-memory-utilization 0.80 \\\n  --max-model-len 524288 \\\n  --max-num-seqs 32 \\\n  --linear-backend aiter \\\n  --moe-backend aiter\n```\n\n- **`--max-model-len`** — caps the served context window; raise it toward the native\n  1M window when your HBM budget and workload leave KV-cache headroom.\n- **`--max-num-seqs 32`** — the main knob for fitting long context under concurrency;\n  start at 32 and tune it to your HBM (up on headroom, down on OOM).\n- **`--gpu-memory-utilization 0.80`** — leaves ROCm runtime headroom for MTP graph capture\n  and inference; raise it only after a representative concurrent smoke test.\n- **`--speculative-config.num_speculative_tokens 5`** — enables GLM-5.2's 5-token MTP\n  path; reduce it if your workload shows low acceptance or higher latency.\n- **`VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1`** — enables the shared-expert fused\n  MoE path. `VLLM_ROCM_USE_AITER_LINEAR` and `VLLM_ROCM_USE_AITER_MOE` default to enabled\n  when `VLLM_ROCM_USE_AITER=1`.\n\n### FP8 on 8xB200 (full 1M context)\n\nGLM-5.2 has a native 1M-token window. Whether the full window fits is a KV-cache VRAM\nquestion, so the lever is `--max-num-seqs` — it bounds how many sequences share the KV budget\nat once, leaving room for each to hold a long context. 
**Start at 32** and scale with your\nnode's VRAM: raise it on larger trays (e.g. 8xB300) or short-prompt traffic, lower it if you\nOOM at full context. FP8 KV cache (`--kv-cache-dtype fp8_e4m3`, already in the base flags)\nroughly halves that budget, which is what makes 1M reachable at all.\n\n```bash\nVLLM_DEEP_GEMM_WARMUP=skip vllm serve zai-org/GLM-5.2-FP8 \\\n  --kv-cache-dtype fp8_e4m3 \\\n  --tensor-parallel-size 8 \\\n  --speculative-config.method mtp \\\n  --speculative-config.num_speculative_tokens 5 \\\n  --max-num-seqs 32 \\\n  --tool-call-parser glm47 \\\n  --reasoning-parser glm45 \\\n  --enable-auto-tool-choice \\\n  --served-model-name glm-5.2-fp8\n```\n\n- **`--max-num-seqs 32`** — the single knob for fitting 1M context; start at 32 and tune it to\n  your VRAM (up on headroom, down on OOM).\n- **`VLLM_DEEP_GEMM_WARMUP=skip`** — skips DeepGEMM JIT warmup for a faster startup; the first\n  few requests compile kernels on demand instead.\n- **BF16** needs multi-node plus an extra loader flag — see [Troubleshooting](#troubleshooting).\n\n### NVFP4 on Blackwell (B200/B300)\n\nThe `nvidia/GLM-5.2-NVFP4` variant is NVIDIA's modelopt re-quantization: only the MoE\nexpert linears drop to NVFP4 while shared experts, attention, embeddings, and the early\ndense layers stay FP8/BF16, with FP8 KV cache. The ~465 GB checkpoint fits comfortably on\nBlackwell — vLLM auto-detects the quantization from the checkpoint, so no `--quantization`\nflag is needed. Select the **NVFP4** variant above (Blackwell-only) or run NVIDIA's command:\n\n```bash\nvllm serve nvidia/GLM-5.2-NVFP4 \\\n  --tensor-parallel-size 8 \\\n  --enable-expert-parallel \\\n  --reasoning-parser glm45 \\\n  --tool-call-parser glm47 \\\n  --enable-auto-tool-choice \\\n  --kv-cache-dtype fp8_e4m3 \\\n  --served-model-name glm-5.2-nvfp4\n```\n\n## Reasoning modes\n\nThinking is **on by default**. 
GLM-5.2 reuses the DeepSeek-V4 `reasoning_effort` mechanism,\nwith two effort levels driven by the `reasoning_effort` field:\n\n| Mode | How to request | Behavior |\n|------|----------------|----------|\n| **Think Max** (default) | omit `reasoning_effort`, or set `\"max\"` | Deepest reasoning — hard math, multi-step planning, agentic tasks. Highest token cost. |\n| **Think High** | `\"reasoning_effort\": \"high\"` | Balanced depth and latency. |\n| **Non-think** | `chat_template_kwargs.enable_thinking: false` | Fast, no chain-of-thought. |\n\nThe chat template resolves effort to **`max` unless `reasoning_effort` is explicitly `\"high\"`**,\nso Max is the default and High is opt-in. Pass it through `chat_template_kwargs` (the\nDeepSeek-V4 path) or the top-level OpenAI `reasoning_effort` field; it has no effect when\nthinking is disabled.\n\n## Client usage\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8000/v1\")\nmsgs = [{\"role\": \"user\", \"content\": \"Summarize GLM-5.2 in one sentence.\"}]\n\n# Think Max (default) — just omit reasoning_effort\nclient.chat.completions.create(model=\"glm-5.2-fp8\", messages=msgs, max_tokens=4096)\n\n# Think High — explicitly request reasoning_effort: \"high\"\nclient.chat.completions.create(\n    model=\"glm-5.2-fp8\",\n    messages=msgs,\n    max_tokens=4096,\n    extra_body={\"chat_template_kwargs\": {\"reasoning_effort\": \"high\"}},\n)\n\n# Non-think\nclient.chat.completions.create(\n    model=\"glm-5.2-fp8\",\n    messages=msgs,\n    max_tokens=4096,\n    extra_body={\"chat_template_kwargs\": {\"enable_thinking\": False}},\n)\n```\n\n### cURL — Think High\n\n```bash\ncurl http://localhost:8000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"glm-5.2-fp8\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Summarize GLM-5.2 in one sentence.\"}],\n    \"temperature\": 1,\n    \"max_tokens\": 4096,\n    
\"chat_template_kwargs\": {\"reasoning_effort\": \"high\"}\n  }'\n```\n\n### cURL — non-think\n\n```bash\ncurl http://localhost:8000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"glm-5.2-fp8\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Summarize GLM-5.2 in one sentence.\"}],\n    \"temperature\": 1,\n    \"max_tokens\": 4096,\n    \"chat_template_kwargs\": {\"enable_thinking\": false}\n  }'\n```\n\n## Benchmarking\n\nAdd `--no-enable-prefix-caching` to the server command for a clean measurement. If the\nserver was started with `--served-model-name glm-5.2-fp8`, also pass\n`--served-model-name glm-5.2-fp8` to the benchmark so requests use the served name.\n\n```bash\nvllm bench serve \\\n  --model zai-org/GLM-5.2-FP8 \\\n  --dataset-name random \\\n  --random-input-len 8000 \\\n  --random-output-len 1024 \\\n  --request-rate 10 \\\n  --num-prompts 32 \\\n  --ignore-eos\n```\n\n> **Note:** pure throughput benchmarks tend to under-report real speed, because MTP's\n> acceptance rate is usually low in synthetic runs.\n\n## Troubleshooting\n\n- **FP8 performance:** DeepGEMM is required — install via `install_deepgemm.sh`.\n- **MTP performance:** Some MTP acceptance-rate issues were fixed in [this PR](https://github.com/vllm-project/vllm/pull/45895). If you see low MTP acceptance rates, update your branch or use the [GLM-5.2 Docker image](https://hub.docker.com/r/vllm/vllm-openai/tags?name=glm52).\n\n## References\n\n- [Model card](https://huggingface.co/zai-org/GLM-5.2)\n- [FP8 checkpoint](https://huggingface.co/zai-org/GLM-5.2-FP8)\n- [NVFP4 checkpoint (NVIDIA)](https://huggingface.co/nvidia/GLM-5.2-NVFP4)\n"
}