{
  "hf_id": "zai-org/GLM-5.1-FP8",
  "meta": {
    "title": "GLM-5.1",
    "slug": "glm-5.1",
    "provider": "GLM (Z-AI)",
    "description": "GLM-5.1 refreshed version of GLM-5 — frontier-scale MoE language model (~744B total parameters) with MTP speculative decoding and thinking mode",
    "date_updated": "2026-05-21",
    "difficulty": "advanced",
    "tasks": [
      "text"
    ],
    "performance_headline": "Refreshed GLM-5 series MoE with improved reasoning, coding, and agentic performance",
    "related_recipes": [],
    "hardware": {
      "h200": "verified"
    },
    "derived_from": "zai-org/GLM-5.1",
    "variant": "fp8"
  },
  "recommended_command": {
    "hardware": "h200",
    "strategy": "single_node_tp",
    "variant": "fp8",
    "node_count": 1,
    "deploy_type": "single_node",
    "env": {},
    "docker_image": "vllm/vllm-openai:latest",
    "command": "vllm serve zai-org/GLM-5.1-FP8 \\\n  --trust-remote-code \\\n  --chat-template-content-format=string \\\n  --tensor-parallel-size 8 \\\n  --tool-call-parser glm47 \\\n  --enable-auto-tool-choice \\\n  --reasoning-parser glm45",
    "argv": [
      "vllm",
      "serve",
      "zai-org/GLM-5.1-FP8",
      "--trust-remote-code",
      "--chat-template-content-format=string",
      "--tensor-parallel-size",
      "8",
      "--tool-call-parser",
      "glm47",
      "--enable-auto-tool-choice",
      "--reasoning-parser",
      "glm45"
    ],
    "docker_command": "docker run --gpus all \\\n  --privileged --ipc=host -p 8000:8000 \\\n  -v ~/.cache/huggingface:/root/.cache/huggingface \\\n  vllm/vllm-openai:latest zai-org/GLM-5.1-FP8 \\\n  --trust-remote-code \\\n  --chat-template-content-format=string \\\n  --tensor-parallel-size 8 \\\n  --tool-call-parser glm47 \\\n  --enable-auto-tool-choice \\\n  --reasoning-parser glm45",
    "docker_argv": [
      "docker",
      "run",
      "--gpus",
      "all",
      "--privileged",
      "--ipc=host",
      "-p",
      "8000:8000",
      "-v",
      "~/.cache/huggingface:/root/.cache/huggingface",
      "vllm/vllm-openai:latest",
      "zai-org/GLM-5.1-FP8",
      "--trust-remote-code",
      "--chat-template-content-format=string",
      "--tensor-parallel-size",
      "8",
      "--tool-call-parser",
      "glm47",
      "--enable-auto-tool-choice",
      "--reasoning-parser",
      "glm45"
    ],
    "strategy_spec": {
      "name": "single_node_tp",
      "deploy_type": "single_node",
      "display_name": "Tensor Parallel",
      "orientation": "latency",
      "description": "Single-node tensor parallel. Splits the model across all local GPUs. TP size is set to the GPU count at deploy time. The simplest multi-GPU strategy — works for all model architectures.\n",
      "hardware_match": {
        "min_gpus": 1,
        "max_gpus": 8,
        "multi_node": false
      },
      "vllm_args": [],
      "parallel_flag": "--tensor-parallel-size"
    },
    "hardware_profile": {
      "brand": "NVIDIA",
      "generation": "hopper",
      "display_name": "H200",
      "description": "NVIDIA H200 SXM 141 GB HBM3e · 8-GPU HGX node",
      "gpu_count": 8,
      "vram_gb": 1128,
      "multi_node": false
    },
    "alternatives": {
      "single_node_tep": "/zai-org/GLM-5.1-FP8/strategies/single_node_tep.json",
      "multi_node_tp": "/zai-org/GLM-5.1-FP8/strategies/multi_node_tp.json",
      "multi_node_tp_pp": "/zai-org/GLM-5.1-FP8/strategies/multi_node_tp_pp.json",
      "multi_node_tep": "/zai-org/GLM-5.1-FP8/strategies/multi_node_tep.json",
      "multi_node_dep": "/zai-org/GLM-5.1-FP8/strategies/multi_node_dep.json",
      "pd_cluster": "/zai-org/GLM-5.1-FP8/strategies/pd_cluster.json"
    },
    "by_hardware": {
      "h200": "/zai-org/GLM-5.1-FP8/hw/h200.json",
      "h100": "/zai-org/GLM-5.1-FP8/hw/h100.json",
      "b200": "/zai-org/GLM-5.1-FP8/hw/b200.json",
      "gb200": "/zai-org/GLM-5.1-FP8/hw/gb200.json",
      "b300": "/zai-org/GLM-5.1-FP8/hw/b300.json",
      "gb300": "/zai-org/GLM-5.1-FP8/hw/gb300.json",
      "mi300x": "/zai-org/GLM-5.1-FP8/hw/mi300x.json",
      "mi325x": "/zai-org/GLM-5.1-FP8/hw/mi325x.json",
      "mi355x": "/zai-org/GLM-5.1-FP8/hw/mi355x.json"
    }
  },
  "model": {
    "model_id": "zai-org/GLM-5.1-FP8",
    "min_vllm_version": "0.19.1",
    "architecture": "moe",
    "parameter_count": "744B",
    "active_parameters": "40B",
    "context_length": 202752,
    "base_args": [
      "--trust-remote-code",
      "--chat-template-content-format=string"
    ],
    "base_env": {},
    "install": {
      "pip": {
        "command": "uv venv\nsource .venv/bin/activate\nuv pip install -U vllm --torch-backend auto"
      },
      "docker": {
        "command": "docker pull vllm/vllm-openai:latest"
      }
    }
  },
  "features": {
    "tool_calling": {
      "description": "GLM-4.7 tool call parser with automatic tool choice",
      "args": [
        "--tool-call-parser",
        "glm47",
        "--enable-auto-tool-choice"
      ]
    },
    "reasoning": {
      "description": "GLM-4.5 reasoning parser — thinking mode enabled by default on requests",
      "args": [
        "--reasoning-parser",
        "glm45"
      ]
    },
    "spec_decoding": {
      "description": "Multi-Token Prediction speculative decoding (3 draft tokens)",
      "args": [
        "--speculative-config.method",
        "mtp",
        "--speculative-config.num_speculative_tokens",
        "3"
      ]
    }
  },
  "opt_in_features": [
    "spec_decoding"
  ],
  "variants": {
    "default": {
      "precision": "fp8",
      "vram_minimum_gb": 893,
      "description": "Native FP8 checkpoint — 8xH200/H20 (141GB × 8) single-node serving"
    }
  },
  "hardware_overrides": {},
  "guide": "## Overview\n\nGLM-5.1 is a refreshed version of GLM-5, the 744B parameter frontier MoE model from Z-AI.\nIt keeps the asynchronous RL training recipe and delivers best-in-class open-source\nperformance on reasoning, coding, and agentic benchmarks. Both BF16 and native FP8\ncheckpoints are published.\n\nThinking mode is enabled by default; disable it by passing\n`\"chat_template_kwargs\": {\"enable_thinking\": false}` in request extras.\n\n## Prerequisites\n\n- **vLLM version:** 0.19.0 (stable — preferred over nightly for model performance).\n  Use the latest main branch if you need tool calling + MTP simultaneously.\n- **Hardware (FP8):** 8xH200 or 8xH20 (141GB × 8)\n- **DeepGEMM (FP8):** install via `install_deepgemm.sh` from vLLM repo\n\n### Using Docker\n\n```bash\ndocker run --gpus all \\\n  -p 8000:8000 \\\n  --ipc=host \\\n  -v ~/.cache/huggingface:/root/.cache/huggingface \\\n  vllm/vllm-openai:latest zai-org/GLM-5.1-FP8 \\\n    --tensor-parallel-size 8 \\\n    --tool-call-parser glm47 \\\n    --reasoning-parser glm45 \\\n    --enable-auto-tool-choice \\\n    --chat-template-content-format=string \\\n    --served-model-name glm-5.1-fp8\n```\n\nUse `vllm/vllm-openai:latest-cu129` for CUDA 12.x.\n\n### Install vLLM from Source\n\n```bash\nuv venv\nsource .venv/bin/activate\nuv pip install \"vllm==0.19.0\" --torch-backend=auto\nuv pip install \"transformers>=5.4.0\"\n```\n\n## Launching the Server\n\n### FP8 on 8xH200 with MTP\n\n```bash\nvllm serve zai-org/GLM-5.1-FP8 \\\n     --tensor-parallel-size 8 \\\n     --speculative-config.method mtp \\\n     --speculative-config.num_speculative_tokens 3 \\\n     --tool-call-parser glm47 \\\n     --reasoning-parser glm45 \\\n     --enable-auto-tool-choice \\\n     --chat-template-content-format=string \\\n     --served-model-name glm-5.1-fp8\n```\n\n## Client Usage\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8000/v1\")\n\n# Thinking ON (default)\nresp_on = client.chat.completions.create(\n    model=\"glm-5.1-fp8\",\n    messages=[{\"role\": \"user\", \"content\": \"Summarize GLM-5.1 in one sentence.\"}],\n    temperature=1,\n    max_tokens=4096,\n)\nprint(resp_on.choices[0].message.reasoning)\n\n# Thinking OFF\nresp_off = client.chat.completions.create(\n    model=\"glm-5.1-fp8\",\n    messages=[{\"role\": \"user\", \"content\": \"Summarize GLM-5.1 in one sentence.\"}],\n    temperature=1,\n    max_tokens=4096,\n    extra_body={\"chat_template_kwargs\": {\"enable_thinking\": False}},\n)\n```\n\n### cURL (Thinking ON)\n\n```bash\ncurl http://localhost:8000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"glm-5.1-fp8\",\n    \"messages\": [\n      {\"role\": \"user\", \"content\": \"Summarize GLM-5.1 in one sentence.\"}\n    ],\n    \"temperature\": 1,\n    \"max_tokens\": 4096\n  }'\n```\n\n## Benchmarking\n\n```bash\nvllm bench serve \\\n  --model zai-org/GLM-5.1-FP8 \\\n  --dataset-name random \\\n  --random-input 8000 \\\n  --random-output 1024 \\\n  --request-rate 10 \\\n  --num-prompts 32 \\\n  --ignore-eos\n```\n\n## Troubleshooting\n\n- **Accuracy drift:** Prefer the 0.19.0 stable release for best accuracy.\n- **Tool calling + MTP:** If both are needed, use the latest vLLM main branch.\n- **FP8 installation:** DeepGEMM required for FP8 performance.\n\n## References\n\n- [Model card](https://huggingface.co/zai-org/GLM-5.1)\n- [FP8 checkpoint](https://huggingface.co/zai-org/GLM-5.1-FP8)\n- [DeepGEMM install script](https://github.com/vllm-project/vllm/blob/v0.16.0rc0/tools/install_deepgemm.sh)\n"
}