zai-org/GLM-GA

GLM-GA is a dense vision-language model (~10B) based on GLM-4.6V-Flash. It handles image and video understanding with a 128K context and a dedicated Glmga video processor (fps=2, up to 640 frames) for long videos.

dense | 10B | 131,072 ctx | vLLM 0.21.0+ | multimodal

Overview

GLM-GA is a dense vision-language model (~10B parameters) based on the GLM-4.6V-Flash architecture. It shares the same Glm4vForConditionalGeneration model class as GLM-4.6V but uses dedicated GlmgaImageProcessor and GlmgaVideoProcessor sub-processors. The key difference is in video processing: GLM-GA samples at a fixed 2 fps and supports up to 640 frames, enabling long-video understanding.

Prerequisites

  • vLLM version: >= 0.21.0
  • Transformers: latest, installed from source (required for Glmga processor support)
  • Hardware: 1x H100/H200 or equivalent (BF16)
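
To confirm the version requirement above before launching, a small check along these lines may help. It is a minimal sketch: the helper names (release_tuple, meets_minimum, installed_vllm_ok) are hypothetical, and the parsing only looks at the leading numeric part of the version string, so a pre-release such as 0.21.0rc1 is treated as meeting the minimum.

```python
import re
from importlib import metadata

MIN_VLLM = (0, 21, 0)  # minimum version listed in the prerequisites


def release_tuple(version: str) -> tuple:
    """Extract the leading numeric release segment, e.g. '0.21.0rc1' -> (0, 21, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    # Pad short versions such as '0.21' so tuple comparison behaves as expected.
    return parts + (0,) * (3 - len(parts))


def meets_minimum(version: str, minimum: tuple = MIN_VLLM) -> bool:
    """Return True if the release segment is at least the minimum."""
    return release_tuple(version) >= minimum


def installed_vllm_ok() -> bool:
    """Check the vLLM package installed in the current environment, if any."""
    try:
        return meets_minimum(metadata.version("vllm"))
    except metadata.PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("vLLM version OK:", installed_vllm_ok())
```

Run it inside the same virtual environment that will host the server, since the check only sees packages installed there.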

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
uv pip install git+https://github.com/huggingface/transformers.git

Launching the Server

Single GPU

export VLLM_VIDEO_LOADER_BACKEND=glm4_6v
vllm serve zai-org/GLM-GA \
     --tool-call-parser glm45 \
     --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --allowed-local-media-path / \
     --mm-processor-cache-type shm

Tuning Tips

  • --max-model-len=65536 is a common starting point; raise it to 131072 to use the full 128K context.
  • --max-num-batched-tokens=32768 for prompt-heavy workloads.
  • --mm-processor-cache-type shm for efficient vision processing.
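
When experimenting with these values, it can be convenient to assemble the launch command programmatically. The sketch below is illustrative only: build_serve_command is a hypothetical helper, it uses only flags named in this guide, and it omits the parser and media-path flags from the launch command for brevity.

```python
import shlex
from typing import List, Optional


def build_serve_command(
    model: str = "zai-org/GLM-GA",
    max_model_len: int = 65536,
    max_num_batched_tokens: int = 32768,
    gpu_memory_utilization: Optional[float] = None,
) -> List[str]:
    """Assemble a `vllm serve` argument list from the tuning values."""
    cmd = [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--mm-processor-cache-type", "shm",
    ]
    # Only pass the memory flag when the caller sets it explicitly.
    if gpu_memory_utilization is not None:
        cmd += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    return cmd


if __name__ == "__main__":
    # Print a copy-pasteable command for the full 128K context.
    print(shlex.join(build_serve_command(max_model_len=131072)))
```

The returned list can be passed to subprocess.run directly, which avoids shell quoting issues.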

Client Usage

Image Understanding

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="zai-org/GLM-GA",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
            {"type": "text", "text": "Describe the image."}
        ]
    }],
    max_tokens=512,
)
print(resp.choices[0].message.content)
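
For an image that lives on the client machine rather than at a public URL, one common approach is to embed it as a base64 data URL in the same image_url field. The following is a minimal sketch under that assumption; image_to_data_url and image_message are hypothetical helper names, and the MIME type is guessed from the file extension.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(path: str, prompt: str) -> dict:
    """Build a user message with the same shape as the URL-based example."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
            {"type": "text", "text": prompt},
        ],
    }
```

The resulting dict can be passed as the single element of messages in client.chat.completions.create. Large images inflate the request body by roughly a third because of base64 encoding.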

Video Understanding

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="zai-org/GLM-GA",
    messages=[{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}},
            {"type": "text", "text": "Summarize what happens in this video."}
        ]
    }],
    max_tokens=1024,
)
print(resp.choices[0].message.content)
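
Because the launch command sets --allowed-local-media-path, a video stored on the server's filesystem can likely be referenced with a file:// URL instead of an HTTP URL. The sketch below assumes the path exists on the machine running vLLM and falls under the allowed directory; local_video_message is a hypothetical helper name.

```python
from pathlib import Path


def local_video_message(path: str, prompt: str) -> dict:
    """Build a user message that points at a video file on the server host."""
    # as_uri() needs an absolute path, so resolve() it first.
    video_url = Path(path).resolve().as_uri()
    return {
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": video_url}},
            {"type": "text", "text": prompt},
        ],
    }
```

Pass the returned dict as the single element of messages in the request. If the client and server run on different machines, the path must be valid on the server, not on the client.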

Video Processing Details

GLM-GA uses a dedicated GlmgaVideoProcessor that differs from GLM-4.6V:

  • Fixed 2 fps sampling (GLM-4.6V uses dynamic fps based on duration)
  • Up to 640 frames per video
  • Frame upsampling uses math.floor alignment matching the HuggingFace reference
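
The two numbers above imply a simple budget: at 2 fps, 640 frames correspond to 320 seconds of video sampled at the full rate. The sketch below only works through that arithmetic; it does not reproduce GlmgaVideoProcessor, and how the processor treats clips longer than 320 seconds or aligns frame counts is not modeled here. The function names are hypothetical.

```python
import math

FPS = 2           # fixed sampling rate stated above
MAX_FRAMES = 640  # per-video frame cap stated above

# Longest clip that can be sampled at the full rate: 640 / 2 = 320 seconds.
MAX_FULL_RATE_SECONDS = MAX_FRAMES / FPS


def nominal_frame_count(duration_s: float, fps: int = FPS) -> int:
    """Frames that fixed-rate sampling would produce, before any cap."""
    return math.floor(duration_s * fps)


def fits_frame_cap(duration_s: float, fps: int = FPS, max_frames: int = MAX_FRAMES) -> bool:
    """True if the clip stays within the frame cap at the fixed rate."""
    return nominal_frame_count(duration_s, fps) <= max_frames


if __name__ == "__main__":
    for seconds in (90, 320, 600):
        print(seconds, nominal_frame_count(seconds), fits_frame_cap(seconds))
```

A check like this can help decide whether to trim or split a long recording before sending it.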

Troubleshooting

  • Long-context memory: At 128K context, tune --max-num-batched-tokens and --gpu-memory-utilization to prevent OOM.
  • Video loading errors: Ensure OpenCV or PyAV is installed for video decoding.
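
To narrow down video loading errors, it may help to check which decoding libraries are importable in the server's environment. This is a minimal sketch using only the standard library; available_video_decoders is a hypothetical helper, and cv2 and av are the import names of OpenCV and PyAV.

```python
import importlib.util


def available_video_decoders() -> dict:
    """Report whether OpenCV (cv2) and PyAV (av) can be imported here."""
    candidates = {"OpenCV": "cv2", "PyAV": "av"}
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in candidates.items()
    }


if __name__ == "__main__":
    for name, present in available_video_decoders().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

Run it with the same Python interpreter that launches vllm serve, since a library installed in a different environment will not be visible to the server.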
