zai-org/GLM-GA
GLM-GA dense vision-language model (~10B) — image and video understanding with 128K context and dedicated Glmga video processor (fps=2, up to 640 frames)
Dense VLM based on GLM-4.6V-Flash with dedicated video processor supporting long videos up to 640 frames
Overview
GLM-GA is a dense vision-language model (~10B parameters) based on the
GLM-4.6V-Flash architecture. It shares the same Glm4vForConditionalGeneration
model class as GLM-4.6V but uses dedicated GlmgaImageProcessor and
GlmgaVideoProcessor sub-processors. The key difference is in video processing:
GLM-GA samples at a fixed 2 fps and supports up to 640 frames, enabling
long-video understanding.
Prerequisites
- vLLM version: >= 0.21.0
- Transformers: latest from source (required for Glmga processor support)
- Hardware: 1x H100/H200 or equivalent (BF16)
Install vLLM
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
uv pip install git+https://github.com/huggingface/transformers.git
Launching the Server
Single GPU
export VLLM_VIDEO_LOADER_BACKEND=glm4_6v
vllm serve zai-org/GLM-GA \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --allowed-local-media-path / \
  --mm-processor-cache-type shm
Tuning Tips
- --max-model-len=65536 is a common default; you can push to 131072.
- --max-num-batched-tokens=32768 for prompt-heavy workloads.
- --mm-processor-cache-type shm for efficient vision processing.
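To make the tuning flags easier to vary between runs, the launch command can be assembled programmatically. The following is a minimal sketch, not part of the official recipe: `build_serve_command` is a hypothetical helper, and the `gpu_memory_utilization` value of 0.90 is an assumed starting point rather than a recommendation from this guide.

```python
import shlex


def build_serve_command(
    model="zai-org/GLM-GA",
    max_model_len=65536,            # common default from the tips above; up to 131072
    max_num_batched_tokens=32768,   # suggested for prompt-heavy workloads
    gpu_memory_utilization=0.90,    # assumed value, tune for your GPU
    mm_processor_cache_type="shm",
):
    """Hypothetical helper: build the `vllm serve` argument list with tuning flags."""
    return [
        "vllm", "serve", model,
        "--tool-call-parser", "glm45",
        "--reasoning-parser", "glm45",
        "--enable-auto-tool-choice",
        "--max-model-len", str(max_model_len),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--mm-processor-cache-type", mm_processor_cache_type,
    ]


# Print a shell-ready command line for the full 128K context.
print(shlex.join(build_serve_command(max_model_len=131072)))
```

The printed command can be pasted into a shell or passed to `subprocess.run`. The `VLLM_VIDEO_LOADER_BACKEND=glm4_6v` environment variable from the launch section still has to be set in the environment that starts the server.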
Client Usage
Image Understanding
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="zai-org/GLM-GA",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
            {"type": "text", "text": "Describe the image."}
        ]
    }],
    max_tokens=512,
)
print(resp.choices[0].message.content)
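When the image is on local disk rather than at a public URL, one option is to embed it in the request as a base64 data URL. The sketch below shows that variant. `image_to_data_url` and `image_message` are hypothetical helper names introduced for illustration, and the fallback MIME type is an assumption.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None:
        mime = "application/octet-stream"  # assumed fallback for unknown extensions
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(path, prompt):
    """Build a user message with the same shape as the image example above."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
            {"type": "text", "text": prompt},
        ],
    }
```

The returned dictionary can be passed as the single element of `messages` in `client.chat.completions.create(...)`, with the rest of the request unchanged.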
Video Understanding
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="zai-org/GLM-GA",
    messages=[{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}},
            {"type": "text", "text": "Summarize what happens in this video."}
        ]
    }],
    max_tokens=1024,
)
print(resp.choices[0].message.content)
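Because the server above is launched with `--allowed-local-media-path /`, a video stored on the server's filesystem can be referenced by a `file://` URL instead of an HTTP URL. The following is a minimal sketch of building such a message. `local_video_message` is a hypothetical helper, and the path must be readable by the server process, not just by the client.

```python
from pathlib import Path


def local_video_message(path, prompt):
    """Build a user message that points at a local video through a file:// URL."""
    # as_uri() needs an absolute path, so resolve it first.
    url = Path(path).resolve().as_uri()
    return {
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": url}},
            {"type": "text", "text": prompt},
        ],
    }
```

Pass the result as the single element of `messages` in the same `client.chat.completions.create(...)` call used for remote videos.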
Video Processing Details
GLM-GA uses a dedicated GlmgaVideoProcessor that differs from GLM-4.6V video processing in three ways:
- Fixed 2 fps sampling (GLM-4.6V uses dynamic fps based on duration)
- Up to 640 frames per video
- Frame upsampling uses math.floor alignment, matching the HuggingFace reference
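To see how the fixed rate and the frame cap interact: at 2 fps, the 640-frame budget is reached at 320 seconds of video. The sketch below illustrates the arithmetic only and is not the actual GlmgaVideoProcessor logic. `sample_frame_indices` is a hypothetical function, and spreading the capped samples evenly over longer clips is an assumption about how the cap is applied.

```python
import math


def sample_frame_indices(total_frames, native_fps, target_fps=2.0, max_frames=640):
    """Hypothetical sketch: choose source-frame indices at a fixed rate, capped."""
    if total_frames <= 0 or native_fps <= 0:
        return []
    duration = total_frames / native_fps
    # Frames wanted at the fixed target rate (at least one).
    wanted = max(1, math.floor(duration * target_fps))
    # Apply the cap; how the real processor handles longer clips is assumed here.
    n = min(wanted, max_frames)
    # Spread n samples evenly and floor each position to a valid source index.
    step = total_frames / n
    return [min(total_frames - 1, math.floor(i * step)) for i in range(n)]


short = sample_frame_indices(total_frames=300, native_fps=30)      # 10 s clip
long = sample_frame_indices(total_frames=108000, native_fps=30)    # 1 h clip
print(len(short), len(long))
```

Under these assumptions, a 10-second clip yields 20 frames, while a one-hour clip is limited to 640 frames.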
Troubleshooting
- Long-context memory: At 128K context, tune --max-num-batched-tokens and --gpu-memory-utilization to prevent OOM.
- Video loading errors: Ensure OpenCV or PyAV is installed for video decoding.
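For video loading errors, a quick first check is whether either decoding library is importable in the environment that runs the server. This is a minimal sketch. `available_video_decoders` is a hypothetical helper, and it assumes the usual import names `cv2` for OpenCV and `av` for PyAV.

```python
import importlib.util


def available_video_decoders():
    """Report which video decoding packages can be imported in this environment."""
    candidates = {"OpenCV": "cv2", "PyAV": "av"}  # display name -> import name
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in candidates.items()
    }


for name, found in available_video_decoders().items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If both are reported missing, install one of them in the same virtual environment as vLLM before retrying the request.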
References
- Model card
- GLM-4.6V-Flash (base architecture)
- vLLM multimodal inputs guide