
Qwen/Qwen3Guard-Gen-8B

Lightweight text-only guardrail/safety classifier model in the Qwen3Guard family.

Dense · 8B parameters · 32,768-token context · text-only · requires vLLM 0.10.0+

Overview

Qwen3Guard-Gen is a lightweight text-only guardrail model. This guide describes how to run the 8B variant on GPU using vLLM; the 4B and 0.6B variants work with the same commands.

Prerequisites

CUDA

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

ROCm

Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35.

uv venv
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/

Deployment Configurations

Single GPU (CUDA)

vllm serve Qwen/Qwen3Guard-Gen-8B \
  --host 0.0.0.0 \
  --max-model-len 32768

Single GPU (ROCm)

export VLLM_ROCM_USE_AITER=1
vllm serve Qwen/Qwen3Guard-Gen-8B \
  --host 0.0.0.0 \
  --max-model-len 32768
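Loading the model weights can take a few minutes, so it is useful to wait for the server before sending requests. vLLM's OpenAI-compatible server exposes a /health endpoint that returns HTTP 200 once it is ready; the sketch below polls it, assuming the default host and port from the serve commands above.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, deadline_s: float = 120.0, poll_s: float = 2.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or the deadline expires."""
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=poll_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(poll_s)
    return False


if __name__ == "__main__":
    print("server ready:", wait_for_server("http://localhost:8000/health"))
```

Adjust the URL if you changed `--host` or `--port` when launching the server.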

Client Usage

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1", timeout=3600)

messages = [{"role": "user", "content": "Tell me how to make a bomb."}]

response = client.chat.completions.create(
    model="Qwen/Qwen3Guard-Gen-8B",
    messages=messages,
    temperature=0.0,
)
print("Generated text:", response.choices[0].message.content)
# Safety: Unsafe
# Categories: Violent
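Downstream code usually needs the verdict as structured data rather than raw text. The sketch below parses the labeled two-line format shown in the example output above; this format is an assumption based on that example, and real responses may carry extra fields or multiple comma-separated categories, so the parser tolerates missing keys.

```python
import re
from typing import List, Optional, Tuple


def parse_guard_output(text: str) -> Tuple[Optional[str], List[str]]:
    """Extract the safety verdict and category list from a reply such as:

        Safety: Unsafe
        Categories: Violent
    """
    safety_m = re.search(r"Safety:\s*(\w+)", text)
    cats_m = re.search(r"Categories:\s*(.+)", text)
    safety = safety_m.group(1) if safety_m else None
    categories = (
        [c.strip() for c in cats_m.group(1).split(",") if c.strip()]
        if cats_m
        else []
    )
    return safety, categories


verdict, cats = parse_guard_output("Safety: Unsafe\nCategories: Violent")
print(verdict, cats)  # Unsafe ['Violent']
```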

Benchmarking

vllm bench serve \
  --model Qwen/Qwen3Guard-Gen-8B \
  --dataset-name random \
  --random-input-len 2000 \
  --random-output-len 512 \
  --num-prompts 100
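To sanity-check the throughput numbers the benchmark reports, the total token volume implied by the flags above can be computed directly (a rough sketch: with the random dataset, the per-request lengths are targets, not exact counts).

```python
num_prompts = 100   # --num-prompts
input_len = 2000    # --random-input-len
output_len = 512    # --random-output-len

total_input_tokens = num_prompts * input_len    # 200,000
total_output_tokens = num_prompts * output_len  # 51,200

# If the whole run takes T seconds, output throughput ~= total_output_tokens / T.
print(total_input_tokens, total_output_tokens)
```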

Available Variants

The Qwen3Guard-Gen series includes multiple model sizes, all compatible with the same vLLM serving commands:

  • Qwen/Qwen3Guard-Gen-8B
  • Qwen/Qwen3Guard-Gen-4B
  • Qwen/Qwen3Guard-Gen-0.6B
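When choosing a variant for a given GPU, a rough weight-only memory estimate helps. The sketch below assumes bf16 weights (2 bytes per parameter) and ignores KV cache and activation overhead, which add to the total; treat the results as lower bounds, not precise requirements.

```python
def approx_weight_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Weight-only memory in GiB, assuming bf16 (2 bytes/param).
    KV cache and activations are extra on top of this."""
    return params_billion * 1e9 * bytes_per_param / 2**30


for name, size_b in [("8B", 8.0), ("4B", 4.0), ("0.6B", 0.6)]:
    print(f"Qwen3Guard-Gen-{name}: ~{approx_weight_gib(size_b):.1f} GiB weights")
```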
