baidu/ERNIE-4.5-VL-28B-A3B-PT
Baidu ERNIE 4.5 VL MoE vision-language models (28B-A3B, 424B-A47B) with heterogeneous text/vision experts
Overview
ERNIE 4.5 VL is Baidu's multimodal MoE model with heterogeneous experts (separate text and vision experts). Because of the heterogeneous architecture, torch.compile and CUDA graphs are not supported.
- baidu/ERNIE-4.5-VL-28B-A3B-PT — 28B total / 3B active (1x80GB)
- baidu/ERNIE-4.5-VL-424B-A47B-PT — 424B total / 47B active (8x140GB BF16, 8x80GB FP8+offload, or 16x80GB BF16)
Prerequisites
- vLLM: ERNIE 4.5 VL support was added to the main branch recently; install the latest release
- Hardware requirements depend on the variant (see launch commands below)
Install vLLM (CUDA)
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
Install vLLM (AMD ROCm MI300X/MI325X/MI355X)
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
Launch commands
28B on 1x80GB:
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-PT --trust-remote-code
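Once the server is up, you can query it through the OpenAI-compatible API. A minimal client sketch, assuming the default endpoint `http://localhost:8000/v1`; the image URL and prompt are placeholders:

```python
def build_messages(image_url: str, prompt: str) -> list:
    """Build an OpenAI-style multimodal chat message list (image + text)."""
    return [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": prompt},
        ],
    }]

if __name__ == "__main__":
    # Requires `pip install openai`; api_key is unused by a local vLLM server.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="baidu/ERNIE-4.5-VL-28B-A3B-PT",
        messages=build_messages("https://example.com/cat.png",
                                "Describe this image."),
    )
    print(resp.choices[0].message.content)
```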
424B BF16 on 8x140GB:
vllm serve baidu/ERNIE-4.5-VL-424B-A47B-PT \
--trust-remote-code \
--tensor-parallel-size 8
424B with FP8 + CPU offload on 8x80GB (testing only):
vllm serve baidu/ERNIE-4.5-VL-424B-A47B-PT \
--trust-remote-code \
--tensor-parallel-size 8 \
--quantization fp8 \
--cpu-offload-gb 50
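Why offloading is needed here: back-of-envelope arithmetic, assuming ~1 byte per parameter at FP8 and weights sharded evenly across the tensor-parallel ranks (a rough sketch, not an exact memory model; KV cache and activations come on top):

```python
# Rough per-GPU memory math for ERNIE-4.5-VL-424B-A47B-PT with FP8 + offload.
TOTAL_PARAMS_GB = 424   # ~424B params * ~1 byte/param at FP8
TP_SIZE = 8             # --tensor-parallel-size 8
GPU_MEM_GB = 80         # per-GPU HBM
CPU_OFFLOAD_GB = 50     # --cpu-offload-gb 50 (per GPU)

weights_per_gpu_gb = TOTAL_PARAMS_GB / TP_SIZE       # ~53 GB of weights per GPU
effective_capacity_gb = GPU_MEM_GB + CPU_OFFLOAD_GB  # 130 GB addressable per GPU

print(f"weights per GPU: {weights_per_gpu_gb:.0f} GB")
print(f"effective capacity per GPU: {effective_capacity_gb} GB")
```

The offloaded weights are streamed from host memory at runtime, which is why this configuration is suitable for testing rather than production serving.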
28B on AMD MI300X+:
VLLM_ROCM_USE_AITER=1 SAFETENSORS_FAST_GPU=1 \
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-PT \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--disable-log-requests \
--trust-remote-code
Benchmarking
vllm bench serve \
--model baidu/ERNIE-4.5-VL-28B-A3B-PT \
--dataset-name random \
--random-input-len 8000 --random-output-len 1000 \
--request-rate 10 --num-prompts 16 --ignore-eos --trust-remote-code
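The flags above send 16 random prompts of ~8000 input tokens each and force 1000 output tokens per request via `--ignore-eos`. A quick sketch of the total token workload this implies (configured targets; actual sampled lengths may vary slightly):

```python
# Token workload implied by the benchmark flags above.
NUM_PROMPTS = 16
INPUT_LEN = 8000    # --random-input-len
OUTPUT_LEN = 1000   # --random-output-len

input_tokens = NUM_PROMPTS * INPUT_LEN    # prefill tokens
output_tokens = NUM_PROMPTS * OUTPUT_LEN  # decode tokens
total_tokens = input_tokens + output_tokens
print(input_tokens, output_tokens, total_tokens)
```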