baidu/ERNIE-4.5-21B-A3B-PT
Baidu ERNIE 4.5 MoE text models (21B-A3B, 300B-A47B) with BF16 and FP8 support plus ERNIE-MTP speculative decoding
Overview
ERNIE 4.5 is Baidu's MoE language model family. This recipe covers the text-only variants:
- baidu/ERNIE-4.5-21B-A3B-PT — 21B total / 3B active (fits on 1x 80GB GPU)
- baidu/ERNIE-4.5-300B-A47B-PT — 300B total / 47B active (8x 80GB with FP8, or 16x 80GB with BF16)
Both support ERNIE-MTP speculative decoding via --speculative-config.
Prerequisites
- transformers >= 4.54.0
- vLLM >= 0.10.1
- Hardware depends on variant (see above)
Install vLLM
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
Launch commands
21B on 1x80GB GPU:
vllm serve baidu/ERNIE-4.5-21B-A3B-PT
300B on 8x80GB with vLLM FP8 online quantization:
vllm serve baidu/ERNIE-4.5-300B-A47B-PT \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.95 \
--quantization fp8
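The GPU counts follow from rough weight-footprint arithmetic: at 1 byte per parameter, FP8 weights fit in an 8x 80GB node with headroom for KV cache, while BF16 weights alone nearly fill it. A back-of-the-envelope sketch (illustrative only; ignores KV cache, activations, and CUDA overhead):

```python
# Approximate weight memory for ERNIE-4.5-300B-A47B at different precisions.
# This is a rough sizing sketch, not an exact measurement.
TOTAL_PARAMS = 300e9

def weight_gb(bytes_per_param: float) -> float:
    """Approximate weight footprint in GB for the full 300B model."""
    return TOTAL_PARAMS * bytes_per_param / 1e9

bf16 = weight_gb(2)  # ~600 GB: barely under 8x80GB = 640 GB, leaving no KV-cache room
fp8 = weight_gb(1)   # ~300 GB: fits comfortably on a single 8x80GB node
print(f"BF16 weights: ~{bf16:.0f} GB, FP8 weights: ~{fp8:.0f} GB")
```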
300B on 16x80GB in native BF16 (multi-node: start a Ray cluster spanning both nodes first, then launch from the head node):
vllm serve baidu/ERNIE-4.5-300B-A47B-PT --tensor-parallel-size 16
ERNIE-MTP speculative decoding (21B example):
vllm serve baidu/ERNIE-4.5-21B-A3B-PT \
--speculative-config '{"method":"ernie_mtp","model":"baidu/ERNIE-4.5-21B-A3B-PT","num_speculative_tokens":1}'
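Hand-editing the quoted JSON for --speculative-config is error-prone; one option is to build it programmatically and let json.dumps handle the escaping. A minimal sketch mirroring the 21B example above:

```python
import json
import shlex

# Speculative-decoding config, matching the values in the launch command above.
spec_config = {
    "method": "ernie_mtp",
    "model": "baidu/ERNIE-4.5-21B-A3B-PT",
    "num_speculative_tokens": 1,
}

# Assemble the full serve command with properly quoted JSON.
cmd = [
    "vllm", "serve", "baidu/ERNIE-4.5-21B-A3B-PT",
    "--speculative-config", json.dumps(spec_config),
]
print(shlex.join(cmd))
```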
Client Usage
The server exposes the standard OpenAI-compatible API; use the HF repo name (e.g. baidu/ERNIE-4.5-21B-A3B-PT) as the model ID.
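A minimal stdlib-only client sketch, assuming the default vLLM listen address (localhost:8000; adjust if you passed --host/--port):

```python
import json
from urllib import request

# Assumed default vLLM address; change if the server was started with --host/--port.
BASE_URL = "http://localhost:8000/v1"

def build_payload(prompt: str, model: str = "baidu/ERNIE-4.5-21B-A3B-PT") -> dict:
    """OpenAI-style chat-completions body; the model ID is the HF repo name."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

def chat(prompt: str) -> str:
    """POST to /v1/chat/completions and return the first choice's text."""
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Usage (with the server running): `print(chat("Summarize MoE routing in one sentence."))`. The official `openai` Python client works the same way by pointing `base_url` at the server.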
Benchmarking
vllm bench serve \
--model baidu/ERNIE-4.5-21B-A3B-PT \
--dataset-name random \
--random-input-len 8000 --random-output-len 1000 \
--request-rate 10 --num-prompts 16 --ignore-eos
Test configurations: prompt-heavy (8k/1k), decode-heavy (1k/8k), balanced (1k/1k).
Vary --num-prompts across 1, 16, 32, 64, 128, 256, 512.
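The full sweep (three input/output shapes crossed with the --num-prompts values above) can be generated rather than typed by hand; a small sketch that prints one `vllm bench serve` command per run:

```python
# Benchmark sweep generator: three traffic shapes x seven concurrency levels,
# matching the configurations described above.
SHAPES = {
    "prompt-heavy": (8000, 1000),
    "decode-heavy": (1000, 8000),
    "balanced": (1000, 1000),
}
NUM_PROMPTS = [1, 16, 32, 64, 128, 256, 512]

def bench_cmd(in_len: int, out_len: int, n: int) -> str:
    """Render a single `vllm bench serve` invocation."""
    return (
        "vllm bench serve --model baidu/ERNIE-4.5-21B-A3B-PT "
        "--dataset-name random "
        f"--random-input-len {in_len} --random-output-len {out_len} "
        f"--request-rate 10 --num-prompts {n} --ignore-eos"
    )

for name, (i, o) in SHAPES.items():
    for n in NUM_PROMPTS:
        print(f"# {name}, num-prompts={n}")
        print(bench_cmd(i, o, n))
```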