baidu/ERNIE-4.5-21B-A3B-PT
Baidu ERNIE 4.5 MoE text models (21B-A3B, 300B-A47B) with BF16 and FP8 support plus ERNIE-MTP speculative decoding
Overview
ERNIE 4.5 is Baidu's MoE language model family. This recipe covers the text-only variants:
- baidu/ERNIE-4.5-21B-A3B-PT — 21B total / 3B active (fits on 1x 80GB GPU)
- baidu/ERNIE-4.5-300B-A47B-PT — 300B total / 47B active (8x 80GB with FP8, or 16x 80GB with BF16)
Both support ERNIE-MTP speculative decoding via --speculative-config.
Prerequisites
- transformers >= 4.54.0
- vLLM >= 0.10.1
- Hardware depends on variant (see above)
Install vLLM
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
Launch commands
21B on 1x80GB GPU:
vllm serve baidu/ERNIE-4.5-21B-A3B-PT
300B on 8x80GB with vLLM FP8 online quantization:
vllm serve baidu/ERNIE-4.5-300B-A47B-PT \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.95 \
--quantization fp8
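The GPU counts follow from rough weight-footprint arithmetic: at 1 byte per parameter, FP8 weights fit in an 8x 80GB node with headroom for KV cache, while BF16 weights alone nearly fill it. A back-of-the-envelope sketch (illustrative only; ignores KV cache, activations, and CUDA overhead):

```python
# Approximate weight memory for ERNIE-4.5-300B-A47B at different precisions.
# This is a rough sizing sketch, not an exact measurement.
TOTAL_PARAMS = 300e9

def weight_gb(bytes_per_param: float) -> float:
    """Approximate weight footprint in GB for the full 300B model."""
    return TOTAL_PARAMS * bytes_per_param / 1e9

bf16 = weight_gb(2)  # ~600 GB: barely under 8x80GB = 640 GB, leaving no KV-cache room
fp8 = weight_gb(1)   # ~300 GB: fits comfortably on a single 8x80GB node
print(f"BF16 weights: ~{bf16:.0f} GB, FP8 weights: ~{fp8:.0f} GB")
```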
300B on 16x80GB in native BF16 (multi-node: start a Ray cluster spanning both nodes first, then launch from the head node):
vllm serve baidu/ERNIE-4.5-300B-A47B-PT --tensor-parallel-size 16
ERNIE-MTP speculative decoding (21B example):
vllm serve baidu/ERNIE-4.5-21B-A3B-PT \
--speculative-config '{"method":"ernie_mtp","model":"baidu/ERNIE-4.5-21B-A3B-PT","num_speculative_tokens":1}'
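Hand-editing the quoted JSON for --speculative-config is error-prone; one option is to build it programmatically and let json.dumps handle the escaping. A minimal sketch mirroring the 21B example above:

```python
import json
import shlex

# Speculative-decoding config, matching the values in the launch command above.
spec_config = {
    "method": "ernie_mtp",
    "model": "baidu/ERNIE-4.5-21B-A3B-PT",
    "num_speculative_tokens": 1,
}

# Assemble the full serve command with properly quoted JSON.
cmd = [
    "vllm", "serve", "baidu/ERNIE-4.5-21B-A3B-PT",
    "--speculative-config", json.dumps(spec_config),
]
print(shlex.join(cmd))
```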
Client Usage
The server exposes the standard OpenAI-compatible API; use the HF repo name (e.g. baidu/ERNIE-4.5-21B-A3B-PT) as the model ID.
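A minimal stdlib-only client sketch, assuming the default vLLM listen address (localhost:8000; adjust if you passed --host/--port):

```python
import json
from urllib import request

# Assumed default vLLM address; change if the server was started with --host/--port.
BASE_URL = "http://localhost:8000/v1"

def build_payload(prompt: str, model: str = "baidu/ERNIE-4.5-21B-A3B-PT") -> dict:
    """OpenAI-style chat-completions body; the model ID is the HF repo name."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

def chat(prompt: str) -> str:
    """POST to /v1/chat/completions and return the first choice's text."""
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Usage (with the server running): `print(chat("Summarize MoE routing in one sentence."))`. The official `openai` Python client works the same way by pointing `base_url` at the server.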
Benchmarking
vllm bench serve \
--model baidu/ERNIE-4.5-21B-A3B-PT \
--dataset-name random \
--random-input-len 8000 --random-output-len 1000 \
--request-rate 10 --num-prompts 16 --ignore-eos
Test configurations: prompt-heavy (8k/1k), decode-heavy (1k/8k), balanced (1k/1k).
Vary --num-prompts across 1, 16, 32, 64, 128, 256, 512.
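The full sweep (three input/output shapes crossed with the --num-prompts values above) can be generated rather than typed by hand; a small sketch that prints one `vllm bench serve` command per run:

```python
# Benchmark sweep generator: three traffic shapes x seven concurrency levels,
# matching the configurations described above.
SHAPES = {
    "prompt-heavy": (8000, 1000),
    "decode-heavy": (1000, 8000),
    "balanced": (1000, 1000),
}
NUM_PROMPTS = [1, 16, 32, 64, 128, 256, 512]

def bench_cmd(in_len: int, out_len: int, n: int) -> str:
    """Render a single `vllm bench serve` invocation."""
    return (
        "vllm bench serve --model baidu/ERNIE-4.5-21B-A3B-PT "
        "--dataset-name random "
        f"--random-input-len {in_len} --random-output-len {out_len} "
        f"--request-rate 10 --num-prompts {n} --ignore-eos"
    )

for name, (i, o) in SHAPES.items():
    for n in NUM_PROMPTS:
        print(f"# {name}, num-prompts={n}")
        print(bench_cmd(i, o, n))
```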