
stepfun-ai/Step-3.5-Flash

Production-grade reasoning MoE model (~196B total / 11B active parameters) with hybrid attention schedules, sliding-window attention (SWA) compensation, and multi-token prediction for low-latency, long-context inference

MoE · 196B total / 11B active · 262,144 context length · vLLM 0.11.0+ · text

Overview

Step-3.5-Flash is a production-grade reasoning model from StepFun. Highlights:

  • Hybrid attention schedules with compensation for sliding-window attention (SWA)
  • Sparse MoE structure (196B total parameters, 11B active)
  • Multi-token prediction mechanism for faster inference
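
The sparse-MoE point above means only a small top-k subset of experts runs for each token, which is how 196B total parameters yield only 11B active ones. A minimal routing sketch in plain Python — the expert count, top-k value, and function names here are illustrative, not Step-3.5-Flash's actual configuration:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_topk(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights.

    Only these k experts run for this token; the rest stay idle, which is
    what keeps the active parameter count far below the total.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

# 4 toy experts, top-2 routing: only experts 1 and 3 are active for this token.
weights = route_topk([0.1, 2.0, -1.0, 1.5], k=2)
```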

Available precisions: the standard checkpoint and an FP8 variant (see the FP8 notes below).

Prerequisites

  • vLLM version: 0.11.0 or later (latest stable recommended)
  • Hardware: 4x H200, H20, or B200 GPUs

Install vLLM

uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend auto

Launching the Server

Tensor Parallel

vllm serve stepfun-ai/Step-3.5-Flash \
    --tensor-parallel-size 4 \
    --reasoning-parser step3p5 \
    --tool-call-parser step3p5 \
    --enable-auto-tool-choice \
    --trust-remote-code

Note: The FP8 version cannot run with TP4; use DP4 with expert parallelism instead.

Data Parallel + Expert Parallel

vllm serve stepfun-ai/Step-3.5-Flash \
    --data-parallel-size 4 \
    --enable-expert-parallel \
    --reasoning-parser step3p5 \
    --tool-call-parser step3p5 \
    --enable-auto-tool-choice \
    --trust-remote-code
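
Either launch exposes an OpenAI-compatible API (default port 8000). A client-side sketch that builds a chat-completions request body — the helper name and default values are illustrative, not part of vLLM:

```python
import json

def chat_payload(prompt, model="stepfun-ai/Step-3.5-Flash", max_tokens=512):
    """Build the JSON body for POST http://localhost:8000/v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = json.dumps(chat_payload("Explain sliding-window attention in one sentence."))
```

Send the body with curl to http://localhost:8000/v1/chat/completions, or point the official openai Python client at base_url="http://localhost:8000/v1".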

Enabling MTP Speculative Decoding

vllm serve stepfun-ai/Step-3.5-Flash \
    --tensor-parallel-size 4 \
    --reasoning-parser step3p5 \
    --tool-call-parser step3p5 \
    --enable-auto-tool-choice \
    --trust-remote-code \
    --hf-overrides '{"num_nextn_predict_layers": 1}' \
    --speculative-config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}'
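
With MTP speculative decoding, the extra prediction head drafts tokens that the main model then verifies, so more than one token can land per forward pass. A toy greedy-verification sketch — this is illustrative, not vLLM's actual implementation, which verifies drafts against the target model's distribution:

```python
def accept_draft(draft_tokens, target_tokens):
    """Greedy speculative verification: accept the longest prefix of the
    draft that matches the target model's own argmax tokens, then take the
    target's token at the first mismatch (so one token always lands)."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # target wins at the first mismatch
            break
    else:
        # Every draft token matched; the target supplies one bonus token.
        if len(target_tokens) > len(draft_tokens):
            accepted.append(target_tokens[len(draft_tokens)])
    return accepted

# With num_speculative_tokens=1 the MTP head drafts one token per step:
accept_draft([42], [42, 7])   # draft accepted plus bonus token -> [42, 7]
accept_draft([42], [9, 7])    # draft rejected, target token lands -> [9]
```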

Benchmarking

vllm bench serve \
  --backend vllm \
  --model stepfun-ai/Step-3.5-Flash \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --max-concurrency 10 \
  --num-prompts 100
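
For context on the run above, the random dataset pushes a fixed token budget through the server. A quick sketch of the totals (helper name is illustrative):

```python
def bench_totals(num_prompts=100, input_len=2048, output_len=1024):
    """Token totals for the random-dataset benchmark configured above."""
    prefill = num_prompts * input_len   # prompt tokens to ingest
    decode = num_prompts * output_len   # tokens to generate
    return prefill, decode, prefill + decode

prefill, decode, total = bench_totals()
# 204800 prefill tokens and 102400 decode tokens -> 307200 total,
# processed at most 10 requests at a time (--max-concurrency 10).
```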

Troubleshooting

  • MoE kernel tuning: See tune-moe-kernel to tune Triton kernels for your hardware.
  • FP8 DeepGEMM: For FP8, install DeepGEMM via install_deepgemm.sh.
  • B200 FlashInfer FP8 MoE error: If you see "routing_logits must be bfloat16" when serving FP8 on B200, set export VLLM_USE_FLASHINFER_MOE_FP8=0 as a workaround.
  • FP8 + TP4 incompatibility: Use DP4+EP instead.

References