stepfun-ai/Step-3.7-Flash
Production-grade vision-language MoE (~198B total / 11B active parameters) combining a 196B sparse language backbone with a 1.8B perception encoder, hybrid SWA/Global attention, and 3-way Multi-Token Prediction
Sparse MoE VLM with hybrid attention and 3-layer MTP speculative decoding
Guide
Overview
Step-3.7-Flash is a 198B-parameter sparse Mixture-of-Experts vision-language model from StepFun, pairing a 196B language backbone with a 1.8B perception encoder. It activates ~11B parameters per token and supports a 256k context window with three selectable reasoning levels (low / medium / high).
Key highlights:
- Multimodal Understanding: Native vision encoder for single and multi-image inputs alongside text
- Hybrid Attention Architecture: Interleaves Sliding Window Attention (512-token window) and Global Attention at a 3:1 ratio
- Sparse MoE: 11B active parameters out of 198B total
- Multi-Layer MTP: 3-way Multi-Token Prediction (MTP-3) for low-latency reasoning chains
Available precisions:
- stepfun-ai/Step-3.7-Flash (BF16)
- stepfun-ai/Step-3.7-Flash-FP8
- stepfun-ai/Step-3.7-Flash-NVFP4 (Blackwell only)
Prerequisites
- vLLM version: nightly (the model registry hasn't shipped in a stable release yet)
- Hardware: 8xH200/B200 for BF16 and FP8; 4xB200 for NVFP4
Install vLLM (nightly)
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
--extra-index-url https://wheels.vllm.ai/nightly
Or via Docker:
docker pull vllm/vllm-openai:stepfun37
Launching the Server
BF16
vllm serve stepfun-ai/Step-3.7-Flash \
--served-model-name step3p7-flash \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--disable-cascade-attn \
--reasoning-parser step3p5 \
--tool-call-parser step3p5 \
--enable-auto-tool-choice \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
--trust-remote-code
FP8
vllm serve stepfun-ai/Step-3.7-Flash-FP8 \
--served-model-name step3p7-flash \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--disable-cascade-attn \
--reasoning-parser step3p5 \
--tool-call-parser step3p5 \
--enable-auto-tool-choice \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
--trust-remote-code
NVFP4 (Blackwell only)
Requires modelopt quantization and FP8 KV cache alignment.
vllm serve stepfun-ai/Step-3.7-Flash-NVFP4 \
--served-model-name step3p7 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--enable-expert-parallel \
--quantization modelopt \
--kv-cache-dtype fp8 \
--reasoning-parser step3p5 \
--tool-call-parser step3p5 \
--enable-auto-tool-choice \
--async-scheduling \
--trust-remote-code
Benchmarking
vllm bench serve \
--backend vllm \
--model stepfun-ai/Step-3.7-Flash \
--endpoint /v1/completions \
--dataset-name random \
--random-input 2048 \
--random-output 1024 \
--max-concurrency 10 \
--num-prompt 100
Troubleshooting
- MoE kernel tuning: See tune-moe-kernel to tune Triton kernels for your hardware.
- NVFP4 + TP > 4: The author recommends TP4+EP for NVFP4. Higher TP isn't validated.
- Cascade attention: Always pass
--disable-cascade-attn— the hybrid SWA/GA schedule is not compatible with cascade attention in vLLM.