vLLM/Recipes
StepFun

stepfun-ai/Step-3.7-Flash

Production-grade vision-language MoE (~198B total / 11B active parameters) combining a 196B sparse language backbone with a 1.8B perception encoder, hybrid SWA/Global attention, and 3-way Multi-Token Prediction

Sparse MoE VLM with hybrid attention and 3-layer MTP speculative decoding

moe198B / 11B262,144 ctxvLLM nightly+multimodal
Guide

Overview

Step-3.7-Flash is a 198B-parameter sparse Mixture-of-Experts vision-language model from StepFun, pairing a 196B language backbone with a 1.8B perception encoder. It activates ~11B parameters per token and supports a 256k context window with three selectable reasoning levels (low / medium / high).

Key highlights:

  • Multimodal Understanding: Native vision encoder for single and multi-image inputs alongside text
  • Hybrid Attention Architecture: Interleaves Sliding Window Attention (512-token window) and Global Attention at a 3:1 ratio
  • Sparse MoE: 11B active parameters out of 198B total
  • Multi-Layer MTP: 3-way Multi-Token Prediction (MTP-3) for low-latency reasoning chains

Available precisions:

Prerequisites

  • vLLM version: nightly (the model registry hasn't shipped in a stable release yet)
  • Hardware: 8xH200/B200 for BF16 and FP8; 4xB200 for NVFP4

Install vLLM (nightly)

uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
    --extra-index-url https://wheels.vllm.ai/nightly

Or via Docker:

docker pull vllm/vllm-openai:stepfun37

Launching the Server

BF16

vllm serve stepfun-ai/Step-3.7-Flash \
    --served-model-name step3p7-flash \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --disable-cascade-attn \
    --reasoning-parser step3p5 \
    --tool-call-parser step3p5 \
    --enable-auto-tool-choice \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
    --trust-remote-code

FP8

vllm serve stepfun-ai/Step-3.7-Flash-FP8 \
    --served-model-name step3p7-flash \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --disable-cascade-attn \
    --reasoning-parser step3p5 \
    --tool-call-parser step3p5 \
    --enable-auto-tool-choice \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
    --trust-remote-code

NVFP4 (Blackwell only)

Requires modelopt quantization and FP8 KV cache alignment.

vllm serve stepfun-ai/Step-3.7-Flash-NVFP4 \
    --served-model-name step3p7 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --enable-expert-parallel \
    --quantization modelopt \
    --kv-cache-dtype fp8 \
    --reasoning-parser step3p5 \
    --tool-call-parser step3p5 \
    --enable-auto-tool-choice \
    --async-scheduling \
    --trust-remote-code

Benchmarking

vllm bench serve \
    --backend vllm \
    --model stepfun-ai/Step-3.7-Flash \
    --endpoint /v1/completions \
    --dataset-name random \
    --random-input 2048 \
    --random-output 1024 \
    --max-concurrency 10 \
    --num-prompt 100

Troubleshooting

  • MoE kernel tuning: See tune-moe-kernel to tune Triton kernels for your hardware.
  • NVFP4 + TP > 4: The author recommends TP4+EP for NVFP4. Higher TP isn't validated.
  • Cascade attention: Always pass --disable-cascade-attn — the hybrid SWA/GA schedule is not compatible with cascade attention in vLLM.

References