internlm/Intern-S1
Intern-S1 vision-language model from Shanghai AI Lab with BF16/FP8 variants and thinking/non-thinking modes
Overview
Intern-S1 is a vision-language model developed by Shanghai AI Laboratory. It supports thinking and non-thinking modes via chat-template kwargs and ships in BF16 and FP8 variants.
Prerequisites
- Hardware: 8xH800 (80GB) for BF16, 4xH800 for FP8, or 4-8x MI300X/MI325X/MI355X
- vLLM >= 0.10.0
Install vLLM (CUDA)
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
Install vLLM (AMD ROCm)
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.14.1/rocm700
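After installing, you can confirm the installed vLLM satisfies the documented minimum of 0.10.0. The helper below is a minimal sketch that compares dotted version strings numerically (the function name and suffix handling are illustrative, not part of vLLM):

```python
# Hedged sketch: check that a vLLM version string (e.g. from
# `python -c "import vllm; print(vllm.__version__)"`) meets the
# documented minimum of 0.10.0. `meets_minimum` is a hypothetical helper.
def meets_minimum(version: str, minimum: str = "0.10.0") -> bool:
    """Compare dotted version strings numerically, ignoring local/rc suffixes."""
    def parts(v: str) -> tuple:
        core = v.split("+")[0].split("rc")[0]  # drop "+cu128" / "rc1" suffixes
        return tuple(int(p) for p in core.split(".") if p.isdigit())
    return parts(version) >= parts(minimum)

print(meets_minimum("0.10.0"))  # True
print(meets_minimum("0.9.2"))   # False
```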
Launch commands
BF16 on 8xH800:
vllm serve internlm/Intern-S1 \
--trust-remote-code \
--tensor-parallel-size 8 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_r1 \
--tool-call-parser internlm
FP8 on 4xH800:
vllm serve internlm/Intern-S1-FP8 \
--trust-remote-code \
--tensor-parallel-size 4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_r1 \
--tool-call-parser internlm
FP8 on 8xMI300X/MI325X:
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=0
vllm serve internlm/Intern-S1-FP8 \
--trust-remote-code --tensor-parallel-size 8 \
--enable-auto-tool-choice --reasoning-parser deepseek_r1 --tool-call-parser internlm
FP8 on 8xMI355X: set only VLLM_ROCM_USE_AITER=1 (no need to disable AITER MoE).
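Because the launch commands above pass --enable-auto-tool-choice and --tool-call-parser internlm, the server accepts OpenAI-style tool definitions. The sketch below only constructs a request payload; the get_weather tool and its schema are illustrative assumptions, not part of the model:

```python
# Hedged sketch: build an OpenAI-style chat request with a tool definition.
# The tool name/schema here are hypothetical; pass the resulting dict as
# keyword arguments to client.chat.completions.create(**req).
def build_tool_request(model: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": "What is the weather in Shanghai?"}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool for illustration
                    "description": "Look up current weather for a city",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }

req = build_tool_request("internlm/Intern-S1")
print(req["tools"][0]["function"]["name"])  # get_weather
```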
Switching Between Thinking and Non-Thinking Modes
from openai import OpenAI
client = OpenAI(api_key="YOUR_API_KEY", base_url="http://0.0.0.0:8000/v1")
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
model=model_name,
messages=[{"role": "user", "content": "9.11 and 9.8, which is greater?"}],
temperature=0.8, top_p=0.8,
extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response)
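Since Intern-S1 is a vision-language model, images can also be sent through the same OpenAI-compatible endpoint as image_url content parts. The helper below is a minimal sketch that only builds the messages payload (the image URL is a placeholder); pass the result as the messages argument to client.chat.completions.create as in the snippet above:

```python
# Hedged sketch: construct a multimodal message for the OpenAI-compatible API.
# Only builds the payload; it is not sent anywhere here.
def build_vision_messages(image_url: str, question: str) -> list:
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_vision_messages(
    "https://example.com/cat.png",  # placeholder URL for illustration
    "What is in this image?",
)
print(messages[0]["content"][0]["type"])  # image_url
```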
Troubleshooting
ValueError: No available memory for the cache blocks. — add --gpu-memory-utilization 0.95 to the serve command.