internlm/Intern-S2-Preview
Scientific multimodal MoE (36B total / 3B active) continually pre-trained from Qwen3.5 — hybrid linear/full attention, 262K context, MTP-accelerated reasoning. BF16 and FP8 checkpoints.
36B-A3B scientific multimodal foundation model — single-node BF16 with MTP
Overview
Intern-S2-Preview is a scientific multimodal foundation model from Shanghai AI Laboratory, continually pre-trained from Qwen3.5. It packs 36B total / 3B active parameters across 256 experts with hybrid linear/full attention, supports a 262K context, and ships with built-in shared-weight MTP for fast reasoning.
Beyond chat and reasoning, it adds vision and time-series modalities and improves agent capabilities for scientific workflows.
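To sanity-check these numbers against the shipped checkpoint, you can pull config.json from the Hub without downloading weights. A minimal sketch, assuming the config nests its language-model fields under text_config as the YaRN override below suggests:

import json
from huggingface_hub import hf_hub_download

# Fetch only the config file, not the weights.
path = hf_hub_download(repo_id="internlm/Intern-S2-Preview", filename="config.json")
with open(path) as f:
    cfg = json.load(f)

# Assumption: language-model fields live under text_config, mirroring the
# --hf-overrides payload shown later; fall back to the top level otherwise.
text_cfg = cfg.get("text_config", cfg)
print("architectures:", cfg.get("architectures"))
print("max_position_embeddings:", text_cfg.get("max_position_embeddings"))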
Prerequisites
- vLLM version: nightly build with InternS2PreviewForConditionalGeneration support (PR #42705). The architecture is not yet in any stable release.
- Hardware (BF16): 1x H200 (141 GB) or 2x H100/H800 (80 GB)
- Hardware (FP8): single H100/H200
- Trust remote code: required (custom modeling files ship in the repo)
Install vLLM (nightly)
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
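After installing, it is worth confirming that your nightly actually registers the new architecture before attempting a full model load. ModelRegistry is part of vLLM's public API; a quick check, assuming your wheel postdates PR #42705:

from vllm import ModelRegistry

# The architecture string comes from the model's config.json. If it is
# missing here, the installed nightly predates PR #42705.
archs = ModelRegistry.get_supported_archs()
print("InternS2PreviewForConditionalGeneration" in archs)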
Launch commands
Recommended — with MTP speculative decoding
vllm serve internlm/Intern-S2-Preview \
--trust-remote-code \
--tensor-parallel-size 2 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--speculative-config '{"method":"mtp","num_speculative_tokens":4}'
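With the server up, the Prometheus endpoint at /metrics is the quickest way to confirm MTP is drafting tokens and to watch acceptance counts. Metric names vary across vLLM versions, so this sketch just filters for anything mentioning spec_decode:

import urllib.request

# vLLM serves Prometheus metrics on the same port as the API.
with urllib.request.urlopen("http://localhost:8000/metrics") as r:
    body = r.read().decode()

# Match loosely, since exact metric names differ between vLLM versions.
for line in body.splitlines():
    if "spec_decode" in line and not line.startswith("#"):
        print(line)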
Basic serving (no MTP)
vllm serve internlm/Intern-S2-Preview \
--trust-remote-code \
--tensor-parallel-size 2 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
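Either launch can be smoke-tested with the standard OpenAI client; listing the served models confirms the server is up and shows the exact model id to pass in requests:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# The served model id defaults to the repo path given to `vllm serve`.
for m in client.models.list().data:
    print(m.id)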
Long-context (YaRN, up to 512K)
The base config sets max_position_embeddings = 262144. For longer contexts,
override the RoPE config to enable YaRN:
vllm serve internlm/Intern-S2-Preview \
--trust-remote-code \
--tensor-parallel-size 2 \
--max-model-len 512000 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}'
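The same override works offline through the LLM class, which accepts the dict directly via hf_overrides. A sketch mirroring the serve command above; the rope_parameters payload is copied verbatim and assumed correct for this checkpoint:

from vllm import LLM, SamplingParams

# Mirrors the serve command above; the rope_parameters payload is taken
# verbatim from the --hf-overrides example.
llm = LLM(
    model="internlm/Intern-S2-Preview",
    trust_remote_code=True,
    tensor_parallel_size=2,
    max_model_len=512000,
    hf_overrides={
        "text_config": {
            "rope_parameters": {
                "mrope_interleaved": True,
                "mrope_section": [11, 11, 10],
                "rope_type": "yarn",
                "rope_theta": 10000000,
                "partial_rotary_factor": 0.25,
                "factor": 4.0,
                "original_max_position_embeddings": 262144,
            }
        }
    },
)
out = llm.generate(["Summarize the attached paper."], SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)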
Client Usage
Recommended sampling parameters from the model card:
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
model="internlm/Intern-S2-Preview",
messages=[{"role": "user", "content": "Design a synthesis route for paracetamol."}],
temperature=0.8,
top_p=0.95,
max_tokens=32768,
extra_body={
"top_k": 50,
"min_p": 0.0,
"spaces_between_special_tokens": False,
},
)
print(resp.choices[0].message.content)
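Because the server runs with --reasoning-parser qwen3, the thinking trace is separated from the answer into a reasoning_content field on the message. The getattr guard covers client or server versions that do not surface it:

# The qwen3 reasoning parser moves the thinking trace out of `content`
# into `reasoning_content`; guard in case the field is absent.
msg = resp.choices[0].message
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)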
Toggle thinking mode
Thinking is enabled by default. Disable it per request:
resp = client.chat.completions.create(
model="internlm/Intern-S2-Preview",
messages=[{"role": "user", "content": "What is AGI?"}],
temperature=0.8,
top_p=0.95,
extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
The model card notes: do not disable thinking mode for agentic tasks.
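For those agentic tasks, the launch flags above (--enable-auto-tool-choice with the qwen3_coder parser) let the server emit structured tool calls from standard OpenAI-style tool definitions. The get_compound_mass tool below is a hypothetical example, not something from the model card:

# Hypothetical tool schema for illustration; any OpenAI-style function works.
tools = [{
    "type": "function",
    "function": {
        "name": "get_compound_mass",
        "description": "Return the molar mass of a compound in g/mol.",
        "parameters": {
            "type": "object",
            "properties": {"formula": {"type": "string"}},
            "required": ["formula"],
        },
    },
}]

resp = client.chat.completions.create(
    model="internlm/Intern-S2-Preview",
    messages=[{"role": "user", "content": "What is the molar mass of caffeine?"}],
    tools=tools,
)

# With --enable-auto-tool-choice, tool invocations arrive as structured calls.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)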
Troubleshooting
- Unknown architecture InternS2PreviewForConditionalGeneration: support landed in vLLM PR #42705 and has not reached a stable release; use a nightly wheel that includes it, or build from main.
- OOM at full 262K context: drop --max-model-len to 65536 or 131072, or lower --gpu-memory-utilization to leave more headroom.
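If you hit the OOM case in offline inference rather than through the server, the same two knobs exist on the LLM constructor; the values below are starting points to tune, not recommendations from the model card:

from vllm import LLM

# A shorter context shrinks the KV cache; a lower utilization target leaves
# headroom for activation spikes during long-context prefill.
llm = LLM(
    model="internlm/Intern-S2-Preview",
    trust_remote_code=True,
    tensor_parallel_size=2,
    max_model_len=131072,
    gpu_memory_utilization=0.85,
)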