meta-llama/Llama-3.1-8B-Instruct
Meta's Llama 3.1 8B dense instruction-tuned language model with 128K context
Overview
Llama 3.1 Instruct is Meta's instruction-tuned language model family. The 8B dense variant is small enough for single-GPU deployment and supports a 128K-token context window. A 70B variant is also available (see related recipes).
TPU support is provided through vLLM TPU with a recipe for Trillium.
Prerequisites
- Hardware: 1x GPU with >=16 GB VRAM (e.g. A100, L40S, H100, H200)
- vLLM >= 0.6.0 (>= 0.9.0 for EAGLE3 speculative decoding)
- CUDA Driver compatible with your vLLM version
- Docker with NVIDIA Container Toolkit (recommended)
Install vLLM
uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend auto
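After installing, a short Python check can confirm that the environment meets the version floor listed under Prerequisites. This is a minimal sketch: the helper names (version_tuple, meets_minimum, check_vllm) are illustrative and not part of vLLM, and the comparison deliberately ignores pre-release suffixes.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def version_tuple(v: str) -> tuple:
    """Turn '0.9.1' or '0.9.1.dev3+gabc' into a tuple of leading integers.

    Rough comparison only: pre-release tags such as 'rc1' or 'dev3' are ignored.
    """
    parts = []
    for piece in v.split("+")[0].split("."):
        m = re.match(r"\d+", piece)
        if m is None:
            break  # stop at the first non-numeric component, e.g. 'dev3'
        parts.append(int(m.group()))
    while len(parts) < 3:
        parts.append(0)  # pad so that '0.10' compares equal to '0.10.0'
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Numeric comparison, so '0.10.0' correctly ranks above '0.9.0'."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_vllm(minimum: str = "0.6.0") -> str:
    """Report whether the installed vllm package satisfies the given minimum."""
    try:
        installed = version("vllm")
    except PackageNotFoundError:
        return "vllm is not installed in this environment"
    status = "OK" if meets_minimum(installed, minimum) else "too old"
    return f"vllm {installed}: {status} (minimum {minimum})"
```

Calling print(check_vllm("0.9.0")) applies the stricter floor needed for EAGLE3 speculative decoding.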
Docker (Cloud TPU — Trillium)
TPU uses the separate vllm/vllm-tpu image; there is no pip wheel. Pull the tag specified by the upstream Trillium recipe (the command below uses latest, so substitute that tag), then run:
docker run -itd --name llama31-tpu \
--privileged --network host --shm-size 16G \
-v /dev/shm:/dev/shm -e HF_TOKEN=$HF_TOKEN \
vllm/vllm-tpu:latest \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 1 \
--max-model-len 2048 \
--host 0.0.0.0 --port 8000
The 8B model fits on a single chip (v6e-1, TP=1), unlike the 70B, which needs a v6e-8 slice.
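Because the container starts detached, the server may still be loading weights when the command returns. One way to wait for it, sketched here under the assumption that the server listens on localhost:8000 as in the command above, is to poll vLLM's /health endpoint. The function name wait_until_ready and its default timeouts are illustrative choices.

```python
import time
import urllib.request


def wait_until_ready(base_url: str = "http://localhost:8000",
                     timeout_s: float = 600.0,
                     interval_s: float = 5.0) -> bool:
    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout, or HTTP error: server is not ready yet.
            pass
        time.sleep(interval_s)
    return False
```

A deployment script could call wait_until_ready() and only send traffic once it returns True.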
Client Usage
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)
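For interactive use, the same endpoint can stream tokens as they are generated. The sketch below assumes an OpenAI client configured as in the example above; the helper name stream_chat and the max_tokens value of 256 are illustrative choices, not part of the recipe.

```python
def stream_chat(client, prompt: str,
                model: str = "meta-llama/Llama-3.1-8B-Instruct") -> str:
    """Stream a chat completion, print text as it arrives, and return the full reply."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,      # ask the server for incremental chunks
        max_tokens=256,   # illustrative cap on the reply length
    )
    pieces = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices (e.g. usage-only chunks)
        delta = chunk.choices[0].delta.content
        if delta:  # content can be None on role-only or final chunks
            print(delta, end="", flush=True)
            pieces.append(delta)
    print()
    return "".join(pieces)
```

Usage, assuming the server is running locally: stream_chat(OpenAI(base_url="http://localhost:8000/v1", api_key="unused"), "Hello, how are you?").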
Speculative Decoding (EAGLE3)
An EAGLE3 draft head is available at RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3.
To enable it, pass --speculative-config to vllm serve:
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--speculative-config '{"model":"RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3","method":"eagle3","num_speculative_tokens":3}'
EAGLE3 verification requires vLLM >= 0.9.0, which is newer than the 0.6.0 minimum listed under Prerequisites. Setting num_speculative_tokens to 3 is a reasonable starting point for chat workloads.
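Hand-writing JSON inside shell quotes is easy to get wrong when experimenting with different values of num_speculative_tokens. As a sketch, the command line above can be generated from a Python dict so that serialization and quoting are handled by the standard library. The function name build_serve_command is hypothetical; the keys and model names are the ones shown in the command above.

```python
import json
import shlex


def build_serve_command(num_speculative_tokens: int = 3) -> str:
    """Return a shell-safe `vllm serve` command with an EAGLE3 speculative config."""
    spec = {
        "model": "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3",
        "method": "eagle3",
        "num_speculative_tokens": num_speculative_tokens,
    }
    args = [
        "vllm", "serve", "meta-llama/Llama-3.1-8B-Instruct",
        # Compact separators keep the JSON on one line without extra spaces.
        "--speculative-config", json.dumps(spec, separators=(",", ":")),
    ]
    # shlex.join quotes the JSON argument so the shell passes it through intact.
    return shlex.join(args)


if __name__ == "__main__":
    print(build_serve_command(3))
```

Printing the result for a few values (for example 2, 3, and 5) gives ready-to-paste commands for comparing acceptance rate and latency.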
Troubleshooting
OOM on small GPUs: Lower --max-model-len (shorter contexts need less KV cache) or --gpu-memory-utilization (leaves more headroom outside the memory vLLM reserves).
EAGLE3 draft head not loading: Upgrade vLLM to >= 0.9.0; earlier releases don't support the EAGLE3 verification path.
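To see why lowering --max-model-len helps with the OOM case, it can be useful to estimate the KV cache needed for one full-length sequence. The sketch below is a back-of-the-envelope calculation, not vLLM's actual allocator: it assumes the published Llama 3.1 8B architecture (32 layers, 8 KV heads, head dimension 128), a 2-byte (BF16) cache dtype, and ignores block rounding and other overhead. The function names are illustrative.

```python
def kv_cache_bytes_per_token(num_layers: int = 32,
                             num_kv_heads: int = 8,
                             head_dim: int = 128,
                             dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(max_model_len: int) -> float:
    """Approximate KV cache (GiB) for a single sequence of max_model_len tokens."""
    return max_model_len * kv_cache_bytes_per_token() / 2**30


if __name__ == "__main__":
    for n in (2048, 8192, 131072):
        print(f"{n} tokens -> {kv_cache_gib(n):.2f} GiB")
```

Under these assumptions a single sequence needs about 0.25 GiB at 2048 tokens, 1 GiB at 8192 tokens, and 16 GiB at the full 131072-token context, on top of the model weights.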